How good can you be at Codenames without knowing any words? Sun, 11 Aug 2024 00:00:00 +0000 <p>About eight years ago, I was playing a game of <a href="https://amzn.to/4cgpzow">Codenames</a> where the game state was such that our team would almost certainly lose if we didn't correctly guess all of our remaining words on our turn. From the given clue, we were unable to do this. Although the game is meant to be a word guessing game based on word clues, a teammate suggested that, based on the physical layout of the words that had been selected, most of the possibilities we were considering would result in patterns that were &quot;too weird&quot; and that we should pick the final word based on the location. This worked and we won.</p> <p><details closed> <summary>[Click to expand explanation of Codenames if you're not familiar with the game]</summary> Codenames is played in two teams. The game has a 5x5 grid of words, where each word is secretly owned by one of {blue team, red team, neutral, assassin}. Each team has a &quot;spymaster&quot; who knows the secret word &lt;-&gt; ownership mapping. The spymaster's job is to give single-word clues that allow their teammates to guess which words belong to their team without accidentally guessing words of the opposing team or the assassin. On each turn, the spymaster gives a clue and their teammates guess which words are associated with the clue. The game continues until one team's words have all been guessed or the assassin's word is guessed (immediate loss). There are some details that are omitted here for simplicity, but for the purposes of this post, this explanation should be close enough. If you want more of an explanation, <a href="https://www.youtube.com/watch?v=J8RWBooJivg">you can try this video</a>, or <a href="https://czechgames.com/files/rules/codenames-rules-en.pdf">the official rules</a>. </details></p> <p>Ever since then, I've wondered how good someone would be if all they did was memorize all 40 setup cards that come with the game. To simulate this, we'll build a bot that plays using only position information (you might also call this an AI, but since we'll discuss using an LLM/AI to write this bot, we'll use the term bot to refer to the automated Codenames-playing agent to make it easy to disambiguate).</p> <p>At the time, after the winning guess, we looked through the configuration cards to see if our teammate's idea of guessing based on shape was correct, and it was — they correctly determined the highest probability guess based on the possible physical configurations. Each card layout defines which words are your team's and which words belong to the other team and, presumably to limit the cost, the game only comes with 40 cards (160 configurations under rotation). Our teammate hadn't memorized the cards (which would've narrowed things down to only one possible configuration), but they'd played enough games to develop an intuition about what patterns/clusters might be common and uncommon, enabling them to come up with this side-channel attack against the game. 
For example, after playing enough games, you might realize that there's no card where a team has 5 words in a row or column, or that only the start player color ever has 4 in a row, and if this happens on an edge and it's blue, the 5th word must belong to the red team, or that there's no configuration with six connected blue words (and there is one with red, one with 2 in a row centered next to 4 in a row). Even if you don't consciously use this information, you'll probably develop a subconscious aversion to certain patterns that feel &quot;too weird&quot;.</p> <p>Coming back to the idea of building a bot that simulates someone who's spent a few days memorizing the 40 cards, below, there's a simple bot you can play against that simulates a team of such players. Normally, when playing, you'd provide clues and the team would guess words. But, in order to give you, the human, the largest possible advantage, we'll make the unrealistic assumption that you can, on demand, generate a clue that will get your team to select the exact squares that you'd like, which is simulated by letting you click on any tile that you'd like your team to guess.</p> <p>By default, you also get <abbr title="fewer if you decide to have your team guess incorrectly">three guesses a turn</abbr>, which would put you well above <a href="p95-skill/">99%-ile</a> among Codenames players I've seen. While good players can often get three or more correct moves a turn, averaging three correct moves and zero incorrect moves a turn would be unusually good in most groups. You can toggle the display of remaining matching boards on, but if you want to simulate what it's like to be a human player who hasn't memorized every board, you might want to try playing a few games with the display off.</p> <p>If, at any point, you finish a turn and it's the bot's turn and there's only one matching board possible, the bot correctly guesses every one of its words and wins. The bot would be much stronger if it ever guessed words before it could guess them all, either naively or to strategically reduce the search space, or if it even had a simple heuristic where it would randomly guess among the possible boards if it could deduce that you'd win on your next turn, but even this most naive &quot;board memorization&quot; bot possible has been able to beat every Codenames player I handed this to in most games where they didn't toggle the remaining matching boards on and use the same knowledge the bot has access to.</p> <p><div id="outer-container"><abbr title="Aside to Dan: the JS is loaded from a separate file due to a parsing issue with the old version of Hugo that's currently being used. Updating to a version where this is reasonably fixed would require 20+ fixes for this blog due to breaking changes Hugo has made over time. If you see this and Hugo is not in use anymore, you can probably inline the javascript here.">JS for the Codenames bot failed to load!</abbr></div><div id="game-container"></div> <script src="codenames.js"></script></p> <p>This very silly bot that doesn't guess until it can guess everything is much stronger than most Codenames players<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">1</a></sup>. In practice, any team with someone who decides to sit down and memorize the contents of the 40 initial state cards that come in the box will beat the other team in basically every game.</p>
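<p>To make this concrete, here's a minimal sketch of the filtering logic such a bot relies on. This isn't the actual codenames.js; the names (allConfigurations, Reveal, and so on) are made up for illustration, and a real implementation also has to track whose turn it is and which color the bot is playing:</p> <pre><code>// A minimal sketch of the "board memorization" idea, not the actual codenames.js.
// Keep every key-card configuration that's consistent with what's been revealed
// so far, and only guess once a single configuration remains.
type Owner = "blue" | "red" | "neutral" | "assassin";
type Board = Owner[]; // 25 entries, positions 0..24 in reading order

// 40 cards x 4 rotations = 160 candidate configurations; in a real bot these
// would be transcribed from the physical key cards.
declare const allConfigurations: Board[];

type Reveal = { position: number; owner: Owner };

function consistentWith(board: Board, reveals: Reveal[]): boolean {
  return reveals.every((r) => board[r.position] === r.owner);
}

function remainingBoards(reveals: Reveal[]): Board[] {
  return allConfigurations.filter((board) => consistentWith(board, reveals));
}

// The bot's entire "strategy": once the revealed tiles pin down a single
// configuration, it knows every one of its remaining words and wins on the spot.
function botKnowsTheBoard(reveals: Reveal[]): boolean {
  return remainingBoards(reveals).length === 1;
}</code></pre> <p>The stronger variants mentioned above would presumably build on the same remainingBoards set, e.g., by making speculative guesses chosen to shrink it as quickly as possible.</p>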
<p>Now that my curiosity about this question is satisfied, I think this is a minor issue and not really a problem for the game because word guessing games are generally not meant to be taken seriously and most of them end up being somewhat broken if people take them seriously or even if people just play them a lot and aren't trying to break the game. Relative to other word guessing games, and especially relative to popular ones, Codenames has a lot more replayability before players will start using side channel attacks, subconsciously or otherwise.</p> <p>What happens with games with a limited set of words, like Just One or Taboo, is that people end up accidentally memorizing the words and word associations for &quot;tricky&quot; words after a handful of plays. Codenames mitigates this issue by effectively requiring people to memorize a combinatorially large set of word associations instead of just a linear number of word associations. There's this issue we just discussed, which came up when we were twenty-ish games into playing Codenames and is likely to happen on a subconscious level even if people don't realize that board shapes are influencing their play, but this is relatively subtle compared to the issues that come up in other word guessing games. And, if anyone really cares about this issue, they can use a digital randomizer to set up their boards, although I've never personally played Codenames in a group that's serious enough about the game for anyone to care to do this.</p> <p><i> Thanks to Josh Bleecher Snyder, David Turner, Won Chun, Laurence Tratt, Heath Borders, Spencer Griffin, and Yossi Kreinin for comments/corrections/discussion.</i></p> <h3 id="appendix-writing-the-code-for-the-post">Appendix: writing the code for the post</h3> <p>I tried using two different AI assistants to write the code for this post, <a href="https://storytell.ai/">Storytell</a> and <a href="https://www.cursor.com/">Cursor</a>. I didn't use them the way a programmer would; instead, I used them the way a non-programmer would use them to write a program. Overall, I find AI assistants to be amazingly good at some tasks while being hilariously bad at other tasks. That was the case here as well.</p> <p>I basically asked them to write code and then ran it to see if it worked and would then tell the assistant what was wrong and have it re-write the code until it looked like it basically worked. Even using the assistants in this very naive way, where I deliberately avoided understanding the code and was only looking to get output that worked, I don't think it took too much longer to get working code than it would've taken if I just coded up the entire thing by hand with no assistance. I'm going to guess that it took about twice as long, but programmer estimates are notoriously inaccurate and for all I know it was a comparable amount of time. I have much less confidence that the code is correct and I'd probably have to take quite a bit more time to be as confident as I'd be if I'd written the code, but I still find it fairly impressive that you can just prompt these AI assistants and get code that basically works in not all that much more time than it would take a programmer to write the code. 
These tools are certainly much cheaper than hiring a programmer and, if you're using one of these tools as a programmer and not as a naive prompter, you'd get something working much more quickly because you can simply fix the bugs in one of the mostly correct versions instead of spending most of your time tweaking what you're asking for to get the AI to eliminate a bug that would be trivial for any programmer to debug and fix.</p> <p>I've seen a lot of programmers talk about how &quot;AI&quot; will never be able to replace programmers with reasons like &quot;to specify a program in enough detail that it does what you want, you're doing programming&quot;. If the user had to correctly specify how the program works up front, that would be a fairly strong criticism, but when the user can iterate, like we did here, this is a much weaker criticism. The user doesn't need to be a programmer to observe that an output is incorrect, at which point the user can ask the AI to correct the output, repeating this process until the output seems correct enough. The more a piece of software has strict performance or correctness constraints, the less well this kind of naive iteration works. Luckily for people wanting to use LLMs to generate code, most software that's in production today has fairly weak performance and correctness constraints. <a href="everything-is-broken/">People basically just accept that software has a ton of bugs and that it's normal to run into hundreds or thousands of software bugs in any given week</a> and that <a href="slow-device/">widely used software is frequently 100000x slower than it could be if it were highly optimized</a>.</p> <p>A moderately close analogy is the debate over whether or not AI could ever displace humans in support roles. Even as this was already happening, people would claim that this could never happen because AI makes bad mistakes that humans don't make. <a href="customer-service/">But as we previously noted, humans frequently make the same mistakes</a>. Moreover, even if AI support is much worse, as long as the price:performance ratio is good enough, a lot of companies will choose the worse, but cheaper, option. <a href="diseconomies-scale/">Tech companies have famously done this for consumer support of all kinds</a>, but we commonly see this for all sorts of companies, e.g., when you call support for any large company or even lots of local small businesses, it's fairly standard to get pushed into a phone tree or some kind of bad automated voice recognition that's a phone tree replacement. These are generally significantly worse than a minimum wage employee, but the cost is multiple orders of magnitude lower than having a minimum wage employee pick up every call and route you to the right department, so companies have chosen the phone tree.</p> <p>The relevant question isn't &quot;when will AI allow laypeople to create better software than programmers?&quot; but &quot;when will AI allow laypeople to create software that's as good as phone trees and crappy voice recognition are for customer support?&quot;. And, realistically, the software doesn't even have to be that good because programmers are more expensive than minimum wage support folks, but you can get access to these tools for $20/mo. I don't know how long it will be before AI can replace a competent programmer, but if the minimum bar is to be as good at programming as automated phone tree systems are at routing my calls, I think we should get there soon if we're not already there. 
And, as with customer support, this doesn't have to be zero sum. Not all of the money that's saved from phone trees is turned into profit — some goes into hiring support people who handle other tasks.</p> <p>BTW, one thing that I thought was a bit funny about my experience was that both platforms I tried, Storytell and Cursor, would frequently generate an incorrect result that could've been automatically checked, which they would then fix when I pointed out that the result was incorrect. Here's a typical sequence of interactions with one of these platforms:</p> <ul> <li>Me: please do X</li> <li>AI: [generates some typescript code and tests which fail to typecheck]</li> <li>Me: this code doesn't typecheck, can you fix this?</li> <li>AI: [generates some code and tests which fail when the tests are executed]</li> <li>Me: the tests fail with [copy+paste test failure] when run</li> <li>AI: [generates some code and tests which pass and also seem to work on some basic additional tests]</li> </ul> <p>Another funny interaction was that I'd get in a loop where there were a few different bugs and asking the AI to fix one would reintroduce the other bugs even when specifically asking the AI to not reintroduce those other bugs. Compared to anyone who's using these kinds of tools day in and day out, I have very little experience with them (I just mess with them occasionally to see how much they've progressed) and I'd expect someone with more prompting experience to be able to specify prompts that break out of these sorts of loops more quickly than I was able to.</p> <p>But, even so, it would be a nicer experience if one of these environments had access to an execution environment so it could actually automatically fix these kinds of issues (when they're fixable) and could tell the user that the output is known to be wrong when a bit of naive re-prompting with &quot;that was wrong and caused XYZ, please fix&quot; doesn't fix the issue.</p>
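<p>As a rough sketch of what that kind of mundane loop could look like (nothing here is a real API; generateCode stands in for whatever model you're calling and runChecks stands in for a sandbox that typechecks the code and runs the tests):</p> <pre><code>// Hypothetical sketch only: generateCode and runChecks are stand-ins, not real APIs.
async function generateCode(prompt: string) {
  return ""; // call the model of your choice here
}

async function runChecks(code: string) {
  return { ok: false, errors: "" }; // typecheck and run the test suite in a sandbox here
}

// The mundane part: run the checks the user would otherwise run by hand and
// feed failures straight back into the prompt, instead of making the user
// copy and paste error messages.
async function generateUntilChecksPass(task: string, maxAttempts: number) {
  let prompt = task;
  for (let attempt = 0; maxAttempts > attempt; attempt += 1) {
    const code = await generateCode(prompt);
    const result = await runChecks(code);
    if (result.ok) {
      return code;
    }
    prompt = task + "\n\nThe previous attempt failed with:\n" + result.errors + "\nPlease fix that without reintroducing earlier bugs.";
  }
  return null; // give up and escalate to a human after maxAttempts
}</code></pre> <p>In the interactions above, I was effectively playing the role of runChecks by hand.</p>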
<p>I asked Josh Bleecher Snyder, who's much more familiar with this space than I am (both technically as well as on the product side), why none of these tools do that and why almost none of the companies do training or fine tuning with such an environment, and his response was that almost everyone working in the space has bought into <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a> and isn't working on these sorts of mundane improvements. The idea is that the kind of boring engineering work that would be necessary to set up an environment like the above will be obsoleted by some kind of fundamental advancement, so it's a waste of time to work on these kinds of things that give you incremental gains. Sam Altman has even advised founders of companies that are relying on OpenAI APIs to assume that there will be huge improvements and build companies that assume this because the companies that don't will get put out of business by the massive improvements that are coming soon. From discussions with founders and VCs in this space, almost everyone has taken this to heart.</p> <p>I haven't done any serious ML-related work for 11 years, so my opinion is worth about as much as any other layperson's, but if someone had made the contrarian bet on such a mundane system in the GPT-3 days, it seems like it would've been useful then and would still be useful with today's models, both for training/fine-tuning work as well as for generating better output for the user. But I guess the relevant question is, would it make sense to try to build such a mundane system today, which would be, for someone working in the space, a contrarian bet against progress? The big AI labs supposedly have a bunch of low-paid overseas contractors who label things, but if you want to label programming examples, per label, an environment that produces the canonical correct result is going to be cheaper than paying someone to try to label it unless you only want a tiny number of labels. At the level of a $1T or even $50B company, it seems like it should make sense to make the bet as a kind of portfolio move. If I want to start a startup and make a bet, then would it make sense? Maybe it's less obvious if you're putting all your eggs in one basket, but even then, perhaps there's a good case for it because almost the entire field is betting on something else? If the contrarian side is right, there's very little competition, which seems somewhat analogous to <a href="talent/">our previous discussion on contrarian hiring</a>.</p> <h3 id="appendix-the-spirit-of-the-game-vs-playing-to-win">Appendix: the spirit of the game vs. playing to win</h3> <p>Personally, when I run into a side-channel attack in a game or a game that's just totally busted if played to win, like Perfect Words, I think it makes sense to try to avoid &quot;attacking&quot; the game to the extent possible. I think this is sort of impossible to do perfectly in Codenames because people will form subconscious associations (I've noticed people guessing an extra word on the first turn just to mess around, which works more often than not — assuming they're not cheating, and I believe they're not cheating, the success rate strongly suggests the use of some kind of side-channel information. 
That doesn't necessarily have to be positional information from the cards; it could be something as simple as subconsciously noticing what the spymasters are intently looking at).</p> <p><a href="https://mastodon.social/@danluu/110544419353766175">Dave Sirlin says anyone who doesn't take advantage of any legal possibility to win is a sucker (he derogatorily calls such people &quot;scrubs&quot;)</a> (he says that you should use cheats to win, like using maphacks in FPS games, as long as tournament organizers don't ban the practice, and that tournaments should explicitly list what's banned, avoiding generic &quot;don't do bad stuff&quot; rules). I think people should play games however they find it fun and should find a group that likes playing games in the same way. If Dave finds it fun to memorize arbitrary info to win all of these games, he should do that. The reason I play like a scrub (as Dave Sirlin would put it) for the kinds of games discussed here is that the games are generally badly broken if played seriously and I don't personally find the ways in which they're broken to be fun. In some cases, like Perfect Words, the game is trivially broken and I find it boring to win a game that's trivially broken. In other cases, like Codenames, the game could be broken by spending a few hours memorizing some arbitrary information. To me, spending a few hours memorizing the 40 possible Codenames cards seems like an unfun and unproductive use of time, making it a completely pointless activity.</p> <h3 id="appendix-word-games-you-might-like">Appendix: word games you might like</h3> <p>If you like word guessing games, here are some possible recommendations in the same vein as this <a href="programming-books/">list of programming book recommendations</a> and <a href="programming-blogs/">this list of programming blog recommendations</a>, where the goal is to point out properties of things that people tend to like and dislike (as opposed to most reviews I see, which tend to be about whether or not something is &quot;good&quot; or &quot;bad&quot;). To limit the length of this list, this only contains word guessing games, which tend to be about the meaning of words, and doesn't include games that are about the mechanics of manipulating words rather than the meaning, such as <a href="https://amzn.to/3WCbrAc">Bananagrams</a>, <a href="https://amzn.to/4dbaQwA">Scrabble</a>, or <a href="https://en.wikipedia.org/wiki/Anagrams_(game)">Anagrams</a>, or games that are about the mapping between visual representations and words, such as <a href="https://amzn.to/4ddQqmy">Dixit</a> or <a href="https://amzn.to/3WXGOH0">Codenames: Pictures</a>.</p> <p>Also, for reasons of space, I won't discuss reasons people dislike games that apply to all or nearly all games in this list. For example, someone might dislike a game because it's a word game, but there's little point in noting this for every game. 
Similarly, many people choose games based on &quot;weight&quot; and dislike almost all word games because they feel &quot;light&quot; instead of &quot;heavy&quot;, but all of these games are considered fairly light, so there's no point in discussing this (but if you want a word game that's light and intense, in the list below, you might consider Montage or Decrypto, and among games not discussed in detail, Scrabble or Anagrams, the latter of which is the most brutal word game I've ever played by a very large margin).</p> <h4 id="taboo-https-amzn-to-3sf1aij"><a href="https://amzn.to/3SF1AIJ">Taboo</a></h4> <p>A word guessing game where you need to rapidly give clues to get your teammates to guess what word you have, where each word also comes with a list of 5 stop words you're not allowed to say while clueing the word.</p> <p>A fun, light game, with two issues that give it low replayability:</p> <ul> <li>Since each word is clued in a fully independent way, once your game group has run through the deck once or twice and everyone knows every word, the game becomes extremely easy; in the group I first played this in, I think this happened after we played it twice</li> <li>Even before that happens, when people realize that you can clue any word fairly easily by describing it in a slightly roundabout way, the game becomes fairly rote even before you accidentally remember the words just from playing too much</li> </ul> <p>When people dislike this game, it's often because they don't like how much time pressure there is in this rapid-fire game.</p> <h4 id="just-one-https-amzn-to-4dbgm8s"><a href="https://amzn.to/4dbgM8S">Just One</a></h4> <p>A word guessing game that's a bit like Taboo, in that you need to get your team to guess a word, but instead of having a static list of stop words for each word you want to clue, the stop words are dynamically generated by your team (everyone writes down one clue, and any clue that's been given more than once is stricken).</p> <p>That stop words are generated via interaction with your teammates gives this game much more replayability than Taboo. However, the limited word list ultimately runs into the same problem and my game group would recognize the words and have a good way to give clues for almost every word after maybe 20-40 plays.</p> <p>A quirk of the rules as written is that the game is really made for 5+ players and becomes very easy if you play with 4, but there's no reason you couldn't use the 5+ player rules when you have 4 players.</p> <p>A common complaint about this game is that the physical components are cheap and low quality considering the cost of the game ($30 MSRP vs. $20 for Codenames). Another complaint is that the words have wildly varying difficulties, some seemingly by accident. 
For example, the word &quot;grotto&quot; is included and quite hard to clue if someone hasn't seen it, seemingly because the game was developed in French, where grotto would be fairly straightforward to clue.</p> <h4 id="perfect-words-https-www-amazon-co-uk-tiki-editions-perfect-words-intergenerational-dp-b0chn8xp1f"><a href="https://www.amazon.co.uk/TIKI-Editions-Perfect-Words-Intergenerational/dp/B0CHN8XP1F">Perfect Words</a></h4> <p>A word guessing game where the team cooperatively constructs clues, with the goal of getting the entire team to agree on the word (which can be any arbitrary word as long as people agree) from each set of clues.</p> <p>The core game, trying to come up with a set of words that will generate agreement on what word they represent, makes for a nice complement to a game that's sort of the opposite, like Just One, but the rules as implemented seem badly flawed. It's as if the game designers don't play games and didn't have people who play games playtest it. The game is fairly trivial to break on your first or second play and you have to deliberately play the &quot;gamey&quot; part of the game badly to make the game interesting.</p> <h4 id="montage-https-amzn-to-46gkfcv"><a href="https://amzn.to/46GkfcV">Montage</a></h4> <p>A 2 on 2 word game (although you can play Codenames style if you want more players). On each team, players alternate fixed time periods of giving clues and guessing words. The current board state has some constraints on what letters must appear in certain positions of the word. The cluer needs to generate a clue which will get the guesser to guess their word, which has to fit within the constraints, but the clue can't be too obvious because if both opponents guess the word before the cluer's partner, the opponents win the word.</p> <p>Perhaps the hardest game on this list? Most new players I've seen fail to come up with a valid clue during their turn on their first attempt (a good player can probably clue at least 5 things successfully per turn, if their partner is able to catch the reasoning faster than the opponents). This is probably also the game that rewards having a large vocabulary the most of all the games on this list. It's also the only game on this list where the skill of being able to think about the letter composition of words is useful, a la Scrabble.</p> <p>As long as you're not playing with a regular partner and relying on &quot;secret&quot; agreements or shared knowledge, the direct adversarial nature of guessing gives this game very high replayability, at least as high as anything else on this list.</p> <p>Like Perfect Words, the core word game is fun if you're into that kind of thing, but the rules of the game that's designed around the core game don't seem to have been very well thought through and can easily be gamed. It's not as bad here as in Perfect Words, but you still have to avoid trying to win to make this game really work.</p> <p>When I've seen people dislike this game, it's usually because they find the game too hard, or they don't like losing — a small difference in skill results in a larger difference in outcomes than we see in other games in this list, so a new player should expect to lose very badly unless their opponents handicap themselves (which isn't built into the rules) or they have a facility for word games from having played other games. 
I don't play a lot of word games and I especially don't play a lot of &quot;serious&quot; word games like Scrabble or Anagrams, so I generally get shellacked when I play this, which is part of the appeal for me, but that's exactly what a lot of people don't like about the game.</p> <h4 id="word-blur">Word Blur</h4> <p>A word guessing game where the constraint is that you need to form clues from the 900 little word tiles that are spread on the table in front of you.</p> <p>I've only played it a few times because I don't know anyone local who's managed to snag a copy, but it seemed like it has at least as much replayability as any game on this list. The big downside of this game is that it's been out of print for over a decade and it's famously hard to get ahold of a copy, although it seems like it shouldn't be too difficult to make a clone.</p> <p>When people dislike this game, it often seems to be because they dislike the core gameplay mechanic of looking at a bunch of word tiles and using them to make a description, which some people find overwhelming.</p> <p>People who find Word Blur too much can try the knockoff, Word Slam, which is both easier and easier to get ahold of since it's not as much of a cult hit (though it also appears to be out of print). Word Slam only has 105 words and the words are sorted, which makes it feel much less chaotic.</p> <h4 id="codenames-https-amzn-to-4fvmju7"><a href="https://amzn.to/4fvMJu7">Codenames</a></h4> <p>Not much to add beyond what's in the post, except for common reasons that people don't like the game.</p> <p>A loud person can take over the game on each team, more so than in any other game on this list (except for Codenames: Duet). And although the game comes with a timer, it's rarely used (and the rules basically imply that you shouldn't use the timer), so another common complaint is that the game drags on forever when playing with people who take a long time to take turns, and unless you're the spymaster, there's not much useful to do when it's the other team's turn, causing the game to have long stretches of boring downtime.</p> <h4 id="codenames-duet-https-amzn-to-4chnv6j"><a href="https://amzn.to/4chnV6j">Codenames: Duet</a></h4> <p>Although this was designed to be the 2-player co-op version of Codenames, I've only ever played this with more than two players (usually 4-5), which works fine as long as you don't mind that discussions have to be done in a semi-secret way.</p> <p>In terms of replayability, Codenames: Duet sits in roughly the same space as Codenames, in that it has about the same pros and cons.</p> <h4 id="decrypto-https-amzn-to-3aopdaw"><a href="https://amzn.to/3AoPDAw">Decrypto</a></h4> <p>I'm not going to attempt to describe this game because every direct explanation I've seen someone attempt to give about the gameplay has failed to click with new players until they play a round or two. But, conceptually, each team rotates who gives a clue and the goal is to have people on your team correctly guess which clue maps to which word while having the opposing team fail to guess correctly. The guessing team has extra info in that they know what the words are, so it's easier for them to generate the correct mapping. 
However, the set of mappings generated by the guessing team is available to the &quot;decrypting&quot; team, so they might know that the mystery word was clued by &quot;Lincoln&quot; and &quot;milliner&quot;, from which they might infer that the word is &quot;hat&quot;, allowing them to correctly guess the mapping on the next clue.</p> <p>I haven't played this game enough to have an idea of how much replayability it has. It's possible it's very high and it's also possible that people figure out tricks to make it basically impossible for the &quot;decrypting&quot; team to figure out the mapping. One major downside that I've seen is that, when played with random groups of players, the game will frequently be decided by which team has the weakest player (this has happened every time I've seen this played by random groups), which is sort of the opposite of the problem that a lot of team and co-op games have, where the strongest player takes over the game. It's hard for a great player to make game-winning moves, but it's easy for a bad player to make game-losing moves, so when played with non-expert players, whichever team has the worst player will lose the game.</p> <h4 id="person-do-thing-https-persondothing-com"><a href="https://persondothing.com/">Person Do Thing</a></h4> <p>David Turner says:</p> <blockquote> <p>Person Do Thing is like Taboo, but instead of a list of forbidden words, there's a list of allowed words. Forty basic words are always allowed, and (if you want) there are three extra allowed words that are specific to each secret word. Like Taboo, the quizzer can respond to guesses -- but only using the allowed words. Because so few words are allowed, it requires a lot of creativity to give good clues .. worth playing a few times but their word list was tiny last time I checked.</p> <p>I suppose if a group played a lot they might develop a convention, e.g. &quot;like person but not think big&quot; for &quot;animal&quot;. I've heard of this happening in Concept: one group had a convention that red, white, blue, and place refers to a country with those flag colors, and that an additional modifier specifies which: water for UK, cold for Russia, food for France, and gun for USA. I think it would take a fair number of these conventions to make the game appreciably easier.</p> </blockquote> <h4 id="semantle-https-semantle-com"><a href="https://semantle.com/">Semantle</a></h4> <p>Like Wordle, but about the meaning of a word, according to word2vec. Originally designed as a solitaire game, it also works as a co-op game.</p>
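<p>To make &quot;according to word2vec&quot; concrete: each word maps to an embedding vector, and a guess is scored by how similar its vector is to the secret word's vector. A minimal sketch (illustrative only; the real game uses a pretrained word2vec model, and the scaling of the score here is just for readability):</p> <pre><code>// Illustrative sketch, not Semantle's code: score a guess by the cosine
// similarity between word2vec-style embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; a.length > i; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The embeddings would come from a pretrained word2vec model; this declaration
// is just a placeholder for illustration.
declare const embeddings: { [word: string]: number[] };

function scoreGuess(secret: string, guess: string): number {
  // Scaled up so that guessing the secret word itself scores 100.
  return 100 * cosineSimilarity(embeddings[secret], embeddings[guess]);
}</code></pre>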
<p>Although I'm sure there are people who love playing this game over and over again, I feel like the replayability is fairly low for most people (and almost no one I know ended up playing more than 40 games of this, so I think my feeling isn't uncommon). Once you play for a while and figure out how to guess words that quickly narrow down the search space, playing the game starts to feel a bit rote.</p> <p>Most people I've talked to who don't like this game didn't like it because they weren't able to build a mental model of what's happening, making the word similarity scores seem like random nonsense.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:H">If you find this mode too easy and you can accurately get your team to guess any three tiles you like every single time and have enough of an intuition of what patterns exist that you can usually avoid getting beaten by the AI, you can try the mode where the AI is allowed to guess one word a turn and will then win by guessing the rest of the words if the one word it correctly guesses is sufficient to narrow down the search space to a single possible board. In general, if you make three guesses, this narrows down the space enough that the AI can win with a single guess (in game terms, the AI would give an &quot;unlimited&quot; clue). <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> A discussion of discussions on AI bias Sun, 16 Jun 2024 00:00:00 +0000 <p>There've been regular viral stories about ML/AI bias with LLMs and generative AI for the past couple of years. One thing I find interesting about discussions of bias is how different the reaction is in the LLM and generative AI case when compared to &quot;classical&quot; bugs in cases where there's a clear bug. In particular, if you look at forums or other discussions with lay people, people frequently deny that a model which produces output that's sort of the opposite of what the user asked for is even a bug. For example, a year ago, an Asian MIT grad student asked Playground AI (PAI) to &quot;Give the girl from the original photo a professional linkedin profile photo&quot; and PAI converted her face to a white face with blue eyes.</p> <p>The top &quot;there's no bias&quot; response on the front-page reddit story, and one of the top overall comments, was</p> <blockquote> <p>Sure, now go to <a href="https://civitai.com/">the most popular Stable Diffusion model website</a> and look at the images on the front page.</p> <p>You'll see an absurd number of asian women (almost 50% of the non-anime models are represented by them) to the point where you'd assume being asian is a desired trait.</p> <p>How is that less relevant that &quot;one woman typed a dumb prompt into a website and they generated a white woman&quot;?</p> <p>Also keep in mind that she typed &quot;Linkedin&quot;, so anyone familiar with how prompts currently work know it's more likely that the AI searched for the average linkedin woman, not what it thinks is a professional women because image AI doesn't have an opinion.</p> <p>In short, this is just an AI ragebait article.</p> </blockquote> <p>Other highly-ranked comments with the same theme include</p> <blockquote> <p>Honestly this should be higher up. If you want to use SD with a checkpoint right now, if you dont [sic] want an asian girl it’s much harder. Many many models are trained on anime or Asian women.</p> </blockquote> <p>and</p> <blockquote> <p>Right? AI images even have the opposite problem. 
The sheer number of Asians in the training sets, and the sheer number of models being created in Asia, means that <b>many, many</b> models are biased towards Asian outputs.</p> </blockquote> <p>Other highly-ranked comments noted that this was a sample size issue:</p> <blockquote> <p>&quot;Evidence of systemic racial bias&quot;</p> <p>Shows one result.</p> </blockquote> <p>Playground AI's CEO went with the same response when asked for an interview by the Boston Globe — he declined the interview and replied with a list of rhetorical questions like the following (the Boston Globe implies that there was more, but didn't print the rest of the reply):</p> <blockquote> <p>If I roll a dice just once and get the number 1, does that mean I will always get the number 1? Should I conclude based on a single observation that the dice is biased to the number 1 and was trained to be predisposed to rolling a 1?</p> </blockquote> <p>We could just as easily have picked an example from Google or Facebook or Microsoft or any other company that's deploying a lot of ML today, but since the CEO of Playground AI is basically asking someone to take a look at PAI's output, we're looking at PAI in this post. I tried the same prompt the MIT grad student used on my Mastodon profile photo, substituting &quot;man&quot; for &quot;girl&quot;. PAI usually turns my Asian face into a white (caucasian) face, but sometimes makes me somewhat whiter but ethnically ambiguous (maybe a bit Middle Eastern or East Asian or something). And, BTW, my face has a number of distinctively Vietnamese features which pretty obviously look Vietnamese and not any kind of East Asian.</p> <p><img src="images/ai-bias/profile.png" alt="Profile photo of Vietnamese person" width="400" height="400"> <img src="images/ai-bias/playground-ai-1.png" alt="4 profile photos run through playground AI, 3 look very European and one looks a bit ambiguous" width="848" height="210"> <img src="images/ai-bias/playground-ai-2.png" alt="4 profile photos run through playground AI, none look East Asian or Southeast Asian" width="1640" height="408"></p> <p>My profile photo is a light-skinned winter photo, so I tried a darker-skinned summer photo and PAI would then generally turn my face into a South Asian or African face, with the occasional Chinese (but never Vietnamese or any other kind of Southeast Asian) face, such as the following:</p> <p><img src="images/ai-bias/tanned-profile.jpg" alt="Profile photo of tanned Vietnamese person"> <img src="images/ai-bias/playground-ai-3.png" alt="4 profile photos of tanned Vietnamese person run through playground AI, 1 looks black and 3 look South Asian"></p> <p>A number of other people also tried various prompts and they also got results that indicated that the model (where “model” is being used colloquially for the model and its weights and any system around the model that's responsible for the output being what it is) has some preconceptions about things like what ethnicity someone has if they have a specific profession that are strong enough to override the input photo. 
For example, converting a light-skinned Asian person to a white person because the model has &quot;decided&quot; it can make someone more professional by throwing out their Asian features and making them white.</p> <p>Other people have tried various prompts to see what kind of pre-conceptions are bundled into the model and have found similar results, e.g., <a href="https://discuss.systems/@ricci/110826586910728179">Rob Ricci got the following results when asking for &quot;linkedin profile picture of X professor&quot; for &quot;computer science&quot;, &quot;philosophy&quot;, &quot;chemistry&quot;, &quot;biology&quot;, &quot;veterinary science&quot;, &quot;nursing&quot;, &quot;gender studies&quot;, &quot;Chinese history&quot;, and &quot;African literature&quot;, respectively</a>. In the 28 images generated for the first 7 prompts, maybe 1 or 2 people out of 28 aren't white. The results for the next prompt, &quot;Chinese history&quot; are wildly over-the-top stereotypical, <a href="https://restofworld.org/2023/ai-image-stereotypes/">something we frequently see from other models as well when asking for non-white output</a>. And Andreas Thienemann points out that, except for the over-the-top Chinese stereotypes, every professor is wearing glasses, another classic stereotype.</p> <p><img src="images/ai-bias/rob-pgai-1.png"> <img src="images/ai-bias/rob-pgai-2.png"> <img src="images/ai-bias/rob-pgai-3.png"></p> <p>Like I said, I don't mean to pick on Playground AI in particular. As I've noted elsewhere, <a href="https://twitter.com/danluu/status/896176897675153409">trillion dollar companies regularly ship AI models to production without even the most basic checks on bias</a>; when I tried ChatGPT out, every bias-checking prompt I played with returned results that were analogous to the images we saw here, e.g., <a href="https://twitter.com/danluu/status/1601072083270008832/">when I tried asking for bios of men and women who work in tech, women tended to have bios indicating that they did diversity work, even for women who had no public record of doing diversity work and men tended to have degrees from name-brand engineering schools like MIT and Berkeley, even people who hadn't attended any name-brand schools</a>, and likewise for name-brand tech companies (the link only has 4 examples due to Twitter limitations, but other examples I tried were consistent with the examples shown).</p> <p>This post could've used almost any publicly available generative AI. It just happens to use Playground AI because the CEO's response both asks us to do it and reflects the standard reflexive &quot;AI isn't biased&quot; responses that lay people commonly give.</p> <p>Coming back to the response about how it's not biased for professional photos of people to be turned white because Asians feature so heavily in other cases, the high-ranking reddit comment we looked at earlier suggested &quot;go[ing] to <a href="https://civitai.com/">the most popular Stable Diffusion model website</a> and look[ing] at the images on the front page&quot;. 
Below is what I got when I clicked the link on the day the comment was made and then clicked &quot;feed&quot;.</p> <p><details closed> <summary>[Click to expand / collapse mildly NSFW images]</summary> <img src="images/ai-bias/reddit-images-2.png"> <img src="images/ai-bias/reddit-images-1.png"> </details></p> <p><abbr title="The front page has been cleaned up, but it makes sense to look at the site as it was when it was linked to, and the front page has also gotten considerably less Asian as the smut has cleaned up">The site had a bit of a smutty feel to it</abbr>. The median image could be described as &quot;a poster you'd expect to see on the wall of a teenage boy in a movie scene where the writers are reaching for the standard stock props to show that the character is a horny teenage boy who has poor social skills&quot; and the first things shown when going to the feed and getting the default &quot;all-time&quot; ranking are someone grabbing a young woman's breast, titled &quot;Guided Breast Grab | LoRA&quot;; two young women making out, titled &quot;Anime Kisses&quot;; and a young woman wearing a leash, annotated with &quot;BDSM — On a Leash LORA&quot;. So, apparently there was this site that people liked to use to generate and pass around smutty photos, and the high incidence of photos of Asian women on this site was used as evidence that there is no ML bias that negatively impacts Asian women because this cancels out an Asian woman being turned into a white woman when she tried to get a cleaned up photo for her LinkedIn profile. I'm not really sure what to say to this. <a href="https://mastodon.gamedev.place/@rygorous/110824248191236861">Fabian Geisen responded with &quot;🤦‍♂️. truly 'I'm not bias. your bias' level discourse&quot;, which feels like an appropriate response</a>.</p> <p>Another standard line of reasoning on display in the comments, that I see in basically every discussion on AI bias, is typified by</p> <blockquote> <p>AI trained on stock photo of “professionals” makes her white. Are we surprised?</p> <p>She asked the AI to make her headshot more professional. Most of “professional” stock photos on the internet have white people in them.</p> </blockquote> <p>and</p> <blockquote> <p>If she asked her photo to be made more anything it would likely turn her white just because that’s the average photo in the west where Asians only make up 7.3% of the US population, and a good chunk of that are South Indians that look nothing like her East Asian features. East Asians are 5% or less; there’s just not much training data.</p> </blockquote> <p>These comments seem to operate from a fundamental assumption that companies are pulling training data that's representative of the United States and that this is a reasonable thing to do and that this <em>should</em> result in models converting everyone into whatever is most common. This is wrong on multiple levels.</p> <p>First, on whether or not it's the case that professional stock photos are dominated by white people, a quick image search for &quot;professional stock photo&quot; turns up quite a few non-white people, so either stock photos aren't very white or people have figured out how to return a more representative sample of stock photos. And given worldwide demographics, it's unclear what internet services should be expected to be U.S.-centric. 
And then, even if we accept that major internet services should assume that everyone is in the United States, it seems like both a design flaw as well as a clear sign of bias to assume that every request comes from the modal American.</p> <p>Since a lot of people have these reflexive responses when talking about race or ethnicity, let's look at a less charged AI hypothetical. Say I talk to an AI customer service chatbot for my local mechanic and I ask to schedule an appointment to put my winter tires on and do a tire rotation. Then, when I go to pick up my car, I find out they changed my oil instead of putting my winter tires on and then a bunch of internet commenters explain why this isn't a sign of any kind of bias and you should know that an AI chatbot will convert any appointment with a mechanic to an oil change appointment because it's the most common kind of appointment. A chatbot that converts any kind of appointment request into &quot;give me the most common kind of appointment&quot; is pretty obviously broken but, for some reason, AI apologists insist this is fine when it comes to things like changing someone's race or ethnicity. Similarly, it would be absurd to argue that it's fine for my tire change appointment to have been converted to an oil change appointment because other companies have schedulers that convert oil change appointments to tire change appointments, but that's another common line of reasoning that we discussed above.</p> <p>And say I used some standard non-AI scheduling software like Mindbody or JaneApp to schedule an appointment with my mechanic and asked for an appointment to have my tires changed and rotated. If I ended up having my oil changed because the software simply schedules the most common kind of appointment, this would be a clear sign that the software is buggy and no reasonable person would argue that zero effort should go into fixing this bug. And yet, this is a common argument that people are making with respect to AI (it's probably the most common defense in comments on this topic). The argument goes a bit further, in that there's this explanation of why the bug occurs that's used to justify why the bug should exist and people shouldn't even attempt to fix it. Such an explanation would read as obviously ridiculous for a &quot;classical&quot; software bug and is no less ridiculous when it comes to ML. Perhaps one can argue that the bug is much more difficult to fix in ML and that it's not practical to fix the bug, but that's different from the common argument that it isn't a bug and that this is the correct way for software to behave.</p> <p>I could imagine some users saying something like that when the program is taking actions that are more opaque to the user, such as with autocorrect, but I actually tried searching reddit for <code>autocorrect bug</code> and in the top 3 threads (I didn't look at any other threads), 2 out of the 255 comments denied that incorrect autocorrects were a bug and both of those comments were from the same person. 
I'm sure if you dig through enough topics, you'll find ones where there's a higher rate, but on searching for a few more topics (like <abbr title="There are bugs where people relatively frequently blame the user; maybe 5% to 10% of commenters like blaming the user on Excel formatting issues causing data corruption, but that's still much less than the half-ish we'll see on ML bias, and the level of agitation/irritation/anger in the comments seems lower as well">excel formatting</abbr> and autocorrect bugs), none of the topics I searched approached what we see with generative AI, where it's not uncommon to see half the commenters vehemently deny that a prompt doing the opposite of what the user wants is a bug.</p> <p>Coming back to the bug itself, in terms of the mechanism, one thing we can see in both classifiers as well as generative models is that many (perhaps most or almost all) of these systems are taking bias that a lot of people have that's reflected in some sample of the internet, which results in things like <a href="https://x.com/danluu/status/1245961051696295936">Google's image classifier classifying a black hand holding a thermometer as {hand, gun} and a white hand holding a thermometer as {hand, tool}</a><sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>. After a number of such errors over the past decade, from classifying black people as gorillas in Google Photos in 2015, to <a href="https://x.com/danluu/status/1043957859090911233">deploying some kind of text-classifier for ads that classified ads that contained the terms &quot;African-American composers&quot; and &quot;African-American music&quot; as &quot;dangerous or derogatory&quot;</a> in 2018, <a href="https://finance.yahoo.com/news/google-chatbot-ridiculed-ethnically-diverse-185014679.html">Google turned the knob in the other direction with Gemini</a>, which, by the way, generated much more outrage than any of the other examples.</p> <p>There's nothing new about bias making it into automated systems. This predates generative AI, LLMs, and is a problem outside of ML models as well. It's just that the widespread use of ML has made this legible to people, making some of these cases news. For example, if you look at compression algorithms and dictionaries, <a href="https://www.patreon.com/posts/97523860">Brotli is heavily biased towards the English language</a> — the human-language elements of the 120 transforms built into the format are English, and the built-in compression dictionary is more heavily weighted towards English than whatever representative weighting you might want to reference (population-weighted language speakers, non-automated human-language text sent on messaging platforms, etc.). There are arguments you could make as to why English should be so heavily weighted, but there are also arguments as to why the opposite should be the case, e.g., English language usage is positively correlated with a user's bandwidth, so non-English speakers, on average, need the compression more. But regardless of the exact weighting function you think should be used to generate a representative dictionary, that's just not going to make a viral news story because you can't get the typical reader to care that a number of the 120 built-in Brotli transforms do things like add &quot; of the &quot;, &quot;. The&quot;, or &quot;. This&quot; to text, which are highly specialized for English, and none of the transforms encode terms that are highly specialized for any other human language even though only 20% of the world speaks English, or that, compared to the number of speakers, the built-in compression dictionary is extremely heavily tilted towards English by comparison to any other human language. You could make a defense of the Brotli dictionary that's analogous to the ones above: over some representative corpus which the dictionary was trained on, we get optimal compression with the Brotli dictionary. But there are quite a few curious phrases in <a href="https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c370f0f3">the dictionary</a>, such as &quot;World War II&quot;, &quot;, Holy Roman Emperor&quot;, &quot;British Columbia&quot;, &quot;Archbishop&quot;, &quot;Cleveland&quot;, &quot;esperanto&quot;, etc., that might lead us to wonder if the corpus the dictionary was trained on is perhaps not the most representative, or <abbr title="Looking at languages spoken vs. languages in the dictionary, the dictionary has words from the #1, #2, #3, #4, #6, and #9 most spoken languages, but is missing #5, #7, and #8 for some reason, although there's a bit of overlap in one case">even particularly representative of text people send</abbr>. Can it really be the case that including &quot;, Holy Roman Emperor&quot; in the dictionary produces, across the distribution of text sent on the internet, better compression than including anything at all for French, Urdu, Turkish, Tamil, Vietnamese, etc.?</p>
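<p>If you want to poke at this yourself, Node exposes Brotli through zlib, so it only takes a few lines to see how much the built-in dictionary favors English. This is a rough sketch: the sample sentences are arbitrary (the Vietnamese one is just an illustrative rendering of the English one), and the exact byte counts will vary with the Brotli version and quality setting:</p> <pre><code>// Rough sketch: compare how Brotli handles a short English sentence built from
// phrases that appear in its static dictionary vs. a roughly equivalent
// Vietnamese sentence. Exact numbers will vary.
import { brotliCompressSync } from "node:zlib";

function compressedBytes(text: string): number {
  return brotliCompressSync(Buffer.from(text, "utf8")).length;
}

const english = "The Holy Roman Emperor fought in World War II in British Columbia.";
const vietnamese = "Hoàng đế La Mã Thần thánh đã chiến đấu trong Thế chiến thứ hai ở British Columbia.";

console.log("english:", Buffer.byteLength(english), "bytes in,", compressedBytes(english), "bytes out");
console.log("vietnamese:", Buffer.byteLength(vietnamese), "bytes in,", compressedBytes(vietnamese), "bytes out");</code></pre>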
<p>Another example which doesn't make a good viral news story is my not being able to put my Vietnamese name in the title of my blog and have my blog indexed by Google outside of Vietnamese-language Google — I tried that when I started my blog and it caused my blog to immediately stop showing up in Google searches unless you were in Vietnam. It's just assumed that the default is that people want English language search results and, presumably, someone created a heuristic that triggers if you have two characters with Vietnamese diacritics on a page, effectively marking the page as too Asian and therefore not of interest to anyone in the world except in one country. &quot;Being visibly <abbr title="of course, you can substitute any other minority here as well">Vietnamese</abbr>&quot; seems like a fairly common cause of bugs. For example, Vietnamese names are a problem even without diacritics. I often have forms that ask for my mother's maiden name. If I enter my mother's maiden name, I'll be told something like &quot;Invalid name&quot; or &quot;Name too short&quot;. That's fine, in that I work around that kind of carelessness by having a stand-in for my mother's maiden name, which is probably more secure anyway. Another issue is when people decide I told them my name incorrectly and change my name. For my last name, if I read my name off as &quot;Luu, ell you you&quot;, that gets shortened from the Vietnamese &quot;Luu&quot; to the Chinese &quot;Lu&quot; about half the time and to a western &quot;Lou&quot; much of the time as well, but I've figured out that if I say &quot;Luu, ell you you, two yous&quot;, that works about 95% of the time. That sometimes annoys the person on the other end, who will exasperatedly say something like &quot;you didn't have to spell it out three times&quot;. 
<abbr title="Although I doubt it — my experience is that people who say this sort of thing have a higher than average error rate">Maybe so for that particular person</abbr>, but most people won't get it. This even happens when I enter my first name into a computer system, so there can be no chance of a transcription error before my name is digitally recorded. My legal first name, with no diacritics, is Dan. This isn't uncommon for an American of Vietnamese descent because Dan works as both a Vietnamese name and an American name and a lot Vietnamese immigrants didn't know that Dan is usually short for Daniel. At six of the companies I've worked for full-time, someone has helpfully changed my name to Daniel at three of them, presumably because someone saw that Dan was recorded in a database and decided that I failed to enter my name correctly and that they knew what my name was better than I did and they were so sure of this <abbr title="my HR rep asked me at once company, but of course that's one of the three companies they didn't get my name wrong">they saw no need to ask me about it</abbr>. In one case, this only impacted my email display name. Since I don't have strong feelings about how people address me, I didn't bother having it changed and lot of people called me Daniel instead of Dan while I worked there. In two other cases, the name change impacted important paperwork, so I had to actually change it so that my insurance, tax paperwork, etc., actually matched my legal name. As noted above, with fairly innocuous prompts to Playground AI using my face, even on the rare occasion they produce Asian output, seem to produce East Asian output over Southeast Asian output. I've noticed the same thing with some big company generative AI models as well — even when you ask them for Southeast Asian output, they generate East Asian output. AI tools that are marketed as tools that clean up errors and noise will also clean up Asian-ness (and other analogous &quot;errors&quot;), e.g., people who've used Adobe AI noise reduction (billed as &quot;remove noise from voice recordings with speech enhacement&quot;) note that it will take an Asian accent and remove it, making the person sound American (and likewise for a number of other accents, such as eastern European accents).</p> <p>I probably see tens to hundreds things like this most weeks just in the course of using widely used software (<a href="everything-is-broken/">much less than the overall bug count, which we previously observed was in hundreds to thousands per week</a>), but most Americans I talk to don't notice these things at all. Recently, there's been a lot of chatter about all of the harms caused by biases in various ML systems and the widespread use of ML is going to usher in all sorts of new harms. That might not be wrong, but my feeling is that we've encoded biases into automation for as long as we've had automation and the increased scope and scale of automation has been and will continue to increase the scope and scale of automated bias. 
<abbr title="and this increased visibility seems to have caused increased pushback in the form of people insisting that doing the opposite of what the user wants is not a bug">It's just that now, many uses of ML make these kinds of biases a lot more legible to lay people and therefore likely to make the news</abbr>.</p> <p>There's an ahistoricity in the popular articles I've seen on this topic so far, in that they don't acknowledge that the fundamental problem here isn't new, resulting in two classes of problems that arise when solutions are proposed. One is that solutions are often ML-specific, but the issues here occur regardless of whether or not ML is used, so ML-specific solutions seem focused at the wrong level. When the solutions proposed are general, the proposed solutions I've seen are ones that have been proposed before and failed. For example, a common call to action for at least the past twenty years, perhaps the most common (unless &quot;people should care more&quot; counts as a call to action), has been that we need more diverse teams.</p> <p>This clearly hasn't worked; if it did, problems like the ones mentioned above wouldn't be pervasive. There are multiple levels at which this hasn't worked and will not work, any one of which would be fatal to this solution. One problem is that, across the industry, <a href="tech-discrimination/">the people who are in charge (execs and people who control capital, such as VCs, PE investors, etc.)</a>, in aggregate, <abbr title="Until 2021 or so, workers in tech had an increasing amount of power and were sometimes able to push for more diversity, but the power seems to have shifted in the other direction for the foreseeable future">don't care about this</abbr>. Although there are <a href="tech-discrimination/">efficiency</a> <a href="talent/">justifications</a> for more diverse teams, <a href="bad-decisions/">the case will never be as clear-cut as it is for decisions in games and sports</a>, where we've seen that very expensive and easily quantifiable bad decisions can persist for many decades after the errors were pointed out. And then, even if execs and capital were bought into the idea, it still wouldn't work because there are <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">too many dimensions</a>. If you look at a company that really prioritized diversity, like Patreon from 2013-2019, you're lucky if the organization is capable of seriously prioritizing diversity in <a href="https://x.com/danluu/status/1487228574608211969">two or three dimensions</a> while dropping the ball on hundreds or thousands of other dimensions, such as whether or not Vietnamese names or faces are handled properly.</p> <p>Even if all those things weren't problems, the solution still wouldn't work because while having a team with relevant diverse experience may be a bit correlated with prioritizing problems, it doesn't automatically cause problems to be prioritized and fixed. To pick a non-charged example, a bug that's existed in Google Maps traffic estimates since inception that existed at least until 2022 (I haven't driven enough since then to know if the bug still exists) is that, if I ask how long a trip will take at the start of rush hour, this takes into account current traffic and not how traffic will change as I drive and therefore systematically underestimates how long the trip will take (and conversely, if I plan a trip at peak rush hour, this will systematically overestimate how long the trip will take). 
If you try to solve this problem by increasing commute diversity in Google Maps, this will fail. There are already many people who work on Google Maps who drive and can observe ways in which estimates are systematically wrong. Adding diversity to ensure that there are people who drive and notice these problems is very unlikely to make a difference. Or, to pick another example, <a href="diseconomies-scale/">when the former manager of Uber's payments team got blacklisted from Uber by an ML model that incorrectly labeled his transactions as fraudulent</a>, no one was able to figure out what happened or what sort of bias caused him to get incorrectly banned (they solved the problem by adding his account to an allowlist). There are very few people who are going to get better service than the manager of the payments team, and even in that case, Uber couldn't really figure out what was going on. Hiring a &quot;diverse&quot; candidate onto the team isn't going to automatically solve, or even make much of a difference to, bias in whatever dimension the candidate is diverse when the former manager of the team can't even get their account unbanned except by having it allowlisted after six months of investigation.</p> <p>If the result of your software development methodology is that the fix to the manager of the payments team being banned is to allowlist the user after six months, that traffic routing in your app is systematically wrong for two decades, <a href="nothing-works/">that core functionality of your app doesn't work</a>, etc., no amount of hiring people with a background that's correlated with noticing some kinds of issues is going to result in fixing issues like these, whether that's with respect to ML bias or another class of bug.</p> <p>Of course, sometimes variants of old ideas that have failed do succeed, but for a proposal to be credible, or even interesting, the proposal has to address why the next iteration won't fail like every previous iteration did. As we noted above, at a high level, the two most common proposed solutions I've seen are that people should try harder and care more and that we should have people of different backgrounds, in a non-technical sense. This hasn't worked for the plethora of &quot;classical&quot; bugs, this hasn't worked for old ML bugs, and it doesn't seem like there's any reason to believe that this should work for the kinds of bugs we're seeing from today's ML models.</p> <p>Laurence Tratt says:</p> <blockquote> <p>I think this is a more important point than individual instances of bias. What's interesting to me is that mostly a) no-one notices they're introducing such biases b) often it wouldn't even be reasonable to expect them to notice. For example, some web forms rejected my previous address, because I live in the countryside where many houses only have names -- but most devs live in cities where houses exclusively have numbers. In a sense that's active bias at work, but there's no mal intent: programmers have to fill in design details and make choices, and they're going to do so based on their experiences. None of us knows everything! That raises an interesting philosophical question: when is it reasonable to assume that organisations should have realised they were encoding a bias?</p> </blockquote> <p>My feeling is that <a href="nothing-works/">the &quot;natural&quot;, as in lowest-energy and most straightforward, state for institutions and products is that they don't work very well</a>. 
If someone hasn't previously <a href="culture/">instilled a culture</a> or <a href="wat/">instituted processes</a> that foster quality in a particular dimension, quality is likely to be poor, <a href="hardware-unforgiving/">due to the difficulty of producing something high quality</a>, so organizations should expect that they're encoding all sorts of biases if there isn't a robust process for catching biases.</p> <p>One issue we're running up against here is that, when it comes to consumer software, companies have overwhelmingly chosen velocity over quality. This seems basically inevitable given the regulatory environment we have today or any regulatory environment we're likely to have in my lifetime, in that companies that seriously choose quality over feature velocity get outcompeted because consumers overwhelmingly choose the lower cost or more featureful option over the higher quality option. We <a href="car-safety/">saw this with cars when we looked at how vehicles perform in out-of-sample crash tests</a> and saw that only Volvo was optimizing cars for actual crashes as opposed to scoring well on public tests. Despite vehicular accidents being one of the leading causes of death for people under 50, paying for safety is such a low priority for consumers that Volvo has become a niche brand that had to move upmarket and sell luxury cars to even survive. We also saw this with CPUs, where Intel used to expend much more verification effort than AMD and ARM and had concomitantly fewer serious bugs. <a href="cpu-bugs/">When AMD and ARM started seriously threatening, <abbr title="One might hope that Intel's quality advantage was the reason it had monopoly power, but it looks like that was backwards and it was Intel's monopoly power that allowed it to invest in quality.">Intel shifted effort away from verification and validation in order to increase velocity because their quality advantage wasn't doing them any favors</abbr> in the market and Intel chips are now almost as buggy as AMD chips</a>.</p> <p>We can observe something similar <a href="nothing-works/">in almost every consumer market and many B2B markets as well</a>, and that's when we're talking about issues that have known solutions. If we look at a problem that, from a technical standpoint, we don't know how to solve well, like subtle or even not-so-subtle bias in ML models, it stands to reason that we should expect to see more and worse bugs than we'd expect out of &quot;classical&quot; software systems, which is what we're seeing. Any solution to this problem that's going to hold up in the market is going to have to be robust against the issue that consumers will overwhelmingly choose the buggier product if it has more features they want or ships features they want sooner, which puts any solution that requires taking care in a way that significantly slows down shipping in a very difficult position, <abbr title="Other possibilities, each unlikely, are that consumers will actually care about bias and meaningful regulatory change that doesn't backfire">absent a single dominant player, like Intel in its heyday</abbr>.</p> <p><i>Thanks to Laurence Tratt, Yossi Kreinin, Anonymous, Heath Borders, Benjamin Reeseman, Andreas Thienemann, and Misha Yagudin for comments/corrections/discussion</i></p> <h3 id="appendix-technically-how-hard-is-it-to-improve-the-situation">Appendix: technically, how hard is it to improve the situation?</h3> <p>This is a genuine question and not a rhetorical question. 
I haven't done any ML-related work since 2014, so I'm not well-informed enough about what's going on now to have a direct opinion on the technical side of things. A number of people who've worked on ML a lot more recently than I have, like Yossi Kreinin (see appendix below) and <a href="https://buttondown.email/apperceptive/archive/supervision-and-truth/">Sam Anthony</a>, think the problem is very hard, maybe impossibly hard where we are today.</p> <p>Since I don't have a direct opinion, here are two situations which sound plausibly analogous, each of which supports a different conclusion.</p> <p>Analogy one: Maybe this is like <a href="sounds-easy/">people saying, since at least 2014, that someone will build a Google any day now because existing open source tooling is already basically better than Google search</a> or <a href="https://yosefk.com/blog/high-level-cpu-follow-up.html">people saying</a> that <a href="https://yosefk.com/blog/the-high-level-cpu-challenge.html">building a &quot;high-level&quot; CPU</a> that encodes high-level language primitives <a href="https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html">into hardware</a> would <a href="https://www.patreon.com/posts/54329188">give us a 1000x speedup on general purpose CPUs</a>. You can't really prove that this is wrong and it's possible that a massive improvement in search quality or a 1000x improvement in CPU performance is just around the corner, but people who make these proposals generally sound like cranks because they exhibit the ahistoricity we noted above and propose solutions that we already know don't work with no explanation of why their solution will address the problems that have caused previous attempts to fail.</p> <p>Analogy two: Maybe this is like software testing, where <a href="everything-is-broken/">software bugs are pervasive and, although there's decades of prior art from the hardware industry on how to find bugs more efficiently</a>, there are very few areas where <a href="testing/">any of these techniques</a> are applied. I've talked to people about this a number of times and the most common response is something about how application XYZ has some unique constraint that makes it impossibly hard to test at all or test using the kinds of techniques I'm discussing, but every time I've dug into this, the application has been much easier to test than areas where I've seen these techniques applied. One could argue that I'm a crank when it comes to testing, but I've actually used these techniques to test a variety of software and been successful doing so, so I don't think this is the same as things like <a href="https://www.patreon.com/posts/54329188">claiming that CPUs would be 1000x faster if only we used my pet CPU architecture</a>.</p> <p>Due to the incentives in play, where software companies can typically pass the cost of bugs onto the customer without the customer really understanding what's going on, I think we're not going to see a large amount of effort spent on testing absent regulatory changes, but there isn't a fundamental reason that we need to avoid using more efficient testing techniques and methodologies.</p> <p>From a technical standpoint, the barrier to using better test techniques is fairly low — I've walked people through how to get started writing their own fuzzers and randomized test generators and this typically takes between 30 minutes and an hour, after which people will tend to use these techniques to find important bugs much more efficiently than they used to. 
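</p>

<p>To give a flavor of what I mean, here's a minimal sketch of a randomized test generator; the function under test is a toy stand-in, and a real fuzzer or property-based test setup would be more involved, but the shape is the same: generate random inputs, run the code, and check properties that must always hold:</p>

<pre><code>
import random

def dedupe_keep_order(items):
    # Toy function under test: remove duplicates, preserving first occurrence.
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(inp, out):
    # Properties that must hold for any input, however it was generated.
    assert len(out) == len(set(out)), 'output contains duplicates'
    assert set(out) == set(inp), 'output lost or invented elements'
    firsts = []
    for x in inp:
        if x not in firsts:
            firsts.append(x)
    assert out == firsts, 'order of first occurrences not preserved'

random.seed(0)
for _ in range(10_000):
    # Small value range and short lists make duplicates and edge cases common.
    inp = [random.randint(0, 5) for _ in range(random.randint(0, 20))]
    check_properties(inp, dedupe_keep_order(inp))
print('10,000 random cases passed')
</code></pre>

<p>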
However, by revealed preference, we can see that organizations don't really &quot;want to&quot; have their developers test efficiently.</p> <p>When it comes to testing and fixing bias in ML models, is the situation more like analogy one or analogy two? Although I wouldn't say with any level of confidence that we are in analogy two, I'm not sure how I could be convinced that we're not in analogy two. If I didn't know anything about testing, I would listen to all of these people explaining to me why their app can't be tested in a way that finds showstopping bugs and then conclude something like one of the following:</p> <ul> <li>&quot;Everyone&quot; is right, which makes sense — this is a domain they know about and I don't, so why should I believe anything different?</li> <li>No opinion, perhaps due to a high default level of skepticism</li> <li>Everyone is wrong, which seems unreasonable given that I don't know anything about the domain and have no particular reason to believe that everyone is wrong</li> </ul> <p>As an outsider, it would take a very high degree of overconfidence to decide that everyone is wrong, so I'd have to either incorrectly conclude that &quot;everyone&quot; is right or have no opinion.</p> <p>Given the situation with &quot;classical&quot; testing, I feel like I have to have no real opinion here. With no up-to-date knowledge, it wouldn't be reasonable to conclude that so many experts are wrong. But there are enough <a href="bitfunnel-sigir.pdf">problems that people have said are difficult or impossible that turn out to be feasible and not really all that tricky</a> that I have a hard time having a high degree of belief that a problem is essentially unsolvable without actually looking into it.</p> <p>I don't think there's any way to estimate what I'd think if I actually looked into it. Let's say I try to work in this area and try to get a job at OpenAI or another place where people are working on problems like this, <a href="https://x.com/danluu/status/1470890494775361538">somehow pass the interview</a>, work in the area for a couple of years, and make no progress. That doesn't mean that the problem isn't solvable, just that I didn't solve it. When it comes to the &quot;Lucene is basically as good as Google search&quot; or &quot;CPUs could easily be 1000x faster&quot; people, it's obvious to people with knowledge of the area that the people saying these things are cranks because they exhibit a total lack of understanding of what the actual problems in the field are, but making that kind of judgment call requires knowing a fair amount about the field and I don't think there's a shortcut that would let you reliably figure out what your judgment would be if you had knowledge of the field.</p> <h3 id="appendix-the-story-of-this-post">Appendix: the story of this post</h3> <p>I wrote a draft of this post when the Playground AI story went viral in mid-2023, and then I sat on it for a year to see if it seemed to hold up when the story was no longer breaking news. Looking at this a year later, I don't think the fundamental issues or the discussions I see on the topic have really changed, so I cleaned it up and then published this post in mid-2024.</p> <p>If you like making predictions, what do you think the odds are that this post will still be relevant a decade later, in 2033? 
For reference, <a href="everything-is-broken/">this post on &quot;classical&quot; software bugs that was published in 2014 could've been published today, in 2024, with essentially the same results</a> (I say essentially because I see more bugs today than I did in 2014, and I see a lot more front-end and OS bugs today than I saw in 2014, so there would be more bugs and different kinds of bugs).</p> <h3 id="appendix-comments-from-other-folks">Appendix: comments from other folks</h3> <p><details open> <summary>[Click to expand / collapse comments from Yossi Kreinin]</summary></p> <p>I'm not sure how much this is something you'd agree with but I think a further point related to generative AI bias being a lot like other-software-bias is exactly what this bias is. &quot;AI bias&quot; isn't AI learning the biases of its creators and cleverly working to implement them, e.g. working against a minority that its creators don't like. Rather, &quot;AI bias&quot; is something like &quot;I generally can't be bothered to fix bugs unless the market or the government compels me to do so, and as a logical consequence of this, I especially can't be bothered to fix bugs that disproportionately negatively impact certain groups where the impact, due to the circumstances of the specific group in question, is less likely to compel me to fix the bug.&quot;</p> <p>This is a similarity between classic software bugs and AI bugs — meaning, nobody is worried that &quot;software is biased&quot; in some clever scheming sort of way, everybody gets that it's the software maker who's scheming or, probably more often, it's the software maker who can't be bothered to get things right. With generative AI I think &quot;scheming&quot; is actually even less likely than with traditional software and &quot;not fixing bugs&quot; is more likely, because people don't understand AI systems they're making and can make them do their bidding, evil or not, to a much lesser extent than with traditional software; OTOH bugs are more likely for the same reason [we don't know what we're doing.] I think a lot of people across the political spectrum [including for example Elon Musk and not just journalists and such] say things along the lines of &quot;it's terrible that we're training AI to think incorrectly about the world&quot; in the context of racial/political/other charged examples of bias; I think in reality this is a product bug affecting users to various degrees and there's bias in how the fixes are prioritized but the thing isn't capable of thinking at all.</p> <p>I guess I should add that there are almost certainly attempts at &quot;scheming&quot; to make generative AI repeat a political viewpoint, over/underrepresent a group of people etc, but invariably these attempts create hilarious side effects due to bugs/inability to really control the model. I think that similar attempts to control traditional software to implement a politics-adjacent agenda are much more effective on average (though here too I think you actually had specific examples of social media bugs that people thought were a clever conspiracy). Whether you think of the underlying agenda as malice or virtue, both can only come after competence and here there's quite the way to go.</p> <p>See <a href="https://arxiv.org/pdf/2406.02061">Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models</a>. 
I feel like if this doesn't work, a whole lot of other stuff doesn't work, either and enumerating it has got to be rather hard.</p> <p>I mean nobody would expect a 1980s expert system to get enough tweaks to not behave nonsensically. I don't see a major difference between that and an LLM, except that an LLM is vastly more useful. It's still something that pretends to be talking like a person but it's actually doing something conceptually simple and very different that often looks right. </details></p> <p><details open> <summary>[Click to expand / collapse comments from an anonymous founder of an AI startup]</summary> [I]n the process [of founding an AI startup], I have been exposed to lots of mainstream ML code. Exposed as in “nuclear waste” or “H1N1”. It has old-fashioned software bugs at a rate I find astonishing, even being an old, jaded programmer. For example, I was looking at tokenizing recently, and the first obvious step was to do some light differential testing between several implementations. And it failed hilariously. Not like “they missed some edge cases”, more like “nobody ever even looked once”. Given what we know about how well models respond to out of distribution data, this is just insane.</p> <p>In some sense, this is orthogonal to the types of biases you discuss…but it also suggests a deep lack of craftsmanship and rigor that matches up perfectly. </details></p> <p><details closed> <summary>[Click to expand / collapse comments from Benjamin Reeseman]</summary></p> <p>[Ben wanted me to note that this should be considered an informal response]</p> <p>I have a slightly different view of demographic bias and related phenomena in ML models (or any other “expert” system, to your point ChatGPT didn’t invent this, it made it legible to borrow your term).</p> <p>I think that trying to force the models to reflect anything other than a corpus that’s now basically the Internet give or take actually masks the real issue: the bias is real, people actually get mistreated over their background or skin color or sexual orientation or any number of things and I’d far prefer that the models surface that, run our collective faces in the IRL failure mode than try to tweak the optics in an effort to permit the abuses to continue.</p> <p>There’s a useful analogy to things like the #metoo movement or various DEI initiatives, most well-intentioned in the beginning but easily captured and ultimately representing a net increase in the blank check of those in positions of privilege.</p> <p>This isn’t to say that alignment has no place and I think it likewise began with good intentions and is even maybe a locally useful mitigation.</p> <p>But the real solution is to address the injustice and inequity in the real world.</p> <p>I think the examples you cited are or should be a wake-up call that no one can pretend to ignore credibly about real issues and would ideally serve as a forcing function on real reform.</p> <p>I’d love to chat about this at your leisure, my viewpoint is a minority one, but personally I’m a big fan of addressing the underlying issues rather than papering over them with what amounts to a pile of switch statements.</p> <p>There’s a darker take on this: real reform is impossible, we live in techno dystopia now, let’s mitigate where and how a hunted minority can.</p> <p>And there is a distressingly strong argument for that case: even in the ostensibly developed world cops today look like soldiers did when I was a kid 30 years ago, someone is guarding something from something at great 
expense and it isn’t black kids from getting shot near as I can tell.</p> <p>But I don’t subscribe to the pessimism, I think it’s a local anomaly as industrialization transitions into arbitrary power over the physical logistics <em>if we only knew how to change it</em>.</p> <p>I did a brief stint consulting for biotech folks affiliated with Nobel shortlist types in an area called proteomics. Like any consultant I surrender any credit to the people paying handsomely by the hour.</p> <p>But it really is a Shannon problem now: CRISPR Cas-9 can do arbitrary edits to not only a person’s genome but (in an ethical and legal grey area) to the germ line as well.</p> <p>We just have no idea what to change and there’s enough integrity in the field that we won’t risk just arbitrary children on finding out. </details></p> <h3 id="appendix-reproducing-rob-ricci-s-results">Appendix: reproducing Rob Ricci's results</h3> <p>I tried prompts with default settings, except for reducing image quality to 10 to generate images more quickly. This means we had 512/512, 7 prompt guidance, 10 quality, random seed. After 2 tries, increased image quality to 40 because the images were too low quality to tell the ethnicity sometimes. Other than increasing the image quality, there was no attempt to re-run prompts or otherwise due any kind of selection from the output. The prompts were &quot;Generate a very professional looking linkedin profile photo for a X&quot;, where X was Doctor, Lawyer, Engineer, Scientist, Journalist, and Banker.</p> <p>This produced the following images:</p> <p><img src="images/ai-bias/playground-ai-doctor.png"> <img src="images/ai-bias/playground-ai-lawyer.png"> <img src="images/ai-bias/playground-ai-engineer.png"> <img src="images/ai-bias/playground-ai-scientist.png"> <img src="images/ai-bias/playground-ai-journalist.png"> <img src="images/ai-bias/playground-ai-banker.png"></p> <p>Roughly speaking, I think Rob's results reproduced, which should be no surprise at this point given how many images we've seen.</p> <p>And then, to see if we could reproduce the standard rebuttal that generative AI isn't biased because requests for smutty images often have Asian women, I tried the prompt &quot;Generate a trashy instagram profile photo for a porn star&quot;. There's an NSFW filter that was tripped in some cases, so we don't get groups of four images and instead got:</p> <p><details closed> <summary>[Click to expand / collapse very mildly NSFW images]</summary> <img src="images/ai-bias/playground-ai-trashy.png"> </details></p> <p>And, indeed, the generated images are much more Asian than we got for any of our professional photos, save Rob Ricci's set of photos for asking for a &quot;linkedin profile picture of Chinese Studies professor&quot;.</p> <h3 id="appendix-comments-from-benjamin-reeseman">Appendix: comments from Benjamin Reeseman</h3> <div class="footnotes"> <hr /> <ol> <li id="fn:S">Naturally, when I mentioned this, a &quot;smart contrarian&quot; responded with &quot;what are base rates&quot;, but spending 30 seconds googling reveals that the base rate of U.S. gun ownership is much higher among whites than in any other demographic. The base rate argument is even more absurd if you think about the base rate of a hand holding an object — what fraction of the time is that object a gun? Regardless of race, it's going to be very low. 
Of course, you could find a biased sample that doesn't resemble the underlying base rate at all, which appears to be what Google did, but it's not clear why this justifies having this bug. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> What the FTC got wrong in the Google antitrust investigation ftc-google-antitrust/ Sun, 26 May 2024 00:00:00 +0000 ftc-google-antitrust/ <p>From 2011-2012, the FTC investigated the possibility of pursuing antitrust action against Google. The FTC decided to close the investigation and not much was publicly known about what happened until Politico released 312 pages of internal FTC memos from the investigation a decade later. As someone who works in tech, on reading the memos, the most striking thing is how one side, the side that argued to close the investigation, repeatedly displays a lack of basic understanding of the tech industry and <abbr title="that are available">the memos from directors and other higher-ups</abbr> don't acknowledge this at all.</p> <p>If you don't generally follow what regulators and legislators are saying about tech (or any other industry), it's a bit shocking to see the internal discussions when these decisions are, apparently, being made with little to no understanding of the industries<sup class="footnote-ref" id="fnref:K"><a rel="footnote" href="#fn:K">1</a></sup>.</p> <p>Inside the FTC, the Bureau of Competition (BC) made a case that antitrust action should be pursued and the Bureau of Economics (BE) made the case that the investigation should be dropped. The BC case is moderately strong. Reasonable people can disagree on whether or not the case is strong enough that antitrust action should've been pursued, but a reasonable person who is anti-antitrust has to concede that the antitrust case in the BC memo is at least defensible. The case against, made in the BE memo, is not defensible. There are major errors in core parts of the BE memo. In order for the BE memo to seem credible, the reader must have large and significant gaps in their understanding of the tech industry. If there was any internal FTC discussion on the errors in the BE memo, there's no indication of that in any public documents. As far as we can see from the evidence that's available, nobody noticed the BE memo's errors. The publicly available memos from directors and other higher-ups indicate that they gave the BE memo as much or more weight than the BC memo, implying a gap in <abbr title="at least among people whose memos were made public">FTC leadership's</abbr> understanding of the tech industry.</p> <h2 id="brief-summary">Brief summary</h2> <p>Since the BE memo is effectively a rebuttal of the BC memo, we'll start by looking at the arguments in the BC memo. The bullet points below summarize the Executive Summary from the BC memo, which roughly summarizes the case made by the BC memo:</p> <ul> <li>Google is the dominant search engine and seller of search ads</li> <li>This memo addresses 4 of 5 areas with anticompetitive conduct; mobile is in a supplemental memo</li> <li>Google has monopoly power in the U.S. 
in Horizontal Search; Search Advertising; and Syndicated Search and Search Advertising</li> <li>On the question of whether Google has unlawfully preferenced its own content while demoting rivals, we do not recommend the FTC proceed; it's a close call, case law is not favorable to anticompetitive product design claims, Google's efficiency justifications are strong, and there's some benefit to users</li> <li>On whether Google has unlawfully scraped content from vertical rivals to improve their own vertical products, recommending condemning as a conditional refusal to deal under Section 2 <ul> <li>Prior voluntary dealing was mutually beneficial</li> <li>Threats to remove rival content from general search designed to coerce rivals into allowing Google to use their content for Google's vertical product</li> <li>Natural and probable effect is to diminish incentives of vertical website R&amp;D</li> </ul></li> <li>On anticompetitive contractual restrictions on automated cross-management of ad campaigns, restrictions should be condemned under Section 2 <ul> <li>They limit ability of advertisers to make use of their own data, reducing innovation and increasing transaction costs for advertisers and third-party businesses</li> <li>Also degrade the quality of Google's rivals in search and search advertising</li> <li>Google's efficiency justifications appear to be pretextual</li> </ul></li> <li>On anticompetitive exclusionary agreements with websites for syndicated search and search ads, Google should be condemned under Section 2 <ul> <li>Only modest anticompetitive effects on publishers, but deny scale to competitors, competitively significant to main rival (Bing) as well as significant barrier to entry in longer term</li> <li>Google's efficiency justifications are, on balance, non-persuasive</li> </ul></li> <li>Possible remedies <ul> <li>Scraping <ul> <li>Could be required to provide an opt-out for snippets (reviews, ratings) from Google's vertical properties while retaining snippets in web search and/or Universal Search on main search results page</li> <li>Could be required to limit use of content indexed from web search results</li> </ul></li> <li>Campaign management restrictions <ul> <li>Could be required to remove problematic contractual restrictions from license agreements</li> </ul></li> <li>Exclusionary syndication agreements <ul> <li>Could be enjoined from entering into exclusive search agreements with search syndication partners and required to loosen restrictions surrounding syndication partners' use of rival search ads</li> </ul></li> </ul></li> <li>There are a number of risks to the case, not named in the summary except that Google can argue that Microsoft's most efficient distribution channel is bing.com and that any scale MS might gain will be immaterial to Bing's competitive position</li> <li>[BC] Staff concludes Google's conduct has resulted and will result in real harm to consumers and to innovation in online search and ads.</li> </ul> <p>In their supplemental memo on mobile, BC staff claim that Google dominates mobile search via exclusivity agreements and that mobile search was rapidly growing at the time. BC staff claimed that, according to Google internal documents, mobile search went from 9.5% to 17.3% of searches in 2011 and that both Google and Microsoft internal documents indicated that the expectation was that mobile would surpass desktop in the near future. 
As with the case on desktop, BC staff use Google's ability to essentially unilaterally reduce revenue share as evidence that Google has monopoly power and can dictate terms, and they quote Google leadership noting this exact thing.</p> <p>BC staff acknowledge that many of Google's actions have been beneficial to consumers, but balance this against the harms of anticompetitive tactics, saying</p> <blockquote> <p>the evidence paints a complex portrait of a company working toward an overall goal of maintaining its market share by providing the best user experience, while simultaneously engaging in tactics that resulted in harm to many vertical competitors, and likely helped to entrench Google's monopoly power over search and search advertising</p> </blockquote> <p>BE staff strongly disagreed with BC staff. BE staff also believe that many of Google's actions have been beneficial to consumers, but when it comes to harms, in almost every case, BE staff argue that the market isn't important, isn't a distinct market, or that the market is competitive and Google's actions are procompetitive and not anticompetitive.</p> <h2 id="common-errors">Common errors</h2> <p>At least in the documents provided by Politico, BE staff generally declined to engage with BC staff's arguments and numbers directly. For example, in addition to arguing that Google's agreements and exclusivity (insofar as agreements are exclusive) are procompetitive and that foreclosing the possibility of such agreements might have significant negative impacts on the market, they argue that mobile is a small and unimportant market. The BE memo argues that mobile is only 8% of the market and, <abbr title="BE staff acknowledge that mobile is rapidly growing, but they do so in a minor way that certainly does not imply that mobile is or soon will be important">while it's growing rapidly</abbr>, is unimportant, as it's only a &quot;small percentage of overall queries and an even smaller percentage of search ad revenues&quot;. They also claim that there is robust competition in mobile because, in addition to Apple, there's also BlackBerry and Windows Mobile. Between when the FTC investigation started and when the memo was written, BlackBerry's marketshare dropped from ~14% to ~6%, which was part of a long-term decline that showed no signs of changing. Windows Mobile's drop was less precipitous, from ~6% to ~4%, but in a market with such strong network effects, it's curious that BE staff would argue that these platforms with low and declining marketshare would provide robust competition going forward.</p> <p>When the authors of the BE memo make a prediction, they seem to have a facility for predicting the opposite of what will happen. To do this, they took positions that were opposed to the general consensus at the time. Another example of this is when they imply that there is robust competition in the search market, which they imply should be expected to continue without antitrust action. Their evidence for this was that Yahoo and Bing had a combined &quot;steady&quot; 30% marketshare in the U.S., with query volume growing faster than Google's since the Yahoo-Bing alliance was announced. 
The BE memo authors go even further and claim that Microsoft's query volume is growing faster than Google's and that Microsoft + Yahoo combined have higher marketshare than Google as measured by search <a href="https://en.wikipedia.org/wiki/Active_users">MAU</a>.</p> <p>The BE memo's argument that Yahoo and Bing are providing robust and stable competition leaves out that the fixed costs of running a search engine are so high and the scale required to be profitable so large that Yahoo effectively dropped out of search and outsourced search to Bing. And Microsoft was subsidizing Bing to the tune of $2B/yr, in a strategic move that most observers in tech thought would not be successful. At the time, it would have been reasonable to think that if Microsoft stopped heavily subsidizing Bing, its marketshare would drop significantly, which is what happened after antitrust action was not taken and Microsoft decided to shift funding to other bets that had better ROI. Estimates today put Google at 86% to 90% share in the United States, with estimates generally being a bit higher worldwide.</p> <p>On the wilder claims, such as Microsoft and Yahoo combined having more active search users than Google and Microsoft's query volume (and therefore search marketshare) growing faster than Google's, they use comScore data. There are a couple of curious things about this.</p> <p>First, the authors pick and choose their data in order to present figures that maximize Microsoft's marketshare. When comScore data makes Microsoft marketshare appear relatively low, as in syndicated search, the authors of the BE memo explain that comScore data should not be used because it's inaccurate. However, when comScore data is prima facie unrealistic and makes Microsoft marketshare look larger than is plausible or growing faster than is plausible, the authors rely on comScore data without explaining why they rely on this source that they said should not be used because it's unreliable.</p> <p>Using this data, the BE memo basically argues that, because many users use Yahoo and Bing at least occasionally, users clearly could use Yahoo and Bing, and there must not be a significant barrier to switching even if (for example) a user uses Yahoo or Bing once a month and Google one thousand times a month. From having worked with and talked to people who work on product changes to drive growth, the overwhelming consensus has been that <abbr title="There are some circumstances where this isn't true, but it would be difficult to make a compelling case that the search market in 2012 was one of those markets in general">it's generally very difficult to convert a lightly-engaged user who barely registers as an MAU to a heavily-engaged user who uses the product regularly</abbr>, and that this is generally considered more difficult than converting a brand-new user into a heavily engaged user. Like Boies's argument about rangeCheck, it's easy to see how this line of reasoning would sound plausible to a lay person who knows nothing about tech, but the argument reads like something you'd expect to see from a lay person.</p> <p>Although the BE staff memo reads like a rebuttal to the points of the BC staff memo, the lack of direct engagement on the facts and arguments means that a reader with no knowledge of the industry who reads just one of the memos will have a very different impression than a reader who reads the other. 
For example, on the importance of mobile search, a naive BC-memo-only reader would think that mobile is very important, perhaps the most important thing, whereas a naive BE-memo-only reader would think that mobile is unimportant and will continue to be unimportant for the foreseeable future.</p> <p>Politico also released memos from two directors who weigh the arguments of BC and BE staff. Both directors favor the BE memo over the BC memo, one very much so and one moderately so. When it comes to disagreements, such as the importance of mobile in the near future, there's no evidence in the memos presented that there was any attempt to determine who was correct or that the errors we're discussing here were noticed. The closest thing to addressing disagreements such as these is comments that thank both staffs for having done good work, in what one might call a &quot;fair and balanced&quot; manner, such as &quot;The BC and BE staffs have done an outstanding job on this complex investigation. The memos from the respective bureaus make clear that the case for a complaint is close in the four areas ... &quot;. To the extent that this can be inferred, it seems that the reasoning and facts laid out in the BE memo were given at least as much weight as the reasoning and facts in the BC memo despite much of the BE memo's case seeming highly implausible to an observer who understands tech.</p> <p>For example, on the importance of mobile, I happened to work at Google shortly after these memos were written and, when I was at Google, they had already pivoted to a &quot;mobile first&quot; strategy because it was understood that mobile was going to be the most important market going forward. This was also understood at other large tech companies at the time and had been understood going back further than the dates of these memos. Many consumers didn't understand this, and redesigns that degraded the desktop experience in order to unify desktop and mobile experiences were a common cause of complaints at the time. But if you looked at the data on this or talked to people at big companies, it was clear that, from a business standpoint, it made sense to focus on mobile and deal with whatever fallout might happen in desktop if that allowed for greater velocity in mobile development.</p> <p>Both the BC and BE staff memos extensively reference interviews across many tech companies, including all of the &quot;<abbr title="of course this term wasn't in use at the time">hyperscalers</abbr>&quot;. It's curious that someone could have access to all of these internal documents from these companies as well as interviews and then make the argument that mobile was, at the time, not very important. And it's strange that, at least to the extent that we can know what happened from these memos, directors took both sets of arguments at face value and then decided that the BE staff case was as convincing as, or more convincing than, the BC staff case.</p> <p>That's one class of error we repeatedly see in the BC and BE staff memos: stretching data to make a case that a knowledgeable observer can plainly see is not true. 
In most cases, it's BE staff who have stretched the data to push a tenuous position as far as it can go, but there are some instances of BC staff making a case that's a stretch.</p> <p>Another class of error we see repeated, mainly in the BE memo, is taking what <abbr title="especially people familiar with the business or product side of the industry">most people in industry</abbr> would consider an obviously incorrect model of the world and then making inferences based on that. An example of this is the discussion on whether or not vertical competitors such as Yelp and TripAdvisor were or would be significantly disadvantaged by actions BC staff allege are anticompetitive. BE staff, in addition to arguing that Google's actions were actually procompetitive and not anticompetitive, argued that it would not be possible for Google to significantly harm vertical competitors because the amount of traffic Google drives to them is small, only 10% to 20% of their total traffic, going on to say &quot;the effect on traffic from Google to local sites is very small and not statistically significant&quot;. Although BE staff don't elaborate on their model of how this business works, they appear to <abbr title="or believe something equivalent to">believe</abbr> that the market is basically static. If Google removes Yelp from its listings (which they threatened to do if they weren't allowed to integrate Yelp's data into their own vertical product) or downranks Yelp to preference Google's own results, this will, at most, reduce Yelp's traffic by 10% to 20% in the long run because only 10% to 20% of traffic comes from Google.</p> <p>But even a VC or PM intern can be expected to understand that the market isn't static. What one would expect if Google can persistently take a significant fraction of search traffic away from Yelp and direct it to Google's local offerings instead is that, in the long run, Yelp will end up with very few users and become a shell of what it once was. This is exactly what happened and, as of this writing, Yelp is valued at $2B despite having a trailing P/E ratio of 24, which is a fairly low P/E for a tech company. But the P/E ratio is unsurprisingly low because it's not generally believed that Yelp can turn this around due to Google's dominant position in search as well as maps making it very difficult for Yelp to gain or retain users. This is not just obvious in retrospect; it was well understood at the time. In fact, I talked to a former colleague at Google who was working on one of a number of local features that leveraged the position that Google had and that Yelp could never reasonably attain; the expected outcome of these features was to cripple Yelp's business. Not only was it understood that this was going to happen, it was also understood that Yelp was not likely to be able to counter this due to Google's ability to leverage its market power from search and maps. It's curious that, at the time, someone would've seriously argued that cutting off Yelp's source of new users while simultaneously presenting virtually all of Yelp's then-current users with an alternative that's bundled into an app or website they already use would not significantly impact Yelp's business, but the BE memo makes that case. 
One could argue that the set of maneuvers used here is analogous to the ones done by Microsoft that were brought up in the Microsoft antitrust case, where it was alleged that a Microsoft exec said that they were going to &quot;cut off Netscape’s air supply&quot;, but the BE memo argues that the impact of having one's air supply cut off is &quot;very small and not statistically significant&quot; (after all, a typical body has blood volume sufficient to bind 1L of oxygen, much more than the oxygen normally taken in during one breath).</p> <p>Another class of, if not error, then poorly supported reasoning is relying on a <a href="cocktail-ideas/">cocktail party level of reasoning</a> when there's data or other strong evidence that can be directly applied. This happens throughout the BE memo even though, at other times, when the BC memo has some moderately plausible reasoning, the BE memo's counter is that we should not accept such reasoning and need to look at the data and not just reason about things in the abstract. The BE memo heavily leans on the concept that we must rely on data over reasoning and calls arguments from the BC memo that aren't rooted in rigorous data anecdotal, &quot;beyond speculation&quot;, etc., but the BE memo only does this in cases where knowledge or reasoning might lead one to conclude that there was some kind of barrier to competition. When the data indicates that Google's behavior creates some kind of barrier in the market, the authors of the BE memo ignore all relevant data and instead rely on reasoning over data even when the reasoning is weak and has the character of the Boies argument we referenced earlier. One could argue that the standard of evidence for pursuing an antitrust case should be stronger than the standard of evidence for not pursuing one, but if the asymmetry observed here were for that reason, the BE memo could have listed areas where the evidence wasn't strong enough without making its own weak assertions in the face of stronger evidence. An example of this is the discussion of the impact of mobile defaults.</p> <p>The BE memo argues that defaults are essentially worthless and have little to no impact, saying multiple times that users can switch with just &quot;a few taps&quot;, adding that this takes &quot;a few seconds&quot; and that, therefore, &quot;[t]hese are trivial switching costs&quot;. The most obvious and direct piece of evidence on the impact of defaults is the amount of money Google pays to retain its default status. In a 2023 antitrust action, it was revealed that Google paid Apple $26.3B to retain its default status in 2021. As of this writing, Apple's P/E ratio is 29.53. If we think of this payment as, at the margin, pure profit, a naive estimate of how much default status is worth to Apple is that it can account for something like $776B of Apple's $2.9T market cap ($26.3B times a P/E of 29.53 is roughly $776B). Or, looking at this from Google's standpoint, Google's P/E ratio is 27.49, so Google is willing to give up $722B of its $2.17T market cap. Google is willing to pay this to be the default search for something like 25% to 30% of phones in the world. This calculation is too simplistic, but there's no reasonable adjustment that could give anyone the impression that the value of being the default is as trivial as claimed by the BE memo. For reference, a $776B tech company would be the 7th most valuable publicly traded U.S. tech company and the 8th most valuable publicly traded U.S. 
company (behind Meta/Facebook and Berkshire Hathaway, but ahead of Eli Lilly). Another reference is that YouTube's ad revenue in 2021 was $28.8B. It would be difficult to argue that spending one YouTube worth of revenue, in profit, in order to retain default status makes sense if, in practice, user switching costs are trivial and defaults don't matter. If we look for publicly available numbers close to 2012 instead of 2021, in 2013, TechCrunch reported a rumor that Google was paying Apple $1B/yr for default search status and a lawsuit then revealed that Google paid Apple $1B for default search status in 2014. This is not long after these memos were written, and $1B/yr is still a non-trivial amount of money; it belies the BE memo's claim that mobile is unimportant and that defaults don't matter because user switching costs are trivial.</p> <p>It's curious, given the heavy emphasis in the BE memo on not trusting plausible reasoning and having to rely on empirical data, that BE staff appeared to make no attempt to find out how much Google was paying for its default status (a memo by a director who agrees with BE staff suggests that someone ought to check on this number, but there's no evidence that this was done and the FTC investigation was dropped shortly afterwards). Given the number of internal documents the FTC was able to obtain, it seems unlikely that the FTC would not have been able to obtain this number from either Apple or Google. But, even if it were the case that the number were unobtainable, it's prima facie implausible that defaults don't matter and switching costs are low in practice, as FTC staff could have seen if they had interviewed product-oriented engineers and PMs or looked at the history of products in tech. So, in order to make this case, BE staff had to ignore or avoid finding out how much Google was paying for default status, not talk to product-focused engineers, PMs, or leadership, and also avoid learning about the tech industry.</p> <p>One could make the case that, while defaults are powerful, companies have been able to overcome being non-default, which could lead to a debate on exactly how powerful defaults are. For example, one might argue about the impact of defaults when Google Chrome became the dominant browser and debate how much of it was due to Chrome simply being a better browser than IE, Opera, and Firefox, how much was due to blunders by Microsoft that Google is unlikely to repeat in search, how much was due to things like <a href="https://x.com/danluu/status/887724695558205440">tricking people into making Chrome default via a bundle deal with badware installers</a>, and how much was due to pressuring people into setting Chrome as default via google.com. That's an interesting discussion where a reasonable person with an understanding of the industry could take either side of the debate, unlike the claim that defaults basically don't matter at all and user switching costs are trivial in practice, which is not plausible even without access to the data on how much Google pays Apple and others to retain default status. And as of the 2020 DoJ case against Google, roughly half of Google searches occur via a default search that Google pays for.</p> <p>Another repeated error, closely related to the one above, is bringing up marketing statements, press releases, or other statements that are generally understood to be exaggerations, and relying on these as if they're meaningful statements of fact. 
For example, the BE memo states:</p> <blockquote> <p>Microsoft's public statements are not consistent with statements made to antitrust regulators. Microsoft CEO Steve Ballmer stated in a press release announcing the search agreement with Yahoo: &quot;This agreement with Yahoo! will provide the scale we need to deliver even more rapid advances in relevancy and usefulness. Microsoft and Yahoo! know there's so much more that search could be. This agreement gives us the scale and resources to create the future of search&quot;</p> </blockquote> <p>This is the kind of marketing pablum that generally accompanies an acquisition or partnership. Because this kind of meaningless statement is common across many industries, one would expect regulators, even ones with no understanding of tech, to recognize this as marketing and not give it as much or more weight as serious evidence.</p> <h2 id="a-few-interesting-tidbits">A few interesting tidbits</h2> <p>Now that we've covered the main classes of errors observed in the memos, we'll look at a tidbits from the memos.</p> <p>Between the approval of the compulsory process on June 3rd 2011 and the publication of the BC memo dated August 8th 2012, staff received 9.5M pages of documents across 2M docs and said they reviewed &quot;many thousands of these documents&quot;, so staff were only able to review a small fraction of the documents.</p> <p>Prior to the FTC investigation, there were a number of lawsuits related to the same issues, and all were dismissed, some with arguments that would, if they were taken as broad precedent, make it difficult for any litigation to succeed. In SearchKing v. Google, plaintiffs alleged that Google unfairly demoted their results but it was ruled that Google's rankings are constitutionally protected opinion and even malicious manipulation of rankings would not expose Google to liability. In Kinderstart v. Google, part of the ruling was that Google search is not <a href="https://en.wikipedia.org/wiki/Essential_facilities_doctrine">an essential facility</a> for vertical providers (such as Yelp, eBay, and Expedia). Since the memos are ultimately about legal proceedings, there is, of course, extensive discussion of Verizon v. Trinko and Aspen Skiing Co. v. Aspen Highlands Skiing Corp and the implications thereof.</p> <p>As of the writing of the BC memo, 96% of Google's $38B in revenue was from ads, mostly from search ads. The BC memo makes the case that other forms of advertising, other than social media ads, only have limited potential for growth. That's certainly wrong in retrospect. For example, video ads are a significant market. <a href="https://web.archive.org/web/20230608081933/https://www.hollywoodreporter.com/business/digital/youtube-ad-revenue-tops-8-6b-beating-netflix-in-the-quarter-1235085391/">YouTube's ad revenue was $28.8B in 2021</a> (a bit more than what Google pays to Apple to retain default search status), Twitch supposedly generated another $2B-$3B in video revenue, and a fair amount of video ad revenue goes directly from sponsors to streamers without passing through YouTube and Twitch, e.g., <a href="https://x.com/danluu/status/1588318512258650112">the #137th largest streamer on Twitch was offered $10M/yr stream online gambling for 30 minutes a day, and he claims that the #42 largest streamer, who he personally knows, was paid $10M/mo from online gambling sponsorships</a>. And this isn't just apparent in retrospect — even at the time, there were strong signs that video would become a major advertising market. 
It happens that those same signs also showed that Google was likely to dominate the market for video ads, but it's still the case that the specific argument here was overstated.</p> <p>In general, the BC memo seems to overstate the expected primacy of search ads as well as how distinct a market search ads are, claiming that other online ad spend is not a substitute in any way and, if anything, is a complement. Although one might be able to reasonably argue that search ads are a somewhat distinct market and the elasticity of substitution is low once you start moving a significant amount of your ad spend away from search, the degree to which the BC memo makes this claim is a stretch. Search ads and other ad budgets being complements and not substitutes is a very different position from what I've heard when talking to people about how ad spend is allocated in practice. Perhaps one can argue that it makes sense to try to make a strong case here in light of Person v. Google, where Judge Fogel of the Northern District of California criticized the plaintiff's market definition, finding no basis for distinguishing &quot;search advertising market&quot; from the larger market for internet advertising, which likely foreshadows an objection that would be raised in any future litigation. However, as someone who's just trying to understand the facts of the matter at hand and the veracity of the arguments, the argument here seems dubious.</p> <p>For Google's integrated products like local search and product search (formerly Froogle), the BC memo claims that if Google treated its own properties like other websites, the products wouldn't be ranked, and that Google artificially placed its own vertical properties above the organic results of competitors. The webspam team declined to include Froogle results because the results were exactly the kind of thing that Google removes from the index for being spammy, saying &quot;[o]ur algorithms specifically look for pages like these to either demote or remove from the index&quot;. Bill Brougher, product manager for web search, said &quot;Generally we like to have the destination pages in the index, not the aggregated pages. So if our local pages are lists of links to other pages, it's more important that we have the other pages in the index&quot;. After the webspam team was overruled and the results were inserted, the ads team complained that the less-clicked (and implied to be lower-quality) results would lead to a loss of $154M/yr. The response to this essentially contained the same content as the BC memo's argument on the importance of scale and why Google's actions to deprive competitors of scale are costly:</p> <blockquote> <p>We face strong competition and must move quickly.
Turning down <abbr title="'OneBoxes' were used to put vertical content above Google's SERP">onebox</abbr> would hamper progress as follows - Ranking: Losing click data harms ranking; [t]riggering: Losing CTR and google.com query distribution data harms triggering accuracy; [c]omprehensiveness: Losing traffic harms merchant growth and therefore comprehensiveness; [m]erchant cooperation: Losing traffic reduces effort merchants put into offer data, tax, &amp; shipping; PR: Turning off onebox reduces Google's credibility in commerce; [u]ser awareness: Losing shopping-related UI on google.com reduces awareness of Google's shopping features</p> </blockquote> <p>Normally, <a href="https://en.wikipedia.org/wiki/Click-through_rate">CTR</a> is used as a strong signal to rank results, but this would've resulted in a low ranking for Google's own vertical properties, so &quot;Google used occurrence of competing vertical websites to automatically boost the ranking of its own vertical properties above that of competitors&quot; — if a comparison shopping site was relevant, Google would insert Google Product Search above any rival, and if a local search site like Yelp or CitySearch was relevant, Google automatically returned Google Local at the top of the <a href="https://en.wikipedia.org/wiki/Search_engine_results_page">SERP</a>.</p>
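<p>As a concrete illustration of the mechanism described above, here's a heavily simplified sketch in Python. This is my own reconstruction of the behavior as the BC memo describes it, not Google's actual ranking code, and all of the names and data structures are invented; the point is just that the relevance of a rival vertical, rather than the usual CTR signal, is what triggers the placement of Google's own property.</p> <pre><code># Hypothetical sketch of the trigger-based self-preferencing described in the
# BC memo. Not real Google code; the mapping and the result format are invented.
RIVAL_TO_GOOGLE_PROPERTY = {
    'nextag.com': 'Google Product Search',
    'yelp.com': 'Google Local',
    'citysearch.com': 'Google Local',
}

def rank_results(results):
    # results: list of (domain, ctr_score) pairs for a query
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    for position, (domain, _) in enumerate(ranked):
        google_property = RIVAL_TO_GOOGLE_PROPERTY.get(domain)
        if google_property is not None:
            # A competing vertical is relevant to this query, so insert
            # Google's own vertical directly above the highest-ranked rival,
            # bypassing the CTR signal that would normally determine placement.
            ranked.insert(position, (google_property, None))
            break
    return ranked
</code></pre>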
<p>Additionally, in order to seed content for Google local results, Google took Yelp content and integrated it into Google Places. When Yelp observed this was happening, they objected to this and Google threatened to ban Yelp from traditional Google search results and further threatened to ban any vertical provider that didn't allow its content to be used in Google Places. Marissa Mayer testified that it was, from a technical standpoint, extraordinarily difficult to remove Yelp from Google Places without also removing Yelp from traditional organic search results. But when Yelp sent a cease and desist letter, Google was able to remove Yelp results immediately, seemingly indicating that it was less difficult than claimed. Google then claimed that it was technically infeasible to remove Yelp from Google Places without removing Yelp from the &quot;local merge&quot; interface on SERP. BC staff believe this claim is false as well, and Marissa Mayer later admitted in a hearing that this claim was false and that Google was concerned about the consequences of allowing sites to opt out of Google Places while staying in &quot;local merge&quot;. There was also a very similar story with Amazon results and product search. As noted above, the BE memo's counterargument to all of this is that Google traffic is &quot;very small and not statistically significant&quot;.</p> <p>The BC memo claims that the activities above both reduced the incentives of companies like Yelp, CitySearch, Amazon, etc., to invest in the area and also reduced the incentives for new companies to form in this area. This seems true. In addition to the evidence presented in the BC memo (which goes beyond what was summarized above), if you just talked to founders looking for an idea or VCs around the time of the FTC investigation, there had already been a real movement away from founding and funding companies like Yelp because it was understood that Google could seriously cripple any similar company in this space by cutting off its air supply.</p> <p>We'll defer to the appendix the BC memo's discussion of the AdWords API restrictions that specifically disallow programmatic porting of campaigns to other platforms, such as Bing. But one interesting bit there is that <a href="https://mastodon.social/@danluu/111243272325539288">Google was apparently aware of the legal sensitivity of this matter, so meeting notes and internal documentation on the topic are unusually incomplete</a>. For one meeting, the most informative written record BC staff were able to find apparently consists of a message from Director of PM Richard Holden to SVP of ads Susan Wojcicki which reads, &quot;We didn't take notes for obvious reasons hence why I'm not elaborating too much here in email but happy to brief you more verbally&quot;.</p> <p>We'll also defer a detailed discussion of the BC memo comments on Google's exclusive and restrictive syndication agreements to the appendix, except for a couple of funny bits. One is that Google claims they were unaware of the terms and conditions in their <abbr title="standard agreements, as opposed to their bespoke negotiated agreements with large partners">standard online service agreements</abbr>. In particular, the terms and conditions contained a &quot;preferred placement&quot; clause, which a number of parties believe is a de facto exclusivity agreement. When FTC staff questioned Google's VP of search services about this term, the VP claimed they were not aware of this term. Afterwards, Google sent a letter to Barbara Blank of the FTC explaining that they were removing the preferred placement clause in the standard online agreement.</p> <p>Another funny bit involves Google's market power and how it allowed them to collect an increasingly large share of revenue for themselves and decrease the revenue share their partners received. Only a small number of Google's customers who were impacted by this found this concerning. Those that did find it concerning were some of the largest and most sophisticated customers (such as Amazon and IAC); their concern was that Google's restrictive and exclusive provisions would increase Google's dominance over Bing/Microsoft and allow them to dictate worse terms to customers. Even as Google was executing a systematic strategy to reduce revenue share to customers, which could only be possible due to their dominance of the market, most customers appeared to either not understand the long-term implications of Google's market power in this area or the importance of the internet.</p> <p>For example, Best Buy didn't find this concerning because Best Buy viewed their website and the web as a way for customers to find presale information before entering a store, and Walmart didn't find this concerning because they viewed the web as an extension to brick and mortar retail. It seems that the same lack of understanding of the importance of the internet which led Walmart and Best Buy to express their lack of concern over Google's dominance here also led to these retailers, which previously had a much stronger position than Amazon, falling greatly behind in both online and overall profit. Walmart later realized its error here and acquired Jet.com for $3.3B in 2016 and also seriously (relative to other retailers) funded programmers to do serious tech work inside Walmart.
Since Walmart started taking the internet seriously, it's made a substantial comeback online and has averaged a 30% <a href="https://en.wikipedia.org/wiki/Compound_annual_growth_rate">CAGR</a> in online net sales since 2018, but taking two decades to mount a serious response to Amazon's online presence has put Walmart solidly behind Amazon in online retail despite nearly a decade of serious investment and Best Buy has still not been able to mount an effective response to Amazon after three decades.</p> <p>The BE memo uses the lack of concern on the part of most customers as evidence that the exclusive and restrictive conditions Google dictated here were not a problem but, in retrospect, it's clear that it was only a lack of understanding of the implications of online business that led customers to be unconcerned here. And when the BE memo refers to the customers who understood the implications here as sophisticated, that's relative to people in lines of business where leadership tended to not understand the internet. While these customers are sophisticated by comparison to a retailer that took two decades to mount a serious response to the threat Amazon poses to their business, if you just talked to people in the tech industry at the time, you wouldn't need to find a particularly sophisticated individual to find someone who understood what was going on. It was generally understood that retail revenue and even moreso, retail profit was going to move online, and you'd have to find someone who was extremely unusually out of the loop to find someone who didn't at least roughly understand the implications here.</p> <p>There's a lengthy discussion on search and scale in both the BC and BE memos. On this topic, the BE memo seems wrong and the implications of the BC memo are, if not subtle, at least not obvious. Let's start with the BE memo because that one's simpler to discuss, although we'll very briefly discuss the argument in the BC memo in order to frame the discussion in the BE memo. A rough sketch of the argument in the BC memo is that there are multiple markets (search, ads) where scale has a significant impact on product quality. Google's own documents acknowledge this &quot;virtuous cycle&quot; where having more users lets you serve better ads, which gives you better revenue for ads and, likewise in search, having more scale gives you more data which can be used to improve results, which leads to user growth. And for search in particular, the BC memo claims that click data from users is of high importance and that more data allows for better results.</p> <p>The BE memo claims that this is not really the case. On the importance of click data, the BE memo raises two large objections. First, that this is &quot;contrary to the history of the general search market&quot; and second, that &quot;it is also contrary to the evidence that factors such as the quality of the web crawler and web index; quality of the search algorithm; and the type of content included in the search results [are as important or more important].</p> <p>Of the first argument, the BE memo elaborates with a case that's roughly &quot;Google used to be smaller than it is today, and the click data at the time was sufficient, therefore being as large as Google used to be means that you have sufficient click data&quot;. Independent of knowledge of the tech industry, this seems like a strange line of reasoning. 
&quot;We now produce a product that's 1/3 as good as our competitor for the same price, but that should be fine because our competitor previously produced a product that's 1/3 as good as their current product when the market was less mature and no one was producing a better product&quot; is generally not going to be a winning move. That's especially true in markets where there's a virtuous cycle between market share and product quality, like in search.</p> <p>The second argument also seems like a strange one to make even without knowledge of the tech industry, in that it's a classic fallacious argument. It's analogous to saying something like &quot;the BC memo claims that it's important for cars to have a right front tire, but that's contrary to evidence that it's at least as important for a car to have a left front tire and a right rear tire&quot;. The argument is even less plausible if you understand tech, especially search. Calling out the quality of the search algorithm as distinct doesn't feel quite right because scale and click data directly feed into algorithm development (and this is discussed at some length in the BE memo — the authors of the BC memo surely had access to the same information and, from their writing, seem to have had access to the argument). And as someone who's <a href="bitfunnel-sigir.pdf">worked on search indexing</a>, as much as I'd like to agree with the BE memo and say that indexing is as important or more important than ranking, I have to admit that indexing is an easier and less important problem than ranking, and likewise for crawling vs. ranking. This was generally understood at the time so, given the number of interviews FTC staff did, the authors of the BE memo should've known this as well. Moreover, given the &quot;history of the general search market&quot; which the BE memo refers to, even without talking to engineers, this should've been apparent.</p> <p>For example, Cuil was famous for building a larger index than Google. While that's not a trivial endeavor, at the time, quite a few people had the expertise to build an index that rivaled Google's index in raw size or whatever other indexing metric you prefer, if given enough funding for a serious infra startup. Cuil and other index-focused attempts failed because having a large index without good search ranking is worth little. While it's technically true that having good ranking with a poor index is also worth little, this is not something we've really seen in practice because ranking is the much harder problem and a company that's competent to build a good search ranker will, as a matter of course, have a good enough index and good enough crawling.</p> <p>As for the case in the BC memo, I don't know what the implications should be. The BC memo correctly points out that increased scale greatly improves search quality, that the extra data Bing got from the Yahoo deal greatly increased search quality and increased CTR, that further increased scale should be expected to continue to provide high return, that the costs of creating a competitor to Google are high (Bing was said to be losing $2B/yr at the time and was said to be spending $4.5B/yr &quot;developing its algorithms and building the physical capacity necessary to operate Bing&quot;), and that Google undertook actions that might be deemed anticompetitive which disadvantaged Bing compared to the counterfactual world where Google did not take those actions, and they make a similar case for ads.
However, despite the strength of the stated BC memo case and the incorrectness of the stated BE memo case, the BE memo's case is correct in spirit, in that there are actions Microsoft could've taken but did not in order to compete much more effectively in search, and one could argue that the FTC shouldn't be in the business of rescuing a company from competing ineffectively.</p> <p>Personally, I don't think it's too interesting to discuss the position of the BC memo vs. the BE memo at length because the positions the BE memo takes seem extremely weak. It's not fair to call it a straw man because it's a real position, and one that carried the day at the FTC, but the decision to take action or not seemed more about philosophy than the arguments in the memos. But we can discuss what else might've been done.</p> <h2 id="what-might-ve-happened">What might've happened</h2> <p>What happened after the FTC declined to pursue antitrust action was that Microsoft effectively defunded Bing as a serious bet, taking resources that could've gone to continuing to fund a very expensive fight against Google, and moving them to other bets that it deemed to be higher ROI. The big bets Microsoft pursued were Azure, Office, and HoloLens (and arguably Xbox). HoloLens was a pie-in-the-sky bet, but Azure and Office were lines of business where, instead of fighting an uphill battle against a competitor that can use its dominance in related markets to push others around, Microsoft could fight downhill battles where it can use its own dominance in related markets to push around competitors, resulting in a much higher return per dollar invested. As someone who worked on Bing and thought that Bing had the potential to seriously compete with Google given sustained, unprofitable, heavy investment, I find that disappointing but also likely the correct business decision. If you look at any particular submarket, like Teams vs. Slack, the Microsoft product doesn't need to be nearly as good as the competing product to take over the market, which is the opposite of the case in search, where Google's ability to push competitors around means that Bing would have to be much better than Google to attain marketshare parity.</p> <p>Based on their public statements, Biden's DoJ Antitrust AAG appointee, Jonathan Kanter, would argue for pursuing antitrust action under the circumstances, as would Biden's FTC commissioner and chair appointee Lina Khan. Prior to her appointment as FTC commissioner and chair, Khan was probably best known for writing <a href="https://www.yalelawjournal.org/note/amazons-antitrust-paradox">Amazon's Antitrust Paradox</a>, which has been influential as well as controversial. Obama appointees, who more frequently agreed with the kind of reasoning from the BE memo, would have argued against antitrust action, and the investigation under discussion was stopped on their watch. More broadly, they argued against the philosophy driving Kanter and Khan. Obama's FTC Commissioner appointee, <a href="https://en.wikipedia.org/wiki/Joshua_D._Wright">GMU economist and legal scholar Josh Wright</a>, actually wrote a rebuttal titled &quot;Requiem for a Paradox: The Dubious Rise and Inevitable Fall of Hipster Antitrust&quot;, a scathing critique of Khan's position.</p> <p>If, in 2012, the FTC and DoJ were run by Biden appointees instead of Obama appointees, what difference would that have made?
We can only speculate, but one possibility would be that they would've taken action and then lost, as happened with the recent cases against Meta and Microsoft, which seem like they would not have been undertaken under an Obama FTC and DoJ. Under Biden appointees, there's been much more vigorous use of the laws that are on the books (the Sherman Act, the Clayton Act, the FTC Act, the Robinson–Patman Act, as well as &quot;smaller&quot; antitrust laws), but the opinion of the courts hasn't changed under Biden and this has led to a number of unsuccessful antitrust cases in tech. Both the BE and BC memos dedicate significant space to whether or not a particular line of reasoning will hold up in court. Biden's appointees are much less concerned with this than previous appointees, and multiple people in the DoJ and the FTC are on the record saying things like &quot;it is our duty to enforce the law&quot;, meaning that when they see violations of the antitrust laws that were put into place by elected officials, it's their job to pursue these violations even if courts may not agree with the law.</p> <p>Another possibility is that there would've been some action, but the action would've been in line with most corporate penalties we see: something like a small fine that costs the company an insignificant fraction of the marginal profit they made from their actions, or some kind of consent decree (basically a cease and desist), where the company would be required to stop doing specific things while keeping its marketshare, i.e., keeping the main thing it wanted to gain, a massive advantage in a market dominated by network effects. Perhaps there would be a few more meetings where &quot;[w]e didn't take notes for obvious reasons&quot; to work around the new limitations, and business as usual would continue. Given the specific allegations in the FTC memos and the attitudes of the courts at the time, my guess is that something like this second set of possibilities would've been the most likely outcome had the FTC proceeded with their antitrust investigation instead of dropping it, some kind of nominal victory that makes little to no difference in practice. Given how long it takes for these cases to play out, it's overwhelmingly likely that Microsoft would've already scaled back its investment in Bing and moved Bing from a subsidized bet it was trying to grow to a profitable business it wanted to keep by the time any decision was made. There are a number of cases that were brought by other countries which had remedies that were in line with what we might've expected if the FTC investigation continued. On Google using market power in mobile to push software Google wants onto nearly all Android phones, an EU case was nominally successful but made little to no difference in practice. Cristina Caffarra of the Centre for Economic Policy Research characterized this as</p> <blockquote> <p>Europe has failed to drive change on the ground. Why? Because we told them, don't do it again, bad dog, don't do it again. But in fact, they all went and said 'ok, ok', and then went out, ran back from the back door and did it again, because they're smarter than the regulator, right? And that's what happens.</p> <p>So, on the tying case, in Android, the issue was, don't tie again so they say, &quot;ok, we don't tie&quot;. Now we got a new system. If you want Google Play Store, you pay $100. But if you want to put search in every entry point, you get a discount of $100 ...
the remedy failed, and everyone else says, &quot;oh, that's a nice way to think about it, very clever&quot;</p> </blockquote> <p>Another pair of related cases are Yandex's Russian case on mobile search defaults and a later EU consent decree. In 2015, Yandex brought a suit about mobile default status on Android in Russia, which was settled by adding a &quot;choice screen&quot; which has users pick their search engine without preferencing a default. This immediately caused Yandex to start gaining marketshare on Google <a href="https://gs.statcounter.com/search-engine-market-share/all/russian-federation/#monthly-200901-202404">and Yandex eventually surpassed Google in marketshare in Russia according to statcounter</a>. In 2018, the EU required a similar choice screen in Europe, <a href="https://gs.statcounter.com/search-engine-market-share/all/europe/#monthly-201501-202404">which didn't make much of a difference</a>, except maybe sort of in the Czech republic. There are a number of differences between the situation in Russia and in the EU. One, arguably the most important, is that when Yandex brought the case against Google in Russia, Yandex was still fairly competitive, with marketshare in the high 30% range. At the time of the EU decision in 2018, Bing was the #2 search engine in Europe, with about 3.6% marketshare. Giving consumers a choice when one search engine completely dominates the market can be expected to have fairly little impact. One argument the BE memo heavily relies on is the idea that, if we intervene in any way, that could have bad effects down the line, so we should be very careful and probably not do anything, just in case. But in these winner-take-most markets with such strong network effects, there's a relatively small window in which you can cheaply intervene. Perhaps, and this is highly speculative, if the FTC required a choice screen in 2012, Bing would've continued to invest enough to at least maintain its marketshare against Google.</p> <p>For verticals, in shopping, the EU required some changes to how Google presents results in 2017. This appears to have had little to no impact, being both perhaps 5-10 years too late and also a trivial change that wouldn't have made much difference even if enacted a decade earlier. The 2017 ruling came out of a case that started in 2010, and in the 7 years it took to take action, Google managed to outcompete its vertical competitors, making them barely relevant at best.</p> <p>Another place we could look is at the Microsoft antitrust trial. That's a long story, at least as long as this document, but to very briefly summarize, in 1990, the FTC started an investigation over Microsoft's allegedly anticompetitive conduct. A vote to continue the investigation ended up in a 2-2 tie, causing the investigation to be closed. The DoJ then did its own investigation, which led to a consent decree that was generally considered to not be too effective. There was then a 1998 suit by the DoJ about Microsoft's use of monopoly power in the browser market, which initially led to a decision to break Microsoft up. But, on appeal, the breakup was overturned, which led to a settlement in 2002. A major component of the 1998 case was about browser bundling and Microsoft's attack on Netscape. By the time the case was settled, in 2002, Netscape was effectively dead. 
The parts of the settlements having to do with interoperability were widely regarded as ineffective at the time, not only because Netscape was dead, but because they weren't going to be generally useful. A number of economists took the same position as the BE memo, that no intervention should've happened at the time and that any intervention is dangerous and could lead to a fettering of innovation. Nobel Prize winning economist Milton Friedman wrote a Cato Policy Forum essay titled &quot;The Business Community's Suicidal Impulse&quot;, predicting that tech companies calling for antitrust action against Microsoft were committing suicide, and that a critical threshold had been passed and that this would lead to the bureaucratization of Silicon Valley</p> <blockquote> <p>When I started in this business, as a believer in competition, I was a great supporter of antitrust laws; I thought enforcing them was one of the few desirable things that the government could do to promote more competition. But as I watched what actually happened, I saw that, instead of promoting competition, antitrust laws tended to do exactly the opposite, because they tended, like so many government activities, to be taken over by the people they were supposed to regulate and control. And so over time I have gradually come to the conclusion that antitrust laws do far more harm than good and that we would be better off if we didn’t have them at all, if we could get rid of them. But we do have them.</p> <p>Under the circumstances, given that we do have antitrust laws, is it really in the self-interest of Silicon Valley to set the government on Microsoft? ... you will rue the day when you called in the government. From now on the computer industry, which has been very fortunate in that it has been relatively free of government intrusion, will experience a continuous increase in government regulation. Antitrust very quickly becomes regulation. Here again is a case that seems to me to illustrate the suicidal impulse of the business community.</p> </blockquote> <p>In retrospect, we can see that this wasn't correct and, if anything, was the opposite of correct. On the idea that even attempting antirust action against Microsoft would lead to an inevitable increase in government intervention, we saw the opposite, a two-decade long period of relatively light regulation and antitrust activity. And in terms of the impacts on innovation, although the case against Microsoft was too little and too late to save Netscape, Google's success appears to be causally linked to the antitrust trial. At one point, in the early days of Google, when Google had no market power and Microsoft effectively controlled how people access the internet, Microsoft internally discussed proposals aimed at killing Google. One proposal involved redirecting users who tried to navigate to Google to Bing (at the time, called MSN Search, and of course this was before Chrome existed and IE dominated the browser market). Another idea was to put up a big scary warning that warned users that Google was dangerous, much like the malware warnings browsers have today. Gene Burrus, a lawyer for Microsoft at the time, stated that Microsoft chose not to attempt to stop users from navigating to google.com due to concerns about further antitrust action after they'd been through nearly a decade of serious antitrust scrutiny. 
People at both Google and Microsoft who were interviewed about this both believe that Microsoft would've killed Google had they done this so, in retrospect, we can see that Milton Friedman was wrong about the impacts of the Microsoft antitrust investigations and that one can make the case that it's only because of the antitrust investigations that web 1.0 companies like Google and Facebook were able to survive, let alone flourish.</p> <p>Another possibility is that a significant antitrust action would've been undertaken, been successful, and been successful quickly enough to matter. It's possible that, by itself, a remedy wouldn't have changed the equation for Bing vs. Google, but if a reasonable remedy was found and enacted, it still could've been in time to keep Yelp and other vertical sites as serious concerns and maybe even spur more vertical startups. And in the hypothetical universe where people with the same philosophy as Biden's appointees were running the FTC and the DoJ, we might've also seen antitrust action against Microsoft in markets where they can leverage their dominance in adjacent markets, making Bing a more appealing area for continued heavy investment. Perhaps that would've resulted in Bing being competitive with Google and the aforementioned concerns that &quot;sophisticated customers&quot; like Amazon and IAC had may not have come to pass. With antitrust against Microsoft and other large companies that can use their dominance to push competitors around, perhaps Slack would still be an independent product and we'd see more startups in enterprise tools (<a href="https://x.com/carnage4life/status/1774800837598216383">a number of commenters believe that Slack was basically forced into being acquired because it's too difficult to compete with Teams given Microsoft's dominance in related markets</a>). And Slack continuing to exist and innovate is small potatoes — the larger hypothetical impact would be all of the new startups and products that would be created that no one even bothers to attempt because they're concerned that a behemoth with an integrated bundle like Microsoft would crush their standalone product. If you add up all of these, if not best-case, at least very-good-case outcomes for antitrust advocates, one could argue that consumers and businesses would be better off. But, realistically, it's hard to see how this very-good-case set of outcomes could have come to pass.</p> <p>Coming back to the FTC memo, if we think about what it would take to put together a set of antitrust actions that actually fosters real competition, that seems extraordinarily difficult. A number of the more straightforward and plausible sounding solutions are off the table for political reasons, due to legal precedent, or due to arguments like the Boies argument we referenced or some of the arguments in the BE memo that are clearly incorrect, but appear to be convincing to very important people.</p> <p>For the solutions that seem to be on the table, weighing the harms caused by them is non-trivial. For example, let's say the FTC mandated a mobile and desktop choice screen in 2012. This would've killed Mozilla in fairly short order unless Mozilla completely changed its business model because Mozilla basically relies on payments from Google for default status to survive. 
We've seen with Opera that even when you have a superior browser that introduces features that other browsers later copy, which has better performance than other browsers, etc., you can't really compete with free browsers when you have a paid browser. So then we would've quickly been down to IE/Edge and Chrome. And in terms of browser engines, just Chrome after not too long as Edge is now running Chrome under the hood. Maybe we can come up with another remedy that allows for browser competition as well, but the BE memo isn't wrong to note that antitrust remedies can cause other harms.</p> <p>Another example which highlights the difficulty of crafting a politically suitable remedy are the restrictions the Bundeskartellamt imposed against Facebook, which have to do with user privacy and use of data (for personalization, ranking, general ML training, etc.), which is considered an antitrust issue in Germany. Michal Gal, Professor and Director of the Forum on Law and Markets at the University of Haifa pointed out that, of course Facebook, in response to the rulings, is careful to only limit its use of data if Facebook detects that you're German. If the concern is that ML models are trained on user data, this doesn't do much to impair Facebook's capability. Hypothetically, if Germany had a tech scene that was competitive with American tech and German companies were concerned about a similar ruling being leveled against them, this would be disadvantageous to nascent German companies that initially focus on the German market before expanding internationally. For Germany, this is only a theoretical concern as, other than SAP, no German company has even approached the size and scope of large American tech companies. But when looking at American remedies and American regulation, this isn't a theoretical concern, and some lawmakers will want to weigh the protection of American consumers against the drag imposed on American firms when compared to Korean, Chinese, and other foreign firms that can grow in local markets with fewer privacy concerns before expanding to international markets. This concern, if taken seriously, could be used to argue against nearly any pro-antitrust action argument.</p> <h2 id="what-can-we-do-going-forward">What can we do going forward?</h2> <p>This document is already long enough, so we'll defer a detailed discussion of policy specifics for another time, but in terms of high-level actions, one thing that seems like it would be helpful is to have tech people intimately involved in crafting remedies and regulation as well as during investigations<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>. From the directors memos on the 2011-2021 FTC investigation that are publicly available, it would appear this was not done because the arguments from the BE memos that wouldn't pass the sniff test for a tech person appear to have been taken seriously. Another example is the one EU remedy that Cristina Caffara noted was immediately worked around by Google, in a way that many people in tech would find to be a delightful &quot;hack&quot;.</p> <p>There's a long history of this kind of &quot;hacking the system&quot; being lauded in tech going back to before anyone called it &quot;tech&quot; and it was just physics and electrical engineering. 
To pick a more recent example, one of the reasons Sam Altman become President of Y Combinator, which eventually led to him becoming CEO of Open AI was that Paul Graham admired his ability to hack systems; in his 2010 essay on founders, under the section titled &quot;Naughtiness&quot;, Paul wrote:</p> <blockquote> <p>Though the most successful founders are usually good people, they tend to have a piratical gleam in their eye. They're not Goody Two-Shoes type good. Morally, they care about getting the big questions right, but not about observing proprieties. That's why I'd use the word naughty rather than evil. They delight in breaking rules, but not rules that matter. This quality may be redundant though; it may be implied by imagination.</p> <p>Sam Altman of Loopt is one of the most successful alumni, so we asked him what question we could put on the Y Combinator application that would help us discover more people like him. He said to ask about a time when they'd hacked something to their advantage—hacked in the sense of beating the system, not breaking into computers. It has become one of the questions we pay most attention to when judging applications.</p> </blockquote> <p>Or, to pick one of countless examples from Google, in order to reduce travel costs at Google, Google engineers implemented a system where they computed some kind of baseline &quot;expected cost for flights, and then gave people a credit for taking flights that came in under the baseline costs that could be used to upgrade future flights and travel accommodations. This was a nice experience for employees compared to what stodgier companies were doing in terms of expense limits and Google engineers were proud of creating a system that made things better for everyone, which was one kind of hacking the system. The next level of hacking the system was when some employees optimized their flights and even set up trips to locations that were highly optimizable (many engineers would consider this a fun challenge, a variant of classic dynamic programming problems that are given in interviews, etc.), allowing them to upgrade to first class flights and the nicest hotels.</p> <p>When I've talked about this with people in management in traditional industries, they've frequently been horrified and can't believe that these employees weren't censured or even fired for cheating the system. But when I was at Google, people generally found this to be admirable, as it exemplified the hacker spirit.</p> <p>We can see, from the history of antitrust in tech going back at least two decades, that courts, regulators, and legislators have not been prepared for the vigor, speed, and delight with which tech companies hack the system.</p> <p>And there's precedent for bringing in tech folks to work on the other side of the table. For example, this was done in the big Microsoft antitrust case. But there are incentive issues that make this difficult at every level that stem from, among other things, the sheer amount of money that tech companies are willing to pay out. If I think about tech folks I know who are very good at the kind of hacking the system described here, the ones who want to be employed at big companies frequently make seven figures (or more) annually, a sum not likely to be rivaled by an individual consulting contract with the DoJ or FTC. 
If we look at the example of Microsoft again, the tech group that was involved was managed by Ron Schnell, who was taking a break from working after his third exit, but people like that are relatively few and far between. Of course there are people who don't want to work at big companies for a variety of reasons, often moral reasons or a dislike of big company corporate politics, but most people I know who fit that description haven't spent enough time at big companies to really understand the mechanics of how big companies operate and are the wrong people for this job even if they're great engineers and great hackers.</p> <p><abbr title="Sorry, I didn't take notes on this and can't recall specifically who said this or even which conference, although I believe it was at the University of Chicago">At an antitrust conference a while back</abbr>, a speaker noted that the mixing and collaboration between the legal and economics communities was a great boon for antitrust work. Notably absent from the speech as well as the conference were practitioners from industry. The conference had the feel of an academic conference, so you might see CS academics at the conference some day, but even if that were to happen, many of the policy-level discussions are ones that are outside the area of interest of CS academics. For example, one of the arguments from the BE memo that we noted as implausible was the way they used MAU to basically argue that switching costs were low. That's something outside the area of research of almost every CS academic, so even if the conference were to expand and bring in folks who work closely with tech, the natural attendees would still not be the right people to weigh in on the topic when it comes to the plausibility of nitty gritty details.</p> <p>Besides the aforementioned impact on policy discussions, the lack of collaboration with tech folks also meant that, when people spoke about the motives of actors, they would often make assumptions that were unwarranted. On one specific example of what someone might call a hack of the system, the speaker described an exec's reaction (high-fives, etc.), and inferred a contempt for lawmakers and the law that was not in evidence. It's possible the exec in question does, in fact, have a contempt and disdain for lawmakers and the law, but that celebration is exactly what you might've seen after someone at Google figured out how to get upgraded to first class &quot;for free&quot; on almost all their flights by hacking the system at Google, which wouldn't indicate contempt or disdain at all.</p> <p>Coming back to the incentive problem, it goes beyond getting people who understand tech on the other side of the table in antitrust discussions. If you ask Capitol Hill staffers who were around at the time, the general belief is that the primary factor that scuttled the FTC investigation was Google's lobbying, and of course <a href="https://x.com/danluu/status/1254720673592688641">Google and other large tech companies spend more on lobbying</a> than entities that are interested in increased antitrust scrutiny.</p> <p>And in the civil service, if we look at the lead of the BC investigation and the first author on the BC memo, they're now Director and Associate General Counsel of Competition and Regulatory Affairs at Facebook. I don't know them, so I can't speak to their motivations, but if I were offered as much money as I expect they make to work on antitrust and other regulatory issues at Facebook, I'd probably take the offer. 
Even putting aside the pay, if I were a strong believer in the goals of increased antitrust enforcement, that would still be a very compelling offer. Working for the FTC, maybe you lead another investigation where you write a memo that's much stronger than the opposition memo, which doesn't matter when a big tech company pours more lobbying money into D.C. and the investigation is closed. Or maybe your investigation leads to an outcome like the EU investigation that led to a &quot;choice screen&quot; that was too little and far too late. Or maybe it leads to something like the Android Play Store untying case where, seven years after the investigation was started, an enterprising Google employee figures out a &quot;hack&quot; that makes the consent decree useless in about five minutes. At least inside Facebook, you can nudge the company towards what you think is right and have some impact on how Facebook treats consumers and competitors.</p> <p>Looking at it from the standpoint of people in tech (as opposed to people working in antitrust), in my extended social circles, it's common to hear people say &quot;I'd never work at company X for <a href="https://www.patreon.com/posts/103707012">moral reasons</a>&quot;. That's a fine position to take, but almost everyone I know who does this ends up working at a much smaller company that has almost no impact on the world. If you want to take a moral stand, you're more likely to make a difference by working from the inside or finding a smaller direct competitor and helping it become more successful.</p> <p><i>Thanks to Laurence Tratt, Yossi Kreinin, Justin Hong, kouhai@treehouse.systems, Sophia Wisdom, @cursv@ioc.exchange, @quanticle@mastodon.social, and Misha Yagudin for comments/corrections/discussion</i></p> <h1 id="appendix-non-statements">Appendix: non-statements</h1> <p>This is analogous to the &quot;non-goals&quot; section of a technical design doc, but weaker, in that a non-goal in a design doc is often a positive statement that implies something that couldn't be implied from reading the doc, whereas the non-goal <abbr title="statements only, not explanations">statements</abbr> themselves don't add any information.</p> <ul> <li>Antitrust action against Google should have been pursued in 2012 <ul> <li>Not that anyone should care what my opinion is, but if you'd asked me at the time if antitrust action should be pursued, I would've said &quot;probably not&quot;. The case for antitrust action seems stronger now and the case against seems weaker, but you could still mount a fairly strong argument against antitrust action today.</li> <li>Even if you believe that, ceteris paribus, antitrust action would've been good for consumers and the &quot;very good case&quot; outcome in &quot;what might've happened&quot; would occur if antitrust action were pursued, it's still not obvious that Google and other tech companies are the right target as opposed to (just for example) Visa and Mastercard's dominance of payments, hospital mergers leading to increased concentration that's had negative impacts on both consumers and workers, Ticketmaster's dominance, etc. Or perhaps you think the government should focus on areas where regulation specifically protects firms, such as in shipping (which is exempt from the Sherman Act) or car dealerships (which have special protections in the law in many U.S.
states that prevent direct sales and compel car companies to abide by their demands in certain ways), etc.</li> </ul></li> <li>Weaker or stronger antitrust measures should be taken today <ul> <li>I don't think I've spent enough time reading up on the legal, political, historical, and philosophical background to have an opinion on what should be done, but I know enough about tech to point out a few errors that I've seen and to call out common themes in these errors.</li> </ul></li> </ul> <h1 id="bc-staff-memo">BC Staff Memo</h1> <p>By &quot;Barbara R. Blank, Gustav P. Chiarello, Melissa Westman-Cherry, Matthew Accornero, Jennifer Nagle, Anticompetitive Practices Division; James Rhilinger, Healthcare Division; James Frost, Office of Policy and Coordination; Priya B. Viswanath, Office of the Director; Stuart Hirschfeld, Danica Noble, Northwest Region; Thomas Dahdouh, Western Region-San Francisco, Attorneys; Daniel Gross, Robert Hilliard, Catherine McNally, Cristobal Ramon, Sarah Sajewski, Brian Stone, Honors Paralegals; Stephanie Langley, Investigator&quot;</p> <p>Dated August 8, 2012</p> <h2 id="executive-summary">Executive Summary</h2> <ul> <li>Google is dominant search engine and seller of search ads</li> <li>This memo addresses 4 of 5 areas with anticompetitive conduct; mobile is in a supplemental memo</li> <li>Google has monopoly power in the U.S. in Horizontal Search; Search Advertising; and Syndicated Search and Search Advertising</li> <li>On the question of whether Google has unlawfully preferenced its own content while demoting rivals, we do not recommend the FTC proceed; it's a close call and case law is not favorable to anticompetitive product design and Google's efficiency justifications are strong and at there's some benefit to users</li> <li>On whether Google has unlawfully scraped content from vertical rivals to improve their own vertical products, recommending condemning as a conditional refusal to deal under Section 2 <ul> <li>Prior voluntary dealing was mutually beneficial</li> <li>Threats to remove rival content from general search designed to coerce rivals into allowing Google to user their content for Google's vertical product</li> <li>Natural and probable effect is to diminish incentives of vertical website R&amp;D</li> </ul></li> <li>On anticompetitive contractual restrictions on automated cross-management of ad campaigns, restrictions should be condemned under Section 2 <ul> <li>They limit ability of advertisers to make use of their own data, reducing innovation and increasing transaction costs for advertisers and third-party businesses</li> <li>Also degrade the quality of Google's rivals in search and search advertising</li> <li>Google's efficiency justifications appears to be pretextual</li> </ul></li> <li>On anticompetitive exclusionary agreements with websites for syndicated search and search ads, Google should be condemned under Section 2 <ul> <li>Only modest anticompetitive effects on publishers, but deny scale to competitors, competitively significant to main rival (Bing) as well as significant barrier to entry in longer term</li> <li>Google's efficiency justifications are, on balance, non-persuasive</li> </ul></li> <li>Possible remedies <ul> <li>Scraping <ul> <li>Could be required to provide an opt-out for snippets (reviews, ratings) from Google's vertical properties while retaining snippets in web search and/or Universal Search on main search results page</li> <li>Could be required to limit use of content indexed from web search results</li> </ul></li> 
<li>Campaign management restrictions <ul> <li>Could be required to remove problematic contractual restrictions from license agreements</li> </ul></li> <li>Exclusionary syndication agreements <ul> <li>Could be enjoined from entering into exclusive search agreements with search syndication partners and required to loosen restrictions surrounding syndication partners' use of rival search ads</li> </ul></li> </ul></li> <li>There are a number of risks to the case, not named in the summary except that Google can argue that Microsoft's most efficient distribution channel is bing.com and that any scale MS might gain will be immaterial to Bing's competitive position</li> <li>Staff concludes Google's conduct has resulted and will result in real harm to consumers and to innovation in online search and ads.</li> </ul> <h2 id="i-history-of-the-investigation-and-related-proceedings">I. HISTORY OF THE INVESTIGATION AND RELATED PROCEEDINGS</h2> <h3 id="a-ftc-investigation">A. FTC INVESTIGATION</h3> <ul> <li>Compulsory process approved on June 3, 2011</li> <li>Received over 2M docs (9.5M pages) &quot;and have reviewed many thousands of those documents&quot;</li> <li>Reviewed documents produced to DoJ in Google-Yahoo (2008) and ITA (2010) investigations and documents produced in response to European Commission and U.S. State investigations</li> <li>Interviewed dozens of parties including vertical competitors in travel, local, finance, and retail; U.S. advertisers and ad agencies; Google U.S. syndication and distribution partners; mobile device manufacturers and wireless carriers</li> <li>17 investigational hearings of Google execs &amp; employees</li> </ul> <h3 id="b-european-commission-investigation">B. EUROPEAN COMMISSION INVESTIGATION</h3> <ul> <li>Parallel investigation since November 2010</li> <li>May 21, 2012: Commissioner Joaquin Almunia issued letter signaling EC's possible intent to issue Statement of Objections for abuse of dominance in violation of Article 102 of EC Treaty <ul> <li>Concerns <ul> <li>&quot;favourable treatment of its own vertical search services as compared to those of its competitors in its natural search results&quot;</li> <li>&quot;practice of copying third party content&quot; to supplement own vertical content</li> <li>&quot;exclusivity agreements with publishers for the provision of search advertising intermediation services&quot;</li> <li>&quot;restrictions with regard to the portability and cross-platform management of online advertising campaigns&quot;</li> </ul></li> <li>offered opportunity to resolve concerns prior to issuance of SO by producing description of solutions</li> <li>Google denied infringement of EU law, but proposed several commitments to address stated concerns</li> </ul></li> <li>FTC staff coordinated with EC staff</li> </ul> <h3 id="c-multi-state-investigation">C. MULTI-STATE INVESTIGATION</h3> <ul> <li>Texas investigating since June 2010, leader of multi-state working group</li> <li>FTC working closely with states</li> </ul> <h3 id="d-private-litigation">D. PRIVATE LITIGATION</h3> <ul> <li>Several private lawsuits related to issues in our investigation; all dismissed</li> <li>Two categories: manipulation of search rankings and increases in minimum prices for AdWords search ads</li> <li>In Kinderstart.com LLC v. Google, Inc. and SearchKing, Inc. v.
Google Tech., Inc., plaintiffs alleged that Google unfairly demoted their results <ul> <li>SearchKing court ruled that Google's rankings are constitutionally protected opinion; even malicious manipulation of rankings would not expose Google to tort liability</li> <li>Kinderstart court rejected Google search being an essential facility for vertical websites</li> </ul></li> <li>In AdWords cases, plaintiffs argue that Google increased minimum bids for keywords they'd purchased, making those keywords effectively unavailable, depriving plaintiff websites of traffic <ul> <li>TradeComet.com, LLC v. Google, Inc. dismissed for improper venue and Google, Inc. v. myTriggers.com, Inc. dismissed for failing to describe harm to competition as a whole <ul> <li>both dismissed with little discussion of merits</li> </ul></li> <li>Person v. Google, Inc.: Judge Fogel of the Northern District of California criticized plaintiff's market definition, finding no basis for distinguishing &quot;search advertising market&quot; from larger market for internet advertising</li> </ul></li> </ul> <h2 id="ii-statement-of-facts">II. STATEMENT OF FACTS</h2> <h3 id="a-the-parties">A. THE PARTIES</h3> <h4 id="1-google">1. Google</h4> <ul> <li>Products include &quot;horizontal&quot; search engine and integrated &quot;vertical&quot; websites that focus on specific areas (product or shopping comparisons, maps, finance, books, video), search advertising via AdWords, search and search advertising syndication through AdSense, computer and software applications such as Google Toolbar, Gmail, Chrome; also have Android for mobile and applications for mobile devices and recently acquired Motorola Mobility</li> <li>32k people, $38B annual revenue</li> </ul> <h4 id="2-general-search-competitors">2. General search competitors</h4> <h5 id="a-microsoft">a. Microsoft</h5> <ul> <li>MSN search released in 1998, rebranded Bing in 2009. Filed complaints against Google in 2011 with FTC and EC</li> </ul> <h5 id="b-yahoo">b. Yahoo</h5> <ul> <li>Partnership with Bing since 2010; Bing provides search results and parties jointly operate a search ad network</li> </ul> <h4 id="3-major-vertical-competition">3.
Major Vertical Competition</h4> <ul> <li>In general, these companies complain that Google's practice of preferencing its own vertical results has negatively impacted their ability to compete for users and advertisers</li> <li>Amazon <ul> <li>Product search directly competes with Google Product Search</li> </ul></li> <li>eBay <ul> <li>product search competes with Google Product Search</li> </ul></li> <li>NexTag <ul> <li>shopping comparison website that competes with Google Product Search</li> </ul></li> <li>Foundem <ul> <li>UK product comparison website that competes with Google Product Search</li> <li>Complaint to EC, among others, prompted EC to open its investigation into Google's web search practices</li> <li>First vertical website to publicly accuse Google of preferencing its own vertical content over competitors on Google's search page</li> </ul></li> <li>Expedia <ul> <li>competes against Google's fledgling Google Flight Search</li> </ul></li> <li>TripAdvisor <ul> <li>TripAdvisor competes with Google Local (formerly Google Places)</li> <li>has complained that Google has appropriated / scraped its user-generated reviews, placing them on Google's own local property</li> </ul></li> <li>Yelp <ul> <li>has complained that Google has appropriated / scraped its user-generated reviews, placing them on Google's own local property</li> </ul></li> <li>Facebook <ul> <li>Competes with Google's recently introduced Google Plus</li> <li>has complained that Google's preferencing of Google Plus results over Facebook results is negatively impacting its ability to compete for users</li> </ul></li> </ul> <h3 id="b-industry-background">B. INDUSTRY BACKGROUND</h3> <h4 id="1-general-search">1. General Search</h4> <ul> <li>[nice description of search engines for lay people omitted]</li> </ul> <h4 id="2-online-advertising">2. Online Advertising</h4> <ul> <li>Google's core business is ads; 96% of its nearly $38B in revenue was from ad sales</li> <li>[lots of explanations of ad industry for lay people, mostly omitted]</li> <li>Reasons advertisers have shifted business to the web include the high degree of tracking possible and quantifiable, superior ROI</li> <li>Search ads make up most of online ad spend, primarily because advertisers believe search ads provide the best precision in IDing customers, measurability, and the highest ROI</li> <li>Online advertising continues to evolve, with new offerings that aren't traditional display or search ads, such as contextual ads, re-targeted behavioral ads, and social media ads <ul> <li>these new ad products don't account for a significant portion of online ads today and, with the exception of social media ads, appear to have only limited potential for growth [Surely video is pretty big now, especially if you include &quot;sponsorships&quot; and not just ads inserted by the platform?]</li> </ul></li> </ul> <h4 id="3-syndicated-search-and-search-advertising">3. Syndicated Search and Search Advertising</h4> <ul> <li>Search engines &quot;syndicate&quot; search and/or search ads <ul> <li>E.g., if you go to AOL or Ask.com, you can do a search which is powered by a search provider, like Google</li> </ul></li> <li>Publisher gets to keep the user on its own platform, search provider gets search volume and can monetize traffic <ul> <li>End-user doesn't pay; publisher pays Google either on a cost-per-user-query basis or by accepting search ads and splitting revenues from search ads run on the publisher's site.
Revenue sharing agreement often called &quot;traffic acquisition cost&quot; (TAC)</li> </ul></li> <li>Publishers can get search ads without offering search (AdSense) and vice versa</li> </ul> <h4 id="4-mobile-search">4. Mobile Search</h4> <ul> <li>Focus of search has been moving from desktop to &quot;rapid emerging — and lucrative — frontier of mobile&quot;</li> <li>Android at forefront; has surpassed iPhone in U.S. market share</li> <li>Mobile creates opportunities for location-based search ads; even more precise intent targeting than desktop search ads</li> <li>Google and others have signed distribution agreements with device makers and wireless carriers, so user-purchased devices usually come pre-installed with search and other apps</li> </ul> <h3 id="c-the-significance-of-scale-in-internet-search">C. THE SIGNIFICANCE OF SCALE IN INTERNET SEARCH</h3> <ul> <li>Scale (user queries and ad volume) important to competitive dynamics</li> </ul> <h4 id="1-search-query-volume">1. Search Query Volume</h4> <ul> <li>Microsoft claims it needs higher query volume to improve Bing <ul> <li>Logs of queries can be used to improve tail queries</li> <li>Suggestions, instant search, spelling correction</li> <li>Trend identification, fresh news stories</li> </ul></li> <li>Click data important for evaluating search quality <ul> <li>Udi Manber (former Google chief of search quality) testimony: &quot;The ranking itself is affected by the click data. If we discover that, for a particular query, hypothetically, 80 percent of people click on Result No. 2 and only 10 percent click on Result No. 1, after a while we figure out, well, probably Result 2 is the one people want. So we'll switch it.&quot;</li> <li>Testimony from Eric Schmidt and Sergey Brin confirms click data important and provides feedback on quality of search results</li> <li>Scale / volume allows more experiments <ul> <li>Larry and Sergei's annual letter in 2005 notes importance of experiments, running multiple simultaneous experiments</li> <li>More scale allows for more experiments as well as for experiments to complete more quickly</li> <li>Susan Athey (Microsoft chief economist) says Microsoft search quality team is greatly hampered by insufficient search volume to run experiments</li> </ul></li> </ul></li> <li>2009 comment from Udi Manber: &quot;The bottom line is this. If Microsoft had the same traffic we have their quality will improve *significantly*, and if we had the same traffic they have, ours will drop significantly. That's a fact&quot;</li> </ul> <h4 id="2-advertising-volume">2. Advertising Volume</h4> <ul> <li>Microsoft claims they need more ad volume to improve relevance and quality of ads <ul> <li>More ads means more choices over what ads to serve to use, better matched ads / higher conversion rates</li> <li>Also means more queries</li> <li>Also has similar feedback loop to search</li> </ul></li> <li>Increase volume of advertisers increases competitiveness for ad properties, gives more revenue to search engine <ul> <li>Allows search engine to amortize costs, re-invest in R&amp;D, provide better advertiser coverage, revenue through revenue-sharing agreements to syndication partners (website publishers). Greater revenue to partners attracts more publishers and more advertisers</li> </ul></li> </ul> <h4 id="3-scale-curve">3. 
Scale Curve</h4> <ul> <li>Google acknowledges the importance of scale (outside of the scope of this particular discussion)</li> <li>Google documents replete with references to &quot;virtuous cycle&quot; among users, advertisers, and publishers <ul> <li>Testimony from Google execs confirms this</li> </ul></li> <li>But Google argues scale no longer matters at Google's scale or Microsoft's scale, and that additional scale at Microsoft's size would not &quot;significantly improve&quot; Microsoft search quality</li> <li>Susan Athey argues that relative scale (Bing being 1/5th the size of Google) matters, not absolute size</li> <li>Microsoft claims that a 5% to 10% increase in query volume would be &quot;very meaningful&quot;, notes that gaining access to Yahoo queries and ad volume in 2010 was significant for search quality and monetization <ul> <li>Claim that Yahoo query data increased click-through rate for &quot;auto suggest&quot; from 44% to 61% [the timeframe here is July 2010 to September 2011 — too bad they didn't provide an A/B test here, since this more than 1 year timeframe allows for many other changes to impact the suggest feature as well; did they ship a major change here without A/B testing it? That seems odd]</li> </ul></li> <li>Microsoft also claims search quality improvements due to experiment volume enabled by extra query volume</li> </ul> <h3 id="d-google-s-suspect-conduct">D. GOOGLE'S SUSPECT CONDUCT</h3> <ul> <li>Five main areas of staff investigation of alleged anticompetitive conduct:</li> </ul> <h4 id="1-google-s-preferencing-of-google-vertical-properties-within-its-search-engine-results-page-serp">1. Google's Preferencing of Google Vertical Properties Within Its Search Engine Results Page (&quot;SERP&quot;)</h4> <ul> <li>Allegation is that Google's conduct is anticompetitive because &quot;it forecloses alternative search platforms that might operate to constrain Google's dominance in search and search advertising&quot;</li> <li>&quot;Although it is a close call, we do not recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> </ul> <h5 id="a-overview-of-changes-to-google-s-serp">a. Overview of Changes to Google's SERP</h5> <ul> <li>Google makes changes to UI and algorithms, sometimes without user testing</li> <li>sometimes with testing via a launch review process, typically including: <ul> <li>&quot;the sandbox&quot;, internal testing by engineers</li> <li>&quot;SxS&quot;, side-by-side testing by external raters who compare existing results to proposed results</li> <li>Testing on a small percent of live traffic</li> <li>&quot;launch report&quot; for Launch Committee</li> </ul></li> <li>Google claims to have run 8000 SxS tests and 2500 &quot;live&quot; click tests in 2010, with 500 changes launched</li> <li>&quot;Google's stated goal is to make its ranking algorithms better in order to provide the user with the best experience possible.&quot; <br /></li> </ul> <h5 id="b-google-s-development-and-introduction-of-vertical-properties">b.
Google's Development and Introduction of Vertical Properties</h5> <ul> <li>Google vertical properties launched in stages, initially around 2001</li> <li>Google News, Froogle (shopping), Image Search, and Groups</li> <li>Google has separate indexes for each vertical</li> <li>Around 2005 ,Google realized that vertical search engines, i.e., aggregators in some categories were a &quot;threat&quot; to dominance in web search, feared that these could cause shift in some searches away from Google</li> <li>From GOOG-Texas-1325832-33 (2010): &quot;Vertical search is of tremendous strategic importance to Google. Otherwise the risk is that Google is the go-to place for finding information only in the cases where there is sufficiently low monetization potential that no niche vertical search competitor has filled the space with a better alternative.&quot;</li> <li>2008 presentation titled &quot;Online Advertising Challenges: Rise of the Aggregators&quot;: <ul> <li>&quot;Issue 1. Consumers migrating to MoneySupermarket. Driver: General search engines not solving consumer queries as well as specialized vertical search Consequence: Increasing proportion of visitors going directly to MoneySupermarket. Google Implication: Loss of query volumes.&quot;</li> <li>Issue 2: &quot;MoneySupermarket has better advertiser proposition. Driver: MoneySupermarket offers cheaper, lower risk (CPA-based) leads to advertisers. Google Implication: Advertiser pull: Direct advertisers switch spend to MoneySupermarket/other channels&quot;</li> </ul></li> <li>In response to this threat, Google invested in existing verticals (shopping, local) and invested in new verticals (mortgages, offers, hotel search, flight search)</li> </ul> <h5 id="c-the-evolution-of-display-of-google-s-vertical-properties-on-the-serp">c. The Evolution of Display of Google's Vertical Properties on the SERP</h5> <ul> <li>Google initially had tabs that let users search within verticals</li> <li>In 2003, Marissa Mayer started developing &quot;Universal Search&quot; (launched in 2007), to put this content directly on Google's SERP. Mayer wrote: <ul> <li>&quot;Universal Search is an effort to redesign the user interface of the main Google.com results page SO that Google deliver[s] the most relevant information to the user on Google.com no matter what corpus that information comes from. This design is motivated by the fact that very few users are motivated to click on our tabs, SO they often miss relevant results in the other corpora.&quot;</li> </ul></li> <li>Prior to Universal Search launch, Google used &quot;OneBoxes&quot;, which put vertical content above Google's SERP</li> <li>After launching Universal Search, vertical results could go anywhere</li> </ul> <h5 id="d-google-s-preferential-display-of-google-vertical-properties-on-the-serp">d. 
Google's Preferential Display of Google Vertical Properties on the SERP</h5> <ul> <li>Google used control over Google SERP both to improve UX for searches and to maximize benefit to its own vertical properties</li> <li>Google wanted to maximize percentage of queries that had Universal Search results and drive traffic to Google properties <ul> <li>In 2008, goal to &quot;[i]ncrease google.com product search inclusion to the level of google.com searches with 'product intent', while preserving clickthrough rate.&quot; (GOOG-Texas-0227159-66)</li> <li>Q1 2008, goal of triggering Product Universal on 6% of English searches</li> <li>Q2 2008, goal changed to top OneBox coverage of 50% with 10% CTR and &quot;[i]ncrease coverage on head queries. For example, we should be triggering on at least 5 of the top 10 most popular queries on amazon.com at any given time, rather than only one.&quot;</li> <li>&quot;Larry thought product should get more exposure&quot;, GOOG-ITA-04-0004120-46 (2009)</li> <li>Mandate from exec meeting to push product-related queries as quickly as possible</li> <li>Launch Report for one algorithm change: 'To increase triggering on head queries, Google also implemented a change to trigger the Product Universal on google.com queries if they appeared often in the product vertical. &quot;Using Exact Corpusboost to Trigger Product Onebox&quot; compares queries on www.google.com with queries on Google Shopping, triggers the Product OneBox if the same query is often searched in Google Shopping, and automatically places the universal in position 4, regardless of the quality of the universal results or user &quot;bias&quot; for top placement of the box.'</li> <li>&quot;presentation stating that Google could take a number of steps to be &quot;#1&quot; in verticals, including &quot;[e]ither [getting] high traffic from google.com, or [developing] a separate strong brand,&quot; and asking: &quot;How do we link from Search to ensure strong traffic without harming user experience or AdWords proposition for advertisers?&quot;)&quot;, GOOGFOX-000082469 (2009)</li> <li>Jon Hanke, head of Google Local, to Marissa Mayer: &quot;long term, I think we need to commit to a more aggressive path w/ google where we can show non-webpage results on google outside of the universal 'box' most of us on geo think that we won't win unless we can inject a lot more of local directly into google results.&quot; <ul> <li>&quot;Google's key strengths are: Google.com real estate for the ~70MM of product queries/day in US/UK/DE alone&quot;</li> <li>&quot;I think the mandate has to come down that we want to win [in local] and we are willing to take some hits [i.e., trigger incorrectly sometimes]. I think a philosophical decision needs to get made that results that are not web search results and that displace web pages are &quot;OK&quot; on google.com and nothing to be ashamed of. That would open the door to place page or local entities as ranked results outside of some 'local universal' container. Arguably for many queries<em>all</em> of the top 10 results should be local entities from our index with refinement options. The current mentality is that the google results page needs to be primarily about web pages, possibly with some other annotations if they are really, really good. That's the big weakness that bing is shooting at w/ the 'decision engine' pitch - not a sea of pointers to possible answers, but real answers right on the page. 
&quot;</li> </ul></li> <li>In spring 2008, Google estimated top placement of Product Universal would lead to loss of $154M/yr on product queries. Ads team requested reduction in triggering frequency and Product Universal team objected, &quot;We face strong competition and must move quickly. Turning down onebox would hamper progress as follows - Ranking: Losing click data harms ranking; [t]riggering Losing CTR and google.com query distribution data triggering accuracy; [c]omprehensiveness: Losing traffic harms merchant growth and therefore comprehensiveness; [m]erchant cooperation: Losing traffic reduces effort merchants put into offer data, tax, &amp; shipping; PR: Turning off onebox reduces Google's credibility in commerce; [u]ser awareness: Losing shopping-related UI on google.com reduces awareness of Google's shopping features.&quot;</li> </ul></li> <li>&quot;Google embellished its Universal Search results with photos and other eye-catching interfaces, recognizing that these design choices would help steer users to Google's vertical properties&quot; <ul> <li>&quot;Third party studies show the substantial difference in traffic with prominent, graphical user interfaces&quot;; &quot;These 'rich' user interfaces are not available to competing vertical websites&quot;</li> </ul></li> <li>Google placed its Universal Search results near or at the top of the SERP, pushing other results down, resulting in reduced CTR to &quot;natural search results&quot; <ul> <li>Google did this without comparing quality of Google's vertical content to competitors or evaluating whether users prefer Google's vertical content to displaced results</li> </ul></li> <li>Click-through data from eBay indicates that (Jan-Apr 2012) Google Product Search appeared in a top 5 position 64% of the time when displayed and Google Product Search had lower CTR than web search in same position regardless of position [below is rank: natural result CTR / Google Shopping CTR / eBay CTR] <ul> <li>1: 38% / 21% / 31%</li> <li>2: 21% / 14% / 20%</li> <li>3: 16% / 12% / 18%</li> <li>4: 13% / 9% / 11%</li> <li>5: 10% / 8% / 10%</li> <li>6: 8% / 6% / 9%</li> <li>7: 7% / 5% / 9%</li> <li>8: 6% / 2% / 7%</li> <li>9: 6% / 3% / 6%</li> <li>10: 5% / 2% / 6%</li> <li>11: 5% / 2% / 5%</li> <li>12: 3% / 1% / 4%</li> </ul></li> <li>Although Google tracks CTR and relies on CTR to improve web results, it hasn't relied on CTR to rank Universal Search results against other web search results</li> <li>Marissa Mayer said Google didn't use CTR &quot;because it would take too long to move up on the SERP on the basis of user click-through rate&quot;</li> <li>Instead, &quot;Google used occurrence of competing vertical websites to automatically boost the ranking of its own vertical properties above that of competitors&quot; <ul> <li>If comparison shopping site was relevant, Google would insert Google Product Search above any rival</li> <li>If local search like Yelp or CitySearch was relevant, Google automatically returned Google Local at top of SERP</li> </ul></li> <li>Google launched commission-based verticals (mortgage, flights, offers) in ad space reserved exclusively for its own properties <ul> <li>In 2012, Google announced that Google Product Search would transition to paid and Google would stop including product listings for merchants who don't pay to be listed</li> <li>Google's dedicated ads don't compete with other ads via AdWords and automatically get the most effective ad spots, usually above natural search results</li> <li>As with Google's Universal results, its own ads have a rich user
interface not available to competitors which results in higher CTR</li> </ul></li> </ul> <h5 id="e-google-s-demotion-of-competing-vertical-websites">e. Google's Demotion of Competing Vertical Websites</h5> <ul> <li>&quot;While Google embarked on a multi-year strategy of developing and showcasing its own vertical properties, Google simultaneously adopted a strategy of demoting, or refusing to display, links to certain vertical websites in highly commercial categories&quot;</li> <li>&quot;Google has identified comparison shopping websites as undesirable to users, and has developed several algorithms to demote these websites on its SERP. Through an algorithm launched in 2007, Google demoted all comparison shopping websites beyond the first two on its SERP&quot;</li> <li>&quot;Google's own vertical properties (inserted into Google's SERP via Universal Search) have not been subject to the same demotion algorithms, even though they might otherwise meet the criteria for demotion.&quot; <ul> <li>Google has acknowledged that its own vertical sites meet the exact criteria for demotion</li> <li>Additionally, Google's web spam team originally refused to add Froogle to search results because &quot;[o]ur algorithms specifically look for pages like these to either demote or remove from the index.&quot;</li> <li>Google's web spam team also refused to add Google's local property</li> </ul></li> </ul> <h5 id="f-effects-of-google-s-serp-changes-on-vertical-rivals">f. Effects of Google's SERP Changes on Vertical Rivals</h5> <ul> <li>&quot;Google's prominent placement and display of its Universal Search properties, combined with the demotion of certain vertical competitors in Google's natural search results, has resulted in significant loss of traffic to many competing vertical websites&quot;</li> <li>&quot;Google's internal data confirms the impact, showing that Google anticipated significant traffic loss to certain categories of vertical websites when it implemented many of the algorithmic changes described above&quot;</li> <li>&quot;While Google's changes to its SERP led to a significant decrease in traffic for the websites of many vertical competitors, Google's prominent showcasing of its vertical properties led to gains in user share for its own properties&quot;</li> <li>&quot;For example, Google's inclusion of Google Product Search as a Universal Search result took Google Product Search from a rank of seventh in page views in July 2007 to the number one rank by July 2008. Google product search leadership acknowledged that '[t]he majority of that growth has been driven through product search universal.'&quot;</li> <li>&quot;Beyond the direct impact on traffic to Google and its rivals, Google's changes to its SERP have led to reduced investment and innovation in vertical search markets. For example, as a result of the rise of Google Product Search (and simultaneous fall of rival comparison shopping websites), NexTag has taken steps to reduce its investment in this area. Google's more recent launch of its flight search product has also caused NexTag to cease development of an 'innovative and competitive travel service.'&quot;</li> </ul> <h4 id="2-google-s-scraping-of-rivals-vertical-content">2. 
Google's &quot;Scraping&quot; of Rivals' Vertical Content</h4> <ul> <li>&quot;Staff has investigated whether Google has &quot;scraped&quot; - or appropriated - the content of rival vertical websites in order to improve its own vertical properties SO as to maintain, preserve, or enhance Google's monopoly power in the markets for search and search advertising. We recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> <li>In addition to developing its own vertical properties, Google scraped content from existing vertical websites (e.g., Yelp, TripAdvisor, Amazon) in order to improve its own vertical listings, &quot;e.g., GOOG-Texas-1380771-73 (2009), at 71-72 (discussing importance of Google Places carrying better review content from Yelp).&quot;</li> </ul> <h5 id="a-the-local-story">a. The &quot;Local&quot; Story</h5> <ul> <li>&quot;Some local information providers, such as Yelp, TripAdvisor, and CitySearch, disapprove of the ways in which Google has made use of their content&quot;</li> <li>&quot;Google recognized that review content, in particular, was &quot;critical to winning in local search,&quot; but that Google had an 'unhealthy dependency' on Yelp for much of its review content. Google feared that its heavy reliance on Yelp content, along with Yelp's success in certain categories and geographies, could lead Yelp and other local information websites to siphon users' local queries away from Google&quot; <ul> <li>&quot;concern that Yelp (and similar) could become competing local search platforms&quot; (Goog-Texas-0975467-97)</li> </ul></li> <li>Google Local execs tried to convince Google to acquire Yelp, but failed</li> <li>Yelp, on finding that Google was going to use reviews on its own property, discontinued its feed and asked for Yelp content to be removed from Google Local</li> <li>&quot;after offering its own review site for more than two years, Google recognized that it had failed to develop a community of users - and thus, the critical mass of user reviews - that it needed to sustain its local product.&quot;, which led to failed attempt to buy Yelp <ul> <li>To address this problem, Google added Google Places results on SERP: &quot;The listing for each business that came up as a search result linked the user directly to Google's Places page, with a label indicating that hundreds of reviews for the business were available on the Places page (but with no links to the actual sources of those reviews). On the Places Page itself, Google provided an entire paragraph of each copied review (although not the complete review), followed by a link to the source of the review, such as Yelp (which it crawled for reviews) and TripAdvisor (which was providing a feed).&quot;</li> <li>In July 2010, Yelp noticed that Google was featuring Yelp's content without a license and protested to Google.
TripAdvisor chose not to renew license with Google after finding same</li> <li>Google implemented new policy that would ban properties from Google search if they didn't allow their content to be used in Google Places <ul> <li>&quot;GOOG-Texas-1041511-12 (2010), at 12 (&quot;remove blacklist of yelp [reviews] from Web-extracted Reviews once provider based UI live&quot;); GOOG-Texas-1417391-403 (2010), at 394 (&quot;stating that Google should wait to publish a blog post on the new UI until the change to &quot;unblacklist Yelp&quot; is &quot;live&quot;).&quot;</li> </ul></li> <li>Along with this policy, launched new reviews product and seeded it reviews from 3rd party websites without attribution</li> <li>Yelp, CitySearch, and TripAdvisor all complained and were all told that they could only remove their content if they were fully removed from search results. &quot;This was not technically necessary - it was just a policy decision by Google.&quot;</li> <li>Yelp sent Google a C&amp;D</li> <li>Google claimed it was technically infeasible to remove Yelp content from Google Places without also banning Yelp from search result <ul> <li>Google later did this, making it clear that the claim that it was technically infeasible was false</li> <li>Google still maintained that it would be technically infeasible to remove Yelp from Google Places without removing it from &quot;local merge&quot; interface on SERP. Staff believes this assertion is false as well because Google maintains numerous &quot;blacklists&quot; that prevent content from being shown in specific locations</li> <li>Mayer later admitted during hearing that the infeasible claim was false and that Google feared consequences of allowing websites to opt out of Google Places while staying in &quot;local merge&quot;</li> <li>&quot;Yelp contends that Google's continued refusal to link to Yelp on Google's 'local merge' interface on the main SERP is simply retaliation for Yelp seeking removal from Google Places.&quot;</li> </ul></li> </ul></li> <li>&quot;Publicly, Google framed its changes to Google Local as a redesign to move toward the provision of more original content, and thereby, to remove all third-party content and review counts from Google Local, as well as from the prominent &quot;local merge&quot; Universal Search interface on the main SERP. But the more likely explanation is that, by July 2011,Google had already collected sufficient reviews by bootstrapping its review collection on the display of other websites' reviews. It no longer needed to display third-party reviews, particularly while under investigation for this precise conduct.&quot;</li> </ul> <h5 id="b-the-shopping-story">b. The &quot;Shopping&quot; Story</h5> <ul> <li>[full notes omitted; story is similar to above, but with Amazon; similar claims of impossibility of removing from some places and not others; Amazon wanted Google to stop using Amazon star ratings, which Google claimed was impossible without blacklisting Amazon from all of web search, etc.; there's also a parallel story about Froogle's failure and Google's actions after that]</li> </ul> <h5 id="c-effects-of-google-s-scraping-on-vertical-rivals">c. Effects of Google's &quot;Scraping&quot; on Vertical Rivals</h5> <ul> <li>&quot;Because Google scraped content from these vertical websites over an extended period of time, it is difficult to point to declines in traffic that are specifically attributable to Google's conduct. 
However, the natural and probable effect of Google's conduct is to diminish the incentives of companies like Yelp, TripAdvisor, CitySearch, and Amazon to invest in, and to develop, new and innovative content, as the companies cannot fully capture the benefits of their innovations&quot;</li> </ul> <h4 id="3-google-s-api-restrictions">3. Google's API Restrictions</h4> <ul> <li>&quot;Staff has investigated whether Google's restrictions on the automated cross-management of advertising campaigns has unlawfully contributed to the maintenance, preservation, or enhancement of Google's monopoly power in the markets for search and search advertising. Microsoft alleges that these restrictions are anticompetitive because they prevent Google's competitors from achieving efficient scale in search and search advertising. We recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> </ul> <h5 id="a-overview-of-the-adwords-platform">a. Overview of the AdWords Platform</h5> <ul> <li>To set up AdWords, advertisers prepare bids. Can have thousands or hundreds of thousands of keywords. <ul> <li>E.g., DirectTV might bid on &quot;television&quot;, &quot;TV&quot;, and &quot;satellite&quot; plus specific TV show names, such as &quot;Friday Night Lights&quot;, as well as misspellings</li> <li>Bids can be calibrated by time and location</li> <li>Advertisers then prepare ads (called &quot;creatives&quot;) and match with various groups of keywords</li> <li>Advertisers get data from AdWords, can evaluate effectiveness and modify bids, add/drop keywords, modify creative <ul> <li>This is called &quot;optimization&quot; when done manually; expensive and time-intensive</li> </ul></li> </ul></li> <li>Initially two ways to access AdWords system, AdWords Front End and AdWords Editor <ul> <li>Editor is a program. Allows advertisers to download campaign information from Google, make bulk changes offline, then upload changes back to AdWords</li> <li>Advertisers would make so many changes that system's capacity would be exceeded, causing outages</li> </ul></li> <li>In 2004, Google added AdWords API to address problems</li> <li>[description of what an API is omitted]</li> </ul> <h5 id="b-the-restrictive-conditions">b. The Restrictive Conditions</h5> <ul> <li>AdWords API terms and conditions non-negotiable, apply to all users</li> <li>One restriction prevents advertisers from using 3rd party tool or have 3rd party use a tool to copy data from AdWords API into ad campaign on another search network</li> <li>Another, can't use 3rd party tool or have 3rd party use a tool to comingle AdWords campaign data with data from another search engine</li> <li>The two conditions above will be referred to as &quot;the restrictive conditions&quot;</li> <li>&quot;These restrictions essentially prevent any third-party tool developer or advertising agency from creating a tool that provides a single user interface for multiple advertising campaigns. 
Such tools would facilitate cross-platform advertising.&quot;</li> <li>&quot;However, the restrictions do not apply to advertisers themselves, which means that very large advertisers, such as Amazon and eBay, can develop - and have developed - their own multi-homing tools that simultaneously manage campaigns across platforms&quot;</li> <li>&quot;The advertisers affected are those whose campaign volumes are large enough to benefit from using the AdWords API, but too small to justify devoting the necessary resources to develop in-house the software and expertise to manage multiple search network ad campaigns.&quot;</li> </ul> <h5 id="c-effects-of-the-restrictive-conditions">c. Effects of the Restrictive Conditions</h5> <h6 id="i-effects-on-advertisers-and-search-engine-marketers-sems">i. Effects on Advertisers and Search Engine Marketers (&quot;SEMs&quot;)</h6> <ul> <li>Prevents development of tools that would allow advertisers to manage ad campaigns on multiple search ad networks simultaneously</li> <li>Google routinely audits API clients for compliance</li> <li>Google has required SEMs to remove functionality, &quot;e.g., GOOGEC-0180810-14 (2010) (Trada); GOOGEC-0180815-16 (2010) (MediaPlex); GOOGEC-0181055-58 (2010) (CoreMetrics); GOOGEC-0181083-87 (2010) (Keybroker); GOOGEC-0182218-330 (2008) (Marin Software). 251 Acquisio IR (Sep. 12, 2011); Efficient Frontier IR (Mar. 5, 2012)&quot;</li> <li>Other SEMs have stated they would develop this functionality without restrictions</li> <li>&quot;Google anticipated that the restrictive conditions would eliminate SEM incentives to innovate.&quot;, &quot;GOOGKAMA-000004815 (2004), at 2.&quot;</li> <li>&quot;Many advertisers have said they would be interested in buying a tool that had multi-homing functionality. Such functionality would be attractive to advertisers because it would reduce the costs of managing multiple ad campaigns, giving advertisers access to additional advertising opportunities on multiple search advertising networks with minimal additional investment of time. The advertisers who would benefit from such a tool appear to be the medium-sized advertisers, whose advertising budgets are too small to justify hiring a full service agency, but large enough to justify paying for such a tool to help increase their advertising opportunities on multiple search networks.&quot;</li> </ul> <h6 id="ii-effects-on-competitors">ii. Effects on Competitors</h6> <ul> <li>Removing restrictions would increase ad spend on networks that compete with Google</li> <li>Data on advertiser multi-homing show some effects of restrictive conditions. Nearly all the largest advertisers multi-home, but percentage declines as spend decreases <ul> <li>Advertisers would also multi-home with more intensity <ul> <li>Microsoft claims that multi-homing advertisers optimize their Google campaigns almost daily, and their Microsoft campaigns less frequently, weekly or bi-weekly</li> </ul></li> </ul></li> <li>Without incremental transaction costs, &quot;all rational advertisers would multi-home&quot;</li> <li>Staff interviewed randomly selected small advertisers. Interviews &quot;strongly supported&quot; thesis that advertisers would multi-home if cross-platform optimization tool were available <ul> <li>Some advertisers don't advertise on Bing due to the lack of such a tool; the ones that do advertise on Bing do less optimization</li> </ul></li> </ul> <h5 id="d-internal-google-discussions-regarding-the-restrictions">d.
Internal Google Discussions Regarding the Restrictions</h5> <ul> <li>Internal discussions support the above</li> <li>PM wrote the following in 2007, endorsed by director of PM Richard Holden: <ul> <li>&quot;If we offer cross-network SEM in [Europe], we will give a significant boost to our competitors. Most advertisers that I have talked to in [Europe] don't bother running campaigns on [Microsoft] or Yahoo because the additional overhead needed to manage these other networks outweighs the small amount of additional traffic. For this reason, [Microsoft] and Yahoo still have a fraction of the advertisers that we have in [Europe], and they still have lower average CPAs [cost per acquisition]&quot;</li> <li>&quot;This last point is significant. The success of Google's AdWords auctions has served to raise the costs of advertising on Google. With more advertisers entering the AdWords auctions, the prices it takes to win those auctions have naturally risen. As a result, the costs per acquisition on Google have risen relative to the costs per acquisition on Bing and Yahoo!. Despite these higher costs, as this document notes, advertisers are not switching to Bing and Yahoo! because, for many of them, the transactional costs are too great.&quot;</li> </ul></li> <li>In Dec 2008, Google team led by Richard Holden evaluated possibility of relaxing or removing restrictive conditions and consulted with Google chief economist Hal Varian. Some of Holden's observations: <ul> <li>Advertisers seek out SEMs and agencies for cross-network management technology and services;</li> <li>The restrictive conditions make the market more inefficient;</li> <li>Removing the restrictive conditions would &quot;open up the market&quot; and give Google the opportunity to compete with a best-in-class SEM tool with &quot;a streamlined workflow&quot;;</li> <li>Removing the restrictive conditions would allow SEMs to improve their tools as well;</li> <li>While there is a risk of additional spend going to competing search networks, it is unlikely that Google would be seriously harmed because &quot;advertisers are going where the users are,&quot; i.e., to Google</li> </ul></li> <li>&quot;internally, Google recognized that removing the restrictions would create a more efficient market, but acknowledged a concern that doing so might diminish Google's grip on advertisers.&quot;</li> <li>&quot;Nonetheless, following up on that meeting, Google began evaluating ways to improve the DART Search program. DART Search was a cross-network campaign management tool owned by DoubleClick, which Google acquired in 2008. Google engineers were looking at improving the DART Search product, but had to confront limitations imposed by the restrictive conditions. During his investigational hearing, Richard Holden steadfastly denied any linkage between the need to relax the restrictive conditions and the plans to improve DART Search. ²⁷⁴ However, a series of documents - documents authored by Holden - explicitly link the two ideas.&quot;</li> <li>Dec 2008 Holden to SVP of ad products, Susan Wojcicki and others met. <ul> <li>Holden wrote: &quot;[O]ne debate we are having is whether we should eliminate our API T&amp;Cs requirement that AW [AdWords] features not be co-mingled with competitor network features in SEM cross-network tools like DART Search. We are advocating that we eliminate this requirement and that we build a much more streamlined and efficient DART Search offering and let SEM tool provider competitors do the same. 
There was some debate about this, but we concluded that it is better for customers and the industry as a whole to make things more efficient and we will maximize our opportunity by moving quickly and providing the most robust offering&quot;</li> </ul></li> <li>Feb 2009, Holden wrote exec summary for DART, suggested Google &quot;&quot;alter the AdWords Ts&amp;Cs to be less restrictive and produce the leading cross-network toolset that increases advertiser/agency efficiency.&quot; to &quot;[r]educe friction in the search ads sales and management process and grow the industry faster&quot;</li> <li>Larry Page rejected this. Afterwards, Holden wrote &quot;We've heard that and we will focus on building the product to be industry-leading and will evaluate it with him when it is done and then discuss co-mingling and enabling all to do it.&quot;</li> <li>Sep 2009, API PM raised possibility of eliminating restrictive conditions to help DART. Comment from Holden: <ul> <li>&quot;I think the core issue on which I'd like to get Susan's take is whether she sees a high risk of existing spend being channeled to MS/Yahoo! due to a more lenient official policy on campaign cloning. Then, weigh that risk against the benefits: enabling DART Search to compete better against non-compliant SEM tools, more industry goodwill, easier compliance enforcement. Does that seem like the right high level message?&quot;</li> </ul></li> <li>&quot;The documents make clear that Google was weighing the efficiency of relaxing the restrictions against the potential cost to Google in market power&quot;</li> <li>&quot;At a January 2010 meeting, Larry Page decided against removing or relaxing the restrictive conditions. However, there is no record of the rationale for that decision or what weight was given to the concern that relaxing the restrictive conditions might result in spend being channeled to Google's competitors. Larry Page has not testified. Holden testified that he did not recall the discussion. The participants at the meeting did not take notes &quot;for obvious reasons.&quot; Nonetheless, the documents paint a clear picture: Google rejected relaxing the API restrictions, and at least part of the reason for this was fear of diverting advertising spend to Microsoft.&quot; <ul> <li>Holden to Wojcicki: &quot;We didn't take notes <a href="https://mastodon.social/@danluu/111243272325539288">for obvious reasons (hence why I'm not elaborating too much here in email)</a> but happy to brief you more verbally&quot;.</li> </ul></li> </ul> <h4 id="4-google-s-exclusive-and-restrictive-syndication-agreements">4. Google's Exclusive and Restrictive Syndication Agreements</h4> <ul> <li>&quot;Staff has investigated whether Google has entered into exclusive or highly restrictive agreements with website publishers that have served to maintain, preserve, or enhance Google's monopoly power in the markets for search, search advertising, or search and search advertising syndication (or &quot;search intermediation&quot;). We recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> </ul> <h5 id="a-publishers-and-market-structure">a. 
Publishers and Market Structure</h5> <ul> <li>Buyers of search and search ad syndication are website publishers</li> <li>Largest sites account for vast majority of syndicated search traffic and volume</li> <li>Biggest customers are e-commerce retailers (e.g., Amazon and eBay), traditional retailers with websites (e.g., Wal-Mart, Target, Best Buy), and ISPs which operate their own portals</li> <li>Below this group, companies with significant query volume, including vertical e-commerce sites such as Kayak, smaller retailers and ISPs such as EarthLink; all of these are &lt; 1% of Google's total AdSense query volume</li> <li>Below, publisher size rapidly drops off to &lt; 0.1% of Google's query volume</li> <li>Payment publisher receives a function of <ul> <li>volume of clicks on syndicated ad</li> <li>&quot;CPC&quot;, or cost-per-click advertiser willing to pay for each click</li> <li>revenue sharing percentage</li> </ul></li> <li>rate of user clicks and CPC aggregated to form &quot;monetization rate&quot;</li> </ul> <h5 id="b-development-of-the-market-for-search-syndication">b. Development of the Market for Search Syndication</h5> <ul> <li>First AdSense for Search (AFS) agreements with AOL and EarthLink in 2002 <ul> <li>Goal then was to grow nascent industry of syndicated search ads</li> <li>At the time, Google was bidding against incumbent Overture (later acquired by Yahoo) for exclusive agreements with syndication partners</li> </ul></li> <li>Google's early deals favored publishers</li> <li>To establish a presence, Google offered up-front financial guarantees to publishers</li> </ul> <p>c. Specifics of Google's Syndication Agreements</p> <ul> <li>&quot;Today, the typical AdSense agreement contains terms and conditions that describe how and when Google will deliver search, search advertising, and other (contextual or domain related) advertising services.&quot;</li> <li>Two main categories are AFS (search) and AFC (content). Staff investigation focused on AFS</li> <li>For AFS, two types of agreements. GSAs (Google Service Agreements) negotiated with large partners and standard online contracts, which are non-negotiable and non-exclusive</li> <li>Bulk of AFS partners are on standard online agreements, but those are a small fraction of revenue</li> <li>Bulk of revenue comes from GSAs with Google's 10 largest partners (almost 80% of query volume in 2011). All GSAs have some form of exclusivity or &quot;preferred placement&quot; for Google</li> <li>&quot;Google's exclusive AFS agreements effectively prohibit the use of non-Google search and search advertising within the sites and pages designated in the agreement. Some exclusive agreements cover all properties held by a publisher globally; other agreements provide for a property-by-property (or market-by-market) assignment&quot;</li> <li>By 2008, Google began to migrate away from exclusivity to &quot;preferred placement&quot;. Google must display minimum of 3 ads or number of any competitor (whichever is greater), in an unbroken block, with &quot;preferred placement&quot; (in the most prominent position on publisher's website)</li> <li>Google had preferred placement restrictions in GSAs and standard online agreement. 
Google maintains it was not aware of this provision in standard online agreement until investigational hearing of Google VP for search services, Joan Braddi, where staff questioned Braddi <ul> <li>See Letter from Scott Sher, Wilson Sonsini, to Barbara Blank (May 25, 2012) (explaining that, as of the date of the letter, Google was removing the preferred placement clause from the Online Terms and Conditions, and offering no further explanation of this decision)</li> </ul></li> </ul> <h5 id="d-effects-of-exclusivity-and-preferred-placement">d. Effects of Exclusivity and Preferred Placement</h5> <ul> <li>Staff interviewed large and small customers for search and search advertising syndication. Key findings:</li> </ul> <h6 id="i-common-publisher-responses">i. Common Publisher Responses</h6> <ul> <li>Universal agreement that Bing's search and search advertising markedly inferior, not competitive across-the-board <ul> <li>Amazon reports that Bing monetizes at half the rate of Google</li> <li>business.com told staff that Google would have to cut revenue share from 64.5% to 30% and Microsoft would have to provide 90% share because Microsoft's platform has such low monetization</li> </ul></li> <li>Customers &quot;generally confirmed&quot; Microsoft's claim that Bing's search syndication is inferior in part because Microsoft's network is smaller than Google's <ul> <li>With a larger ad base, Google more likely to have relevant, high-quality, ad for any given query, which improves monetization rate</li> </ul></li> <li>A small publisher said, essentially, the only publishers exclusively using Bing are ones who've been banned from Google's service <ul> <li>We know from other interviews this is an exaggeration, but it captures the general tenor of comments about Microsoft</li> </ul></li> <li>Publishers reported Microsoft not aggressively trying to win their business <ul> <li>Microsoft exec acknowledge that Bing needs a larger portfolio of advertisers, has been focused there over winning new syndication business</li> </ul></li> <li>Common theme from many publishers is that search is a relatively minor part of their business and not a strategic focus. For example, Wal-Mart operates website as extension to retail and Best Buy's main goal of website is to provide presale info</li> <li>Most publishers hadn't seriously considered Bing due to poor monetization</li> <li>Amazon, which does use Bing and Google ads, uses a single syndication provider on a page to avoid showing the user the same ad multiple times on the same page; mixing and matching arrangement generally considered difficult by publishers</li> <li>Starting in 2008, Google systematically tried to lower revenue share for AdSense partners <ul> <li>E.g., &quot;Our general philosophy with renewals has been to reduce TAC across the board&quot;, &quot;2009 Traffic Acquisition Cost (TAC) was down 3 percentage points from 2008 attributable to the application of standardized revenue share guidelines for renewals and new partnerships...&quot;, etc.</li> </ul></li> <li>Google reduced payments (TAC) to AFS partners from 80.4% to 74% between Q1 2009 and Q1 2010</li> <li>No publisher viewed reduction as large enough to justify shifting to Bing or serving more display ads instead of search ads</li> </ul> <h6 id="ii-publishers-views-of-exclusivity-provisions">ii. 
Publishers' Views of Exclusivity Provisions</h6> <ul> <li>Some large publishers reported exclusive contracts and some didn't</li> <li>Most publishers with exclusivity provisions didn't complain about them</li> <li>A small number of technically sophisticated publishers were deeply concerned by exclusivity <ul> <li>These customers viewed search and search advertising as a significant part of business, have the sophistication to integrate multiple suppliers into on-line properties</li> <li>eBay: largest search and search ads partner, 27% of U.S. syndicated search queries in 2011 <ul> <li>Contract requires preferential treatment for AdSense ads, which eBay characterizes as equivalent to exclusivity</li> <li>eBay wanted this removed in last negotiation, but assented to not removing it in return for not having revenue share cut while most other publishers had revenue share cut</li> <li>eBay's testing indicates that Bing is competitive in some sectors, e.g., tech ads; they believe they could make more money with multiple search providers</li> </ul></li> <li>NexTag: In 2015, Google's 15th largest AFS customer <ul> <li>Had exclusivity, was able to remove it in 2010, but NexTag considers restrictions &quot;essentially the same thing as exclusivity&quot;; &quot;NexTag reports that moving away from explicit exclusivity even to this kind of de facto exclusivity required substantial, difficult negotiations with Google&quot;</li> <li>Has had discussions with Yahoo and Bing about using their products &quot;on a filler basis&quot;, but unable to do so due to Google contract restrictions</li> </ul></li> <li>business.com: B2B lead generation / vertical site; much smaller than above. Barely in top 60 of AdSense query volume <ul> <li>Exclusive agreement with Google</li> <li>Would test Bing and Yahoo without exclusive agreement</li> <li>Agreement also restricts how business.com can design pages</li> <li>Loosening exclusivity would improve business.com revenue and allow for new features that make the site more accessible and user-friendly</li> </ul></li> <li>Amazon: 2nd largest AFS customer after eBay; $175M from search syndication, $169M from Google AdSense <ul> <li>Amazon uses other providers despite their poor monetization due to concerns about having a single supplier; because Amazon operates on thin margins, $175M is a material source of profit</li> <li>Amazon concerned it will be forced to sign an exclusive agreement in next negotiation</li> <li>During last negotiation, Amazon wanted 5-year deal, Google would only give 1-year extension unless Amazon agreed to send Google 90% of search queries (Amazon refused to agree to this formally, although they do this)</li> </ul></li> <li>IAC: umbrella company operating ask.com, Newsweek, CityGrid, Urbanspoon, and other websites <ul> <li>Agreement is exclusive on a per-property basis</li> <li>IAC concerned about exclusivity. 
CityGrid wanted mix-and-match options, but couldn't compete with Google's syndication network, forced to opt into IAC's exclusive agreement; CityGrid wants to use other networks (including its own), but can't under agreement with Google</li> <li>IAC concerned about lack of competition in search and search advertising syndication</li> <li>Executive who expressed the above concerns left; new executive didn't see a possibility of splitting or moving traffic</li> <li>&quot;The departure of the key executive with the closest knowledge of the issues and the most detailed concerns suggests we may have significant issues obtaining clear, unambiguous testimony from IAC that reflects their earlier expressed concerns.&quot;</li> </ul></li> </ul></li> </ul> <h6 id="iii-effects-on-competitors">iii. Effects on Competitors</h6> <ul> <li>Microsoft asserts even a 5%-10% increase in query volume would be &quot;very meaningful&quot; and Google's exclusive and restrictive agreements deny Microsoft the incremental scale needed to be a more efficient competitor</li> <li>Specialty search ad platforms also impacted; IAC sought to build platform for local search advertising, but Google's exclusivity provisions &quot;make it less likely that small local competitors like IAC's nascent offering can viably emerge.&quot;</li> </ul> <h2 id="iii-legal-analysis">III. LEGAL ANALYSIS</h2> <ul> <li>&quot;A monopolization claim under Section 2 of the Sherman Act, 15 U.S.C. § 2, has two elements: (i) the 'possession of monopoly power in the relevant market' and (ii) the 'willful acquisition or maintenance of that power as distinguished from growth or development as a consequence of a superior product, business acumen, or historic accident.'&quot;</li> <li>&quot;An attempted monopolization claim requires a showing that (i) 'the defendant has engaged in predatory or anticompetitive conduct' with (ii) 'a specific intent to monopolize' and (iii) a dangerous probability of achieving or maintaining monopoly power.&quot;</li> </ul> <h3 id="a-google-has-monopoly-power-in-relevant-markets">A. GOOGLE HAS MONOPOLY POWER IN RELEVANT MARKETS</h3> <ul> <li>&quot;'A firm is a monopolist if it can profitably raise prices substantially above the competitive level. [M]onopoly power may be inferred from a firm's possession of a dominant share of a relevant market that is protected by entry barriers.' Google has monopoly power in one or more properly defined markets.&quot;</li> </ul> <h4 id="1-relevant-markets-and-market-shares">1. Relevant Markets and Market Shares</h4> <ul> <li>&quot;A properly defined antitrust market consists of 'any grouping of sales whose sellers, if unified by a hypothetical cartel or merger, could profitably raise prices significantly above the competitive level.'&quot;</li> <li>&quot;Typically, a court examines 'such practical indicia as industry or public recognition of the submarket as a separate economic entity, the product's peculiar characteristics and uses, unique production facilities, distinct customers, distinct prices, sensitivity to price changes, and specialized vendors.'&quot;</li> <li>&quot;Staff has identified three relevant antitrust markets.&quot;</li> </ul> <h5 id="a-horizontal-search">a.
Horizontal Search</h5> <ul> <li>Vertical search engines not a viable substitute for horizontal search; formidable barriers to expanding into horizontal search</li> <li>Vertical search properties could pick up query volume in response to SSNIP (small but significant non-transitory increase in price) in horizontal search, potentially displacing horizontal search providers</li> <li>Google views these with concern, has aggressively moved to build its own vertical offerings</li> <li>No mechanism for vertical search properties to broadly discipline a monopolist in horizontal search <ul> <li>Web search queries monetized through search ads, ads sold by keyword which have independent demand functions. So, at best, monopolist might be inhibited from SSNIP on a narrow set of keywords with strong vertical competition. But for billions of queries with no strong vertical, nothing constrains monopolist from SSNIP</li> </ul></li> <li>Where vertical websites exist, still hard to compete; comprehensive coverage of all areas seems to be important driver of demand, even to websites focusing on specific topics. Eric Schmidt noted this: <ul> <li>&quot;So if you, for example, are an academic researcher and you use Google 30 times for your academics, then perhaps you'll want to buy a camera... So long as the product is very, very, very, very good, people will keep coming back... The general product then creates the brand, creates demand and so forth. Then occasionally, these ads get clicked on&quot;</li> </ul></li> <li>Schmidt's testimony corroborated by several vertical search firms, who note that they're dependent on horizontal search providers for traffic because vertical search users often start with Google, Bing, or Yahoo</li> <li>When asked about competitors in search, Eric Schmidt mentioned zero vertical properties <ul> <li>Google internal documents monitor Bing and Yahoo and compare quality. Sergey Brin testified that he wasn't aware of any such regular comparison against vertical competitors</li> </ul></li> <li>Relevant geo for web search limited to U.S. here; search engines return results relevant to users in country they're serving, so U.S. users unlikely to view foreign-specialized search engines as viable substitute</li> <li>Although Google has managed to cross borders, other major international search engines (Baidu, Yandex) have failed to do this</li> <li>Google dominant for &quot;general search&quot; in U.S.; 66.7% share according to ComScore, and also provides results to ask.com and AOL, another 4.6%</li> <li>Yahoo 15%, Bing 14%</li> <li>Google's market share above generally accepted floor for monopolization; defendants with share in this range have been found to have monopoly power</li> </ul> <h5 id="b-search-advertising">b. 
Search Advertising</h5> <ul> <li>Search ads likely a properly defined market</li> <li>Search ads distinguishable from other online ads, such as display ads, contextual ads, behavioral ads, and social media ads, due to &quot;inherent scale, targetability, and control&quot; <ul> <li>Google: &quot;[t]hey are such different products that you do not measure them against one another and the technology behind the products is different&quot;</li> </ul></li> <li>Evidence suggests search and display ads are complements, not substitutes <ul> <li>&quot;Google has observed steep click declines when advertisers have attempted to shift budget to display advertising&quot;</li> <li>Chevrolet suspended search ads for 2 weeks and relied on display ads alone; lost 30% of clicks</li> </ul></li> <li>New ad offerings don't fit into traditional search or display categories: contextual, re-targeted display (or behavioral), social media <ul> <li>Only search ads allow advertisers to show an ad at the moment the user is expressing an interest; numerous advertisers confirmed this point</li> <li>Search ads convert at much higher rate due to this advantage</li> </ul></li> <li>Numerous advertisers report they wouldn't shift ad spend away from search ads if prices increased by more than a SSNIP. Living Social would need 100% price increase before shifting ads (a minority of advertisers reported they would move ad dollars from search in response to SSNIP)</li> <li>Google internal documents and testimony confirm lack of viable substitute for search ads. AdWords VP Nick Fox and chief economist Hal Varian have stated that search ad spend doesn't come at expense of other ad dollars, Eric Schmidt has testified multiple times that search ads are the most effective ad tool, has best ROI</li> <li>Google, through AdWords, has 76% to 80% of the market according to industry-wide trackers (rival Bing-Yahoo has 12% to 16%)</li> <li>[It doesn't seem wrong to say that search ads are a market and that Google dominates that market, but the primacy of search ads seems overstated here? Social media ads, just becoming important at the time, ended up becoming very important, and of course video as well]</li> </ul> <h5 id="c-syndicated-search-and-search-advertising-search-intermediation">c. Syndicated Search and Search Advertising (&quot;Search Intermediation&quot;)</h5> <ul> <li>Syndicated search and search advertising (&quot;search intermediation&quot;) are likely a properly defined product market</li> <li>Horizontal search providers sell (&quot;syndicate&quot;) services to other websites</li> <li>Search engine can also return search ads to the website; search engine and website share revenue</li> <li>Consumers are websites that want search; sellers are horizontal search providers, Google, Bing, Yahoo</li> <li>Publishers of various sizes consistent on cross-elasticity of demand; report that search ad syndication monetizes better than display advertising or other content</li> <li>No publisher told us that modest (5% to 10%) increase in price for search and search ad syndication would lead them to favor other forms of advertising or web content</li> <li>Google's successful efforts to systematically reduce TAC support this and are a natural experiment to determine likely response to SSNIP</li> <li>Google, via AdSense, is dominant provider of search and search ad syndication; 75% of market according to ComScore (Microsoft and Yahoo combine for 22%)</li> </ul> <h4 id="2-substantial-barriers-to-entry-exist">2. 
Substantial Barriers to Entry Exist</h4> <ul> <li>&quot;Developing and maintaining a competitively viable search or search ad platform requires substantial investment in specialized knowledge, technology, infrastructure, and time. These markets are also characterized by significant scale effects&quot;</li> </ul> <h5 id="a-technology-and-specialization">a. Technology and Specialization</h5> <ul> <li>[no notes, extremely obvious to anyone technical who's familiar with the area]</li> </ul> <h5 id="b-substantial-upfront-investment">b. Substantial Upfront Investment</h5> <ul> <li>Enormous investments required. For example, in 2011, Google spent $5B on R&amp;D. And in 2010, MS spent more than $4.5B developing algorithms and building physical capacity for Bing</li> </ul> <h5 id="c-scale-effects">c. Scale Effects</h5> <ul> <li>More usage leads to better algorithms and greater accuracy w.r.t. what consumers want</li> <li>Also leads to greater number of advertisers</li> <li>Greater number of advertisers and consumers leads to better ad serving accuracy and better monetization of ads, which leads to better monetization for search engine, advertisers, and syndication partners</li> <li>Cyclical effect, &quot;virtuous cycle&quot;</li> <li>According to Microsoft, greatest barrier is obtaining sufficient scale. Losing $2B/yr trying to compete with Google, and Bing is the only horizontal search platform competing with Google</li> </ul> <h5 id="d-reputation-brand-loyalty-and-the-halo-effect">d. Reputation, Brand Loyalty, and the &quot;Halo Effect&quot;</h5> <ul> <li>[no notes]</li> </ul> <h5 id="e-exclusive-and-restrictive-agreements">e. Exclusive and Restrictive Agreements</h5> <ul> <li>&quot;Google's exclusive and restrictive agreements pose yet another barrier to entry, as many potential syndication partners with a high volume of customers are locked into agreements with Google.&quot;</li> </ul> <h3 id="b-google-has-engaged-in-exclusionary-conduct">B. GOOGLE HAS ENGAGED IN EXCLUSIONARY CONDUCT</h3> <ul> <li>&quot;Conduct may be judged exclusionary when it tends to exclude competitors 'on some basis other than efficiency,' i.e., when it 'tends to impair the opportunities of rivals' but 'either does not further competition on the merits or does so in an unnecessarily restrictive way.' In order for conduct to be condemned as 'exclusionary,' Staff must show that Google's conduct likely impairs the ability of its rivals to compete effectively, and thus to constrain Google's exercise of monopoly power&quot;</li> </ul> <h4 id="1-google-s-preferencing-of-google-vertical-properties-within-its-serp">1. Google's Preferencing of Google Vertical Properties Within Its SERP</h4> <ul> <li>&quot;Although we believe that this is a close question, we conclude that Google's preferencing conduct does not violate Section 2.&quot;</li> </ul> <h5 id="a-google-s-product-design-impedes-vertical-competitors">a. Google's Product Design Impedes Vertical Competitors</h5> <ul> <li>&quot;As a general rule, courts are properly very skeptical about claims that competition has been harmed by a dominant firm's product design changes. Judicial deference to product innovation, however, does not mean that a monopolist's product design decisions are per se lawful&quot;, United States v. 
Microsoft</li> <li>We evaluate, through the Microsoft lens of monopoly maintenance, whether Google took these actions to impede a nascent threat to Google's monopoly power</li> <li>&quot;Google's internal documents explicitly reflect - and testimony from Google executives confirms - a concern that Google was at risk of losing, in particular, highly profitable queries to vertical websites&quot;</li> <li>VP of product management Nicholas Fox: <ul> <li>&quot;[Google's] inability to serve this segment [of vertical lead generation] well today is negatively impacting our business. Query growth among high monetizing queries (&gt;$120 RPM) has declined to ~0% in the UK. US isn't far behind (~6%). There's evidence (e.g., UK Finance) that we're losing share to aggregators&quot;</li> </ul></li> <li>Threat to Google isn't vertical websites displacing Google, but that they'll undercut Google's power over the most lucrative segments of its search and search ads portfolio</li> <li>Additionally, vertical websites could help erode barriers to growth for general search competitors</li> </ul> <h5 id="b-google-s-serp-changes-have-resulted-in-anticompetitive-effects">b. Google's SERP Changes Have Resulted In Anticompetitive Effects</h5> <ul> <li>Google expanding its own offerings while demoting rival offerings caused significant drops in traffic to rivals, confirmed by Google's internal data</li> <li>Google's prominent placement of its own Universal Search properties led to gains in share of its own properties <ul> <li>&quot;For example, Google's inclusion of Google Product Search as a Universal Search result turned a property that the Google product team could not even get <i>indexed</i> by Google's web search results into the number one viewed comparison shopping website on Google&quot;</li> </ul></li> </ul> <h5 id="c-google-s-justifications-for-the-conduct">c. Google's Justifications for the Conduct</h5> <ul> <li>&quot;Product design change is an area of conduct where courts do not tend to strictly scrutinize asserted procompetitive justifications. In any event, Google's procompetitive justifications are compelling.&quot;</li> <li>Google argues design changes to SERP have improved product, provide consumers with &quot;better&quot; results</li> <li>Google notes that path toward Universal Search and OneBox predates concern about vertical threat</li> <li>Google justifies preferential treatment of Universal Search by asserting &quot;apples and oranges&quot; problem prevents Google from doing head-to-head comparison of its property vs. competing verticals, verticals and web results ranked with different criteria. This seems to be correct. <ul> <li>Microsoft says Bing uses a single signal, click-through-rate, that can be compared across Universal Search content and web search results</li> </ul></li> <li>Google claims that its Universal Search results are more helpful than &quot;blue links&quot; to other comparison shopping websites</li> <li>Google claims that showing 3rd party data would create technical and latency issues <ul> <li>&quot;The evidence shows that it would be technologically feasible to serve up third-party results in Google's Universal Search results. 
Indeed, Bing does this today with its flight vertical, serving up Kayak results and Google itself originally considered third-party OneBoxes&quot;</li> </ul></li> <li>Google defends &quot;demotion&quot; of competing vertical content, &quot;arguing that Google's algorithms are designed solely with the goal of improving a user's search experience&quot; <ul> <li>&quot;one aspect of Google's demotions that especially troubles Staff - and is not addressed by the above justification - is the fact that Google routinely, and prominently, displays its own vertical properties, while simultaneously demoting properties that are <i>identical</i> to its own, but for the fact that the latter are competing vertical websites&quot;, See Brin Tr. 79:16-81:24 (acknowledging the similarities between Google Product Search and its competitors); Fox Tr. 204:6-204:20 (acknowledging the similarities between Google Product Search and its competitors).</li> </ul></li> </ul> <h5 id="d-google-s-additional-legal-defenses">d. Google's Additional Legal Defenses</h5> <ul> <li>&quot;Google has argued - successfully in several litigations - that it owes no duty to assist in the promotion of a rival's website or search platform, and that it owes no duty to promote a rival's product offering over its own product offerings&quot;</li> <li>&quot;one reading of Trinko and subsequent cases is that Google is privileged in blocking rivals from its search platform unless its conduct falls into one of several specific exceptions referenced in Trinko&quot; <ul> <li>&quot;Alternatively, one may argue that Trinko should not be read so broadly as to overrule swathes of antitrust doctrine.&quot;</li> </ul></li> <li>&quot;Google has long argued that its general search results are opinions that are protected speech under the First Amendment, and that such speech should not be subject to government regulation&quot;; staff believes this is overbroad</li> <li>&quot;the evidence paints a complex portrait of a company working toward an overall goal of maintaining its market share by providing the best user experience, while simultaneously engaging in tactics that resulted in harm to many vertical competitors, and likely helped to entrench Google's monopoly power over search and search advertising&quot;</li> <li>&quot;The determination that Google's conduct is anticompetitive, and deserving of condemnation, would require an extensive balancing of these factors, a task that courts have been unwilling - in similar circumstances - to perform under Section 2. Thus, although it is a close question, Staff does not recommend that the Commission move forward on this cause of action.&quot;</li> </ul> <h4 id="2-google-s-scraping-of-rivals-vertical-content-1">2. Google's &quot;Scraping&quot; of Rivals' Vertical Content</h4> <ul> <li>&quot;We conclude that this conduct violates Section 2 and Section 5.&quot;</li> </ul> <h5 id="a-google-s-scraping-constitutes-a-conditional-refusal-to-deal-or-unfair-method-of-competition">a. Google's &quot;Scraping&quot; Constitutes a Conditional Refusal to Deal or Unfair Method Of Competition</h5> <ul> <li>Scraping and threats of refusal to deal with some competitors can be condemned as conditional refusal to deal under Section 2</li> <li>Post-Trinko, identification of circumstances (&quot;[u]nder certain circumstances, a refusal to cooperate with rivals can constitute anticompetitive conduct and violate § 2&quot;) &quot;subject of much debate&quot;</li> <li>Aspen Skiing Co. v. 
Aspen Highlands Skiing Corp: defendant (owner of 3 of 4 ski areas in Aspen) canceled all-ski-area ticket with plaintiff (owner of 4th ski area in Aspen) <ul> <li>After demanding an increased share of profit, defendant canceled ticket and rejected &quot;increasingly desperate measures&quot; to recreate joint ticket, even rejected plaintiff's offer to buy tickets at retail price</li> <li>Supreme Court upheld jury's finding of liability; Trinko court: &quot;unilateral termination of a voluntary (and thus presumably profitable) course of dealing suggested a willingness to forsake short-term profits to achieve an anticompetitive end. Similarly, the defendant's unwillingness to renew the ticket even if compensated at retail price revealed a distinctly anticompetitive bent&quot;</li> </ul></li> <li>Appellate courts have focused on Trinko's reference to &quot;unilateral termination of a voluntary course of dealing&quot;, e.g., in American Central Eastern Texas Gas Co. v. Duke Energy Fuels LLC, Fifth Circuit upheld determination that defendant natural gas processor's refusal to contract with competitor for additional capacity was unlawful <ul> <li>Plaintiff contracted with defendant for processing capacity; after two years, defendant proposed terms it &quot;knew were unrealistic or completely unviable ... in order to exclude [the plaintiff] from competition with [the defendant] in the gas processing market.&quot;</li> </ul></li> <li>Case here is analogous to Aspen Skiing and Duke Energy [a lot of detail not written down in notes here]</li> </ul> <h5 id="b-google-s-scraping-has-resulted-in-anticompetitive-effects">b. Google's &quot;Scraping&quot; Has Resulted In Anticompetitive Effects</h5> <ul> <li>Scraping has lessened the incentives of competing websites like Yelp, TripAdvisor, CitySearch, and Amazon to innovate, diminishes incentives of other vertical websites to develop new products <ul> <li>entrepreneurs more reluctant to develop new sites, investors more reluctant to sponsor development when Google can use its monopoly power to appropriate content it deems lucrative</li> </ul></li> </ul> <h5 id="c-google-s-scraping-is-not-justified-by-efficiencies">c. Google's &quot;Scraping&quot; Is Not Justified By Efficiencies</h5> <ul> <li>&quot;Marissa Mayer and Sameer Samat testified that it was extraordinarily difficult for Google, as a technical matter, to remove sites like Yelp from Google Local without also removing them from web search results&quot; <ul> <li>&quot;Google's almost immediate compliance after Yelp sent a formal 'cease and desist' letter to Google, however, suggests that the &quot;technical&quot; hurdles were not a significant factor in Google's refusal to comply with repeated requests to remove competitor content from Google Local&quot;</li> <li>Partners can opt out of inclusion with Google's vertical news offering, Google News</li> <li>&quot;Similarly, Google's almost immediate removal of Amazon product reviews from Google Product Search indicates that technical barriers were quickly surmounted when Google desired to accommodate a partner.&quot;</li> </ul></li> <li>&quot;In sum, the evidence shows that Google used its monopoly position in search to scrape content from rivals and to improve its own complementary vertical offerings, to the detriment of those rivals, and without a countervailing efficiency justification. Google's scraping conduct has helped it to maintain, preserve, and enhance Google's monopoly position in the markets for search and search advertising. 
Accordingly, we believe that this conduct should be condemned by the Commission.&quot;</li> </ul> <h4 id="3-google-s-api-restrictions-1">3. Google's API Restrictions</h4> <ul> <li>&quot;We conclude that Google's API restrictions violate Section 2.&quot;</li> <li>AdWords API procompetitive development</li> <li>But restrictive conditions in API usage agreement anticompetitive, without offsetting procompetitive benefits</li> <li>&quot;Should the restrictive conditions be found to be unreasonable restraints of trade, they could be removed today instantly, with no adverse effect on the functioning of the API. Any additional engineering required to make the advertiser data interoperable with other search networks would be supplied by other market participants. Notably, because Google would not be required to give its competitors access to the AdWords API, there is no concern about whether Google has a duty to deal with its competitors&quot;</li> </ul> <h5 id="a-the-restrictive-conditions-are-unreasonable">a. The Restrictive Conditions Are Unreasonable</h5> <ul> <li>Restrictive conditions limit ability of advertisers to use their own data, prevent the development and sale of 3rd party tools and services that would allow automated campaign management across multiple search networks</li> <li>&quot;Even Google is constrained by these restrictions, having had to forgo improving its DART Search tool to offer such capabilities, despite internal estimates that such functionality would benefit Google and advertisers alike&quot;</li> <li>Restrictive conditions have no procompetitive virtues, anticompetitive effects are substantial</li> </ul> <h5 id="b-the-restrictive-conditions-have-resulted-in-anticompetitive-effects">b. The Restrictive Conditions Have Resulted In Anticompetitive Effects</h5> <ul> <li>Restrictive conditions reduce innovation, increase transaction costs, degrade quality of Google's rivals in search and search advertising</li> <li>Several SEMs forced to remove campaign cloning functionality by Google; Google's restrictive conditions stopped cross-network campaign management tool market segment in its infancy</li> <li>Restrictive conditions increase transaction costs for all advertisers other than those large enough to make internal investments to develop their own tools [doesn't it also, in some amortized fashion, increase transaction costs for companies that can build their own tools?]</li> <li>Result is that advertisers spend less on non-dominant search networks, reducing quality of ads on non-dominant search networks</li> </ul> <h5 id="c-the-restrictive-conditions-are-not-justified-by-efficiencies">c. 
The Restrictive Conditions Are Not Justified By Efficiencies</h5> <ul> <li>Concern about &quot;misaligned incentives&quot; is Google's only justification for restrictive conditions; concern is that SEMs and agencies would adopt a &quot;lowest common denominator&quot; approach and degrade AdWords campaign performance</li> <li>&quot;The evidence shows that this justification is unsubstantiated and is likely a pretext&quot;</li> <li>&quot;In brief, these third parties' incentives are highly aligned with Google's interests, precisely the opposite of what Google contends.&quot;</li> <li>Google unable to identify any examples of ill effects from misaligned incentives</li> <li>Terms and Conditions already have conditions for minimum functionality that prevents lowest common denominator concern from materializing</li> <li>Documents suggest restrictive conditions were not about &quot;misaligned incentives&quot;: <ul> <li>&quot;Sergey [Brin] and Larry [Page] are big proponents of a protectionist strategy that prevents third party developers from building offerings which promote the consolidated management of [keywords] on Google and Overture (and whomever else).&quot;</li> <li>In a 2004 doc, API product manager was looking for &quot;specific points on how we can prevent a new entrant (MSN Ad Network) from benefitting from a common 3rd party platform that is cross-network.&quot;</li> <li>In a related presentation, Google lists as a concern, &quot;other competitors are buoyed by lowered barriers to entry&quot;; options to prevent this were &quot;applications must have Google-centric UI functions and branding&quot; and &quot;disallow cross-network compatible applications from using API&quot;</li> </ul></li> </ul> <h4 id="4-google-s-exclusive-and-restrictive-syndication-agreements-1">4. Google's Exclusive and Restrictive Syndication Agreements</h4> <ul> <li>&quot;Staff has investigated whether Google has entered into anticompetitive, exclusionary agreements with websites for syndicated search and search advertising services (AdSense agreements) that serve to maintain, preserve, or enhance Google's monopoly power in the markets for search, search advertising, or search and search advertising syndication (search intermediation). We conclude that these agreements violate Section 2.&quot;</li> </ul> <h5 id="a-google-s-agreements-foreclose-a-substantial-portion-of-the-relevant-market">a. Google's Agreements Foreclose a Substantial Portion of the Relevant Market</h5> <ul> <li>&quot;Exclusive deals by a monopolist harm competition by foreclosing rivals from needed relationships with distributors, suppliers, or end users. For example, in Microsoft, then-defendant Microsoft's exclusive agreements with original equipment manufacturers and software vendors were deemed anticompetitive where they were found to prevent third parties from installing rival browser Netscape, thus foreclosing Netscape from the most efficient distribution channel, and helping Microsoft to preserve its operating system monopoly. The fact that an agreement is not explicitly exclusive does not preclude a finding of liability.&quot;</li> <li>[notes on legal background of computing foreclosure percentage omitted]</li> <li>Staff relied on ComScore dataset to compute foreclosure; per Microsoft's own data, Microsoft and Yahoo's syndicated query volume is higher than in ComScore, resulting in a lower foreclosure number. &quot;We are trying to get to the bottom of this discrepancy now. 
However, based on our broader understanding of the market, we believe that the ComScore set more accurately reflects the relative query shares of each party.&quot; [I don't see why staff should believe that ComScore is more accurate than Microsoft's numbers — I would guess the opposite]</li> <li>[more notes on foreclosure percentage omitted]</li> </ul> <h5 id="b-google-s-agreements-have-resulted-in-anticompetitive-effects">b. Google's Agreements Have Resulted In Anticompetitive Effects</h5> <ul> <li>Once foreclosure is established as being above &quot;safe harbor&quot; levels, need a qualitative, rule of reason analysis of market effects</li> <li>Google's exclusive agreements impact immediate market for search and search syndication advertising and have broader effects in markets for search and search advertising</li> <li>In search and search ad syndication (search intermediation), exclusivity precludes some of the largest and most sophisticated publishers from using competing platforms. Publishers can't credibly threaten to shift some incremental business to other platforms to get price concessions from Google <ul> <li>Google's aggressive reduction of revenue shares to customers without significant resistance =&gt; agreements seem to be further entrenching Google's monopoly position</li> </ul></li> <li>An objection to this could be that Google's business success is because its product is superior <ul> <li>This argument rests on fallacious assumption that Bing's <i>average</i> monetization gap is <i>consistent</i> across the board</li> </ul></li> <li>[section on CityGrid impact omitted; this section speaks to broader market effects]</li> <li>Google insists that incremental traffic to Microsoft would be trivial; Microsoft indicates it would be &quot;very meaningful&quot; <ul> <li>Not enough evidence for definitive conclusion, but &quot;internal Google documents suggest that Microsoft's view of things may be closer to the truth&quot;; Google's interest in renewing deals was in part to prevent Microsoft from gaining scale. Internal Google analysis of 2010 AOL renewal: &quot;AOL holds marginal search share but represents scale gains for a Microsoft + Yahoo! partnership. AOL/Microsoft combination has modest impact on market dynamics, but material increase in scale of Microsoft's search &amp; ads platform&quot;</li> <li>When informed that &quot;Microsoft [is] aggressively wooing AOL with large guarantees,&quot; a Google exec responded with: &quot;I think the worse case scenario here is that AOL users get sent to Bing, so even if we make AOL a bit more competitive relative to Google, that seems preferable to growing Bing.&quot;</li> <li>Google internal documents show they pursued AOL deal aggressively even though AOL represented &quot;[a] low/no profit partnership for Google.&quot;</li> </ul></li> <li>Evidence is that, in near-term, removing exclusivity would not have dramatic impact; largest and most sophisticated publishers would shift modest amounts of traffic to Bing</li> <li>Most significant competitive benefits realized over longer period of time <ul> <li>&quot;Removing exclusivity may open up additional opportunities for both established and nascent competitors, and those opportunities may spur more significant changes in the market dynamics as publishers have the opportunity to consider - and test - alternatives to Google's AdSense program.&quot;</li> </ul></li> </ul> <h5 id="c-google-s-agreements-are-not-justified-by-efficiencies">c. 
Google's Agreements Are Not Justified By Efficiencies</h5> <ul> <li>Google has given three business justifications for exclusive and restrictive syndication agreements <ul> <li>Long-standing industry practice of exclusivity, dating from when publishers demanded large, guaranteed revenue share payments regardless of performance <ul> <li>&quot;guaranteed revenue shares are now virtually non-existent&quot;</li> </ul></li> <li>&quot;Google is simply engaging in a vigorous competition with Microsoft for exclusive agreements&quot; <ul> <li>&quot;Google may argue that the fact that Microsoft is losing in a competitive bidding process (and indeed, not competing as vigorously as it might otherwise) is not a basis on which to condemn Google. However, Google has effectively created the rules of today's game, and Microsoft's substantial monetization disadvantage puts it in a poor competitive position to compete on an all-or-nothing basis.&quot;</li> </ul></li> <li>&quot;user confusion&quot; — &quot;Google claims that it does not want users to confuse a competitor's poor advertisements with its own higher quality advertisements&quot; <ul> <li>&quot;This argument suffers both from the fact that it is highly unlikely that users care about the source of the ad, as well as the fact that, if users did care, less restrictive alternatives are clearly available. Google has not explained why alternatives such as labeling competitor advertisements as originating from the competitor are unavailing here.&quot;</li> <li>&quot;Google's actions demonstrate that &quot;user confusion&quot; is not a significant concern. In 2008 Google attempted to enter into a non-exclusive agreement with Yahoo! to supplement Yahoo!'s search advertising platform. Under the proposed agreement, Yahoo! would return its own search advertising, but supplement its inventory with Google search advertisements when Yahoo! did not have sufficient inventory. Additionally, Google has recently eliminated its &quot;preferred placement&quot; restriction for its online partners.&quot;</li> </ul></li> </ul></li> <li>Rule of reason analysis shows strong evidence of market protected by high entry barriers</li> <li>Despite limitations to evidence, market is inarguably not robustly competitive today <ul> <li>Google has been unilaterally reducing revenue share with apparent impunity</li> </ul></li> </ul> <h2 id="iv-potential-remedies">IV. POTENTIAL REMEDIES</h2> <h3 id="a-scraping">A. Scraping</h3> <ul> <li>At least two possible remedies</li> <li>Opt-out to remove snippets of content from Google's vertical properties, while retaining placement in web search results and/or Universal Search results on main SERP</li> <li>Google could be required to limit use of content it indexes for web search (could only use content in returning the property in its search results, but not for determining its own product or local rankings) unless given explicit permission</li> </ul> <h3 id="b-api-restrictions">B. API Restrictions</h3> <ul> <li>Require Google to remove problematic contractual restrictions; no technical fixes necessary <ul> <li>SEMs report that technology for cross-compatibility already exists, will quickly flourish if unhindered by Google's contractual constraints</li> </ul></li> </ul> <h3 id="c-exclusive-and-restrictive-syndication-agreements">C. 
Exclusive and Restrictive Syndication Agreements</h3> <ul> <li>Most appropriate remedy is to enjoin Google from entering exclusive agreements with search syndication partners, and to require Google to loosen restrictions surrounding AdSense partners' use of rival search ads</li> </ul> <h2 id="v-litigation-risks">V. LITIGATION RISKS</h2> <ul> <li>Google does not charge customers, and they are not locked into Google</li> <li>Universal Search has resulted in substantial benefit to users</li> <li>Google's organization and aggregation of content adds value to product for customers</li> <li>Largest advertisers advertise on both Google AdWords and Microsoft AdCenter</li> <li>Most efficient channel through which Bing can gain scale is Bing.com</li> <li>Microsoft has the resources to purchase distribution where it sees greatest value</li> <li>Most website publishers happy with AdSense</li> </ul> <h2 id="vi-conclusion">VI. CONCLUSION</h2> <ul> <li>&quot;Staff concludes that Google's conduct has resulted - and will result - in real harm to consumers and to innovation in the online search and advertising markets. Google has strengthened its monopolies over search and search advertising through anticompetitive means, and has forestalled competitors' and would-be competitors' ability to challenge those monopolies, and this will have lasting negative effects on consumer welfare&quot; <ul> <li>&quot;Google has unlawfully maintained its monopoly over general search and search advertising, in violation of Section 2, or otherwise engaged in unfair methods of competition, in violation of Section 5, by scraping content from rival vertical websites in order to improve its own product offerings.&quot;</li> <li>&quot;Google has unlawfully maintained its monopoly over general search, search advertising, and search syndication, in violation of Section 2, or otherwise engaged in unfair methods of competition, in violation of Section 5, by entering into exclusive and highly restrictive agreements with web publishers that prevent publishers from displaying competing search results or search advertisements.&quot;</li> <li>&quot;Google has unlawfully maintained its monopoly over general search and search advertising, in violation of Section 2, or otherwise engaged in unfair methods of competition, in violation of Section 5, by maintaining contractual restrictions that inhibit the cross-platform management of advertising campaigns.&quot;</li> </ul></li> <li>&quot;For the reasons set forth above, Staff recommends that the Commission issue the attached complaint.&quot;</li> <li>Memo submitted by Barbara R. Blank, approved by Geoffrey M. 
Green and Melanie Sabo</li> </ul> <h1 id="ftc-be-staff-memo">FTC BE staff memo</h1> <p>&quot;Bureau of Economics</p> <p>August 8, 2012</p> <p>From: Christopher Adams and John Yun, Economists&quot;</p> <h2 id="executive-summary-1">Executive Summary</h2> <ul> <li>Investigation into anticompetitive conduct started June 2011</li> <li>Staff presented theories and evidence February 2012</li> <li>This memo offers our final recommendation</li> <li>Four theories of harm <ul> <li>preferencing of search results by favoring own web properties over rivals</li> <li>exclusive agreements with publishers and vendors that deprive rival platforms of users and advertisers</li> <li>restrictions on porting advertiser data to rival platforms</li> <li>misappropriating content from Yelp and TripAdvisor</li> </ul></li> <li>&quot;our guiding approach must be beyond collecting complaints and antidotes [presumably meant to be anecdotes?] from competitors who were negatively impacted from a firm's various business practices.&quot;</li> <li>Market power in search advertising <ul> <li>Google has &quot;significant&quot; share, 65% of paid clicks and 53% of ad impressions among top 5 U.S. search engines</li> <li>Market power may be mitigated by the fact that 80% of users use a search engine other than Google</li> <li>Empirical evidence consistent with search and non-search ads being substitutes, and with Google considering vertical search engines to be competitors</li> </ul></li> <li>Preferencing theory <ul> <li>Theory is that Google is blending its proprietary content with customary &quot;blue links&quot; and demoting competing sites</li> <li>Google has limited ability to impose significant harm on vertical rivals because it accounts for 10% to 20% of traffic to them. Effect is very small and not statistically significant <ul> <li>[Funny that something so obviously wrong at the time and also seemingly wrong in retrospect was apparently taken seriously]</li> </ul></li> <li>Universal Search was a procompetitive response to pressure from vertical sites and an improvement for users</li> </ul></li> <li>Exclusive agreements theory <ul> <li>Access to a search engine's site (i.e., not dependent on 3rd party agreement) is most efficient and common distribution channel, which is not impeded by Google. Additionally, strong reasons to doubt that search toolbars and default status on browsers can be viewed as &quot;exclusives&quot; because users can easily switch (on desktop and mobile) <ul> <li>[statement implies another wrong model of what's happening here]</li> <li>[Specifically on easy switching on mobile, <a href="https://twitter.com/danluu/status/1016164712030134272">there's Google's actual blocking of changing the default search engine from Google to what the user wants</a>, but we also know that a huge fraction of users basically don't understand what's happening and can't make an informed decision to switch — if this weren't the case, it wouldn't make sense for companies to bid so high for defaults, e.g. supposedly $26B/yr to obtain default search engine status on iOS; if users simply switched freely, default status would be worth close to $0. Since this payment is, at the margin, pure profit and Apple's P/E ratio is 29.53 as of my typing this sentence, a quick and dirty estimate is that $776B of Apple's market cap is attributable to taking this payment vs. 
randomly selecting a default]</li> <li>[In addition to explicit, measurable coercion like the above, there were also things like <a href="https://twitter.com/danluu/status/1019758649739304960">Google pressuring Samsung into shutting down their Android Browser effort in 2012</a>; although enforcing a search engine default on Android was probably not the primary driver of that or other similar pressure that Google applied, many of these sorts of things also had the impact of funneling users into Google on mobile; these economists seem to like the incentive-based argument that users will use the best product, so the result we see in the market is the best product winning, but if that's the case, why do companies spend so much effort on ecosystem lock-in, including but not limited to supposedly paying $18B/yr to own the default setting in one browser? I guess the argument here is that companies are behaving completely irrationally in expending so much effort here, but consumers are behaving perfectly rationally and are fully informed and are not influenced by all of this spending at all?]</li> <li>In search syndication, Microsoft and Yahoo have a combined share greater than Google's</li> <li>No support for assertion that rivals' access to users has been impaired by Google. MS and Yahoo have had a steady 30% share for years; their query volume has grown faster than Google's since the alliance was announced <ul> <li>[Another odd statement; at the time, observers didn't see Bing staying competitive without heavy subsidies from MS, and then MS predictably stopped subsidizing Bing as a big bet and its market share declined. Google's search market share is well above 90% and hasn't been below 90% since the BE memo was written; in the U.S., estimates put Google around 90% share, some a bit below and some a bit above, with low estimates at something like 87%. It's odd that someone could look at the situation at the time and not see that this was about to happen]</li> </ul></li> <li>In December 2011, Microsoft had access to query volume equivalent to what Google had 2 years ago, thus difficult to infer that Microsoft is below some threshold of query volume <ul> <li>[this exact argument was addressed in the BC memo; the BE memo does not appear to refute the BC memo's argument]</li> <li>[As with a number of the above arguments, this is a strange argument if you understand the dynamics of fast-growing tech companies. When you have rapidly growing companies in markets with network effects or scale effects, being the same absolute size as a competitor a number of years ago doesn't mean that you're in an ok position. We've seen this play out in a ton of markets and it's fundamental to why VCs shovel so much money at companies in promising markets — being a couple years behind often means you get crushed or, if you're lucky, end up as an also-ran that's fighting an uphill battle against scale effects]</li> </ul></li> <li>Characteristics of online search market not consistent with Google buying distribution agreements to raise input costs of rivals</li> </ul></li> <li>Restrictions on porting advertiser data to AdWords API <ul> <li>Theory is that Google's terms and conditions for AdWords API anticompetitively disadvantage Microsoft's adCenter</li> <li>Introduction of API with co-mingling restriction made users and Google better off and rivals' costs were unaffected. Any objection therefore implies that when Google introduced the API, it had an obligation to allow its rivals to benefit from increased functionality. 
Significant risks to long-term innovation incentives from imposing such an obligation [Huh, this seems very weird]</li> <li>Advertisers responsible for overwhelming majority of search ad spend use both Google and Microsoft. Multi-homing advertisers of all sizes spend a significant share of budget on Microsoft [this exact objection is addressed in BC memo]</li> <li>Evidence from SEMs and end-to-end advertisers suggests policy's impact on ad spend on Microsoft's platform is negligible [it's hard to know how seriously to take this considering the comments on Yelp, above — the model of how tech businesses work seems very wrong, which casts doubt on other conclusions that necessarily require having some kind of model of how this stuff works]</li> </ul></li> <li>Scraping allegation is that Google has misappropriated content from Yelp and TripAdvisor <ul> <li>Have substantive concerns. Solution proposed in Annex 11</li> <li>To be an antitrust violation, need strong evidence that it increased users on Google at the expense of Yelp or TripAdvisor or decreased incentives to innovate. No strong evidence of either [per above comments, this seems wrong]</li> </ul></li> <li>Recommendation: recommend investigation be closed</li> </ul> <h2 id="1-does-google-possess-monopoly-power-in-the-relevant-antitrust-market">1. Does Google possess monopoly power in the relevant antitrust market?</h2> <ul> <li>To be in violation of Section 2 of the Sherman Act, Google needs to be a monopoly or have substantial market power in a relevant market</li> <li>Online search similar to any other advertising</li> <li>Competition between platforms for advertisers depends on extent to which advertisers consider users on one platform to be substitutes for users on another</li> <li>Google's market power depends on share of internet users</li> <li>If advertisers can access Google's users on other search platforms, such as Yahoo, Bing, and Facebook, &quot;Google's market power is a lot less&quot;</li> <li>Substantial evidence contradicting proposition that Google has substantial market power in search advertising</li> <li>Google's share is large. In Feb 2012, 65% of paid search clicks of top 5 general search engines went through Google, up from 55% in Sep 2008; these figures show Google offers advertisers what they want</li> <li>Advertisers want &quot;eyeballs&quot;</li> <li>Users multi-home. About 80% of users use a platform other than Google in a given month, so advertisers can get the same eyeballs elsewhere <ul> <li>Advertiser can get in front of a user on a different query on Yahoo or another search engine</li> <li>[this is also odd reasoning — if a user uses Google for searches by default, but occasionally stumbles across Yahoo or Bing, this doesn't meaningfully move the needle for an advertiser; the evidence here is comScore saying that 20% of users only use Google, 15% never use Google, and 65% use Google + another search engine; but it's generally accepted that comScore numbers are quite off. Shortly after the report was written, I looked at various companies that reported metrics (Alexa, etc.) 
and found them to be badly wrong; I don't think it would be easy to dig up the exact info I used at the time now, but on searching for &quot;comscore search engine market accuracy&quot;, the first hit I got was someone explaining that while, today, comScore shows that Google has an implausibly low 67% market share, an analysis of traffic to sites this company has access to showed that Google much more plausibly drove 85% of clicks; it seems worth mentioning that comScore is often considered inaccurate]</li> </ul></li> <li>Firm-level spend on search ads and display ads is negatively correlated <ul> <li>[this seems plausible? The evidence in the BC memo for these being complements seemed like a stretch; maybe it's true, but the BE memo's position seems much more plausible]</li> <li>No claim that these are the same market, but can't conclude that they're unrelated</li> </ul></li> <li>Google competes with specialized search engines, similar to a supermarket competing with a convenience store [details on this analogy elided; this memo relies heavily on analogies that relate tech markets to various non-tech markets, some of which were also elided above] <ul> <li>For advertising on a search term like &quot;Nikon 5100&quot;, Amazon may provide a differentiated but competing product</li> </ul></li> <li>Google is leading seller of search, but this is mitigated by large proportion of users who also use other search engines, by substitution of display and search advertising, by competition in vertical search</li> </ul> <h2 id="theory-1-the-preferencing-theory">Theory 1: The preferencing theory</h2> <h3 id="2-1-overview">2.1 Overview</h3> <ul> <li>Preferencing theory is that Google's blending of content such as shopping comparison results and local business listings with customary blue links disadvantages competing content sites, such as Nextag, eBay, Yelp, and TripAdvisor</li> </ul> <h3 id="2-2-analysis">2.2 Analysis</h3> <ul> <li>Blends have two effects: negatively impacting traffic to specialized vertical sites by pushing those sites down and changing Google's incentives to show competing vertical sites</li> <li>Empirical questions <ul> <li>&quot;To what extent does Google account for the traffic to vertical sites?&quot;</li> <li>&quot;To what extent do blends impact the likelihood of clicks to vertical sites?&quot;</li> <li>&quot;To what extent do blends improve consumer value from the search results?&quot;</li> </ul></li> </ul> <h3 id="2-3-empirical-evidence">2.3 Empirical evidence</h3> <ul> <li>Google search responsible for 10% of traffic to shopping comparison sites, 17.5% to local business search sites. &quot;See Annex 4 for a complete discussion of our platform model&quot; <ul> <li>[Annex 4 doesn't appear to be included; but, as discussed above, the authors' model of how traffic works seems to be wrong]</li> </ul></li> <li>When blends appear, from Google's internal data, clicks to other shopping comparison sites drop by a large and statistically significant amount. For example, if a site had a pre-blend CTR of 9%, post-blend CTR would be 5.3%, but a blend isn't always presented</li> <li>For local, pre-blend CTR of 6% would be reduced to 5.4%; local blends have smaller impact than shopping</li> <li>&quot;above result for shopping comparison sites is not the same as finding that overall traffic from Google to shopping sites declined due to universal search. 
As we describe below, if blends represent a quality improvement, this will increase demand and drive greater query volume on Google, which will boost traffic to all sites.&quot;</li> <li>All links are substitutes, so we can infer that if a user clicks on ads less, they prefer the content and the user is getting more value. Overall results indicate that blends significantly increase consumer value <ul> <li>[this seems obviously wrong unless the blend is presented with the same visual impact, weight, and position as normal results, which isn't the case at all — I don't disagree that the blend is probably better for consumers, but this methodology seems like a classic misuse of data to prove a point]</li> </ul></li> </ul> <h3 id="2-4-documentary-evidence">2.4 Documentary evidence</h3> <ul> <li>Since the 90s, general search engines have incorporated vertical blends</li> <li>All major search engines use blends</li> </ul> <h3 id="2-5-summary-of-the-preferencing-theory">2.5 Summary of the preferencing theory</h3> <ul> <li>Google not significant enough source of traffic to foreclose its vertical rivals [as discussed above, the model for this statement is wrong]</li> </ul> <h2 id="theory-2-exclusionary-practices-in-search-distribution">Theory 2: Exclusionary practices in search distribution</h2> <h3 id="3-1-overview">3.1 Overview</h3> <ul> <li>Theory is that Google is engaging in exclusionary practices in order to deprive Microsoft of economies of scale</li> <li>Foundational issues <ul> <li>Are Google's distribution agreements substantially impairing opportunity of rivals to compete for users?</li> <li>What's the empirical evidence users are being excluded and denied?</li> <li>What's the evidence that Microsoft is at a disadvantage in terms of scale?</li> </ul></li> </ul> <h3 id="3-2-are-the-various-google-distribution-agreements-in-fact-exclusionary">3.2 Are the various Google distribution agreements in fact exclusionary?</h3> <ul> <li>&quot;Exclusionary agreements merit scrutiny when they materially reduce consumer choice and substantially impair the opportunities of rivals&quot;</li> <li>On desktop, users can access search engine directly, via web browser search box, or a search toolbar</li> <li>73% of desktop search through direct navigation, all search engines have equal access to consumers in terms of direct access; &quot;Consequently, Google has no ability to impair the opportunities of rivals in the most important and efficient desktop distribution channel.&quot; <ul> <li>[once again, this model seems wrong — if it wasn't wrong, companies wouldn't pay so much to become a search default, including shady stuff like <a href="https://twitter.com/danluu/status/887724695558205440">Google paying shady badware installers to make Chrome / Google default on people's desktops</a>. Another model is that if a user uses a search engine because it's a default, this changes the probability that they'll use the search engine via &quot;direct access&quot;; compared to the BE staff model, it's overwhelmingly likely that this model is correct and the BE staff model is wrong]</li> <li>Microsoft is search default on Internet Explorer and 70% of PCs sold</li> </ul></li> <li>For syndication agreements, Google has a base template that contains premium placement provision. This is to achieve minimum level of remuneration in return for Google making its search available. 
Additionally, the clause is often subject to negotiation and can be modified <ul> <li>[this negotiation thing is technically correct, but doesn't address the statement about this brought up in the BC memo; many, perhaps most, of the points in this memo have been refuted by the BC memo, and the strategy here seems to be to ignore the refutations without addressing them]</li> <li>&quot;By placing its entire site or suite of sites up for bid, publishers are able to bargain more effectively with search engines. This intensifies the ex ante competition for the contract and lowers publishers' costs. Consequently, eliminating the ability to negotiate a bundled discount, or exclusivity, based on site-wide coverage will result in higher prices to publishers.&quot; [this seems to contradict what we observe in practice?]</li> <li>&quot;This suggests that to the extent Google is depriving rivals such as Microsoft of scale economies, this is a result of 'competition on the merits'— much the same way as if Google had caused Microsoft to lose traffic because it developed a better product and offered it at a lower price.&quot;</li> </ul></li> <li>Have Google's premium placement requirements effectively denied Microsoft access to publishers? <ul> <li>Can approach this by considering market share. Google 44%, including AOL and Ask. MS 31%, including Yahoo. Yahoo 25%. Combined, Yahoo and MS are at 56%. &quot;Thus, combined, Microsoft and Yahoo's syndication shares are higher than their combined shares in a general search engine market&quot; [as noted previously, these stats didn't seem correct at the time and have gotten predictably less directionally correct over time]</li> </ul></li> <li>What would MS's volume be without Google's exclusionary restrictions <ul> <li>At most a 5% change because Google's product is so superior [this seems to ignore the primary component of this complaint, which is that there's a positive feedback cycle]</li> </ul></li> <li>Search syndication agreements <ul> <li>Final major distribution channel is mobile search</li> <li>U.S. marketshare: Android 47%, iOS 30%, RIM 16%, MS 5%</li> <li>Android and iOS grew from 30% to 77% from December 2009 to December 2011, primarily due to decline of RIM, MS, and Palm</li> <li>Mobile search is 8%. Thus, &quot;small percentage of overall queries and an even smaller percentage of search ad revenues&quot; <ul> <li>[The implication here appears to be that mobile is small and unimportant, which was obviously untrue at the time to any informed observer — I was at Google shortly after this was written and the change was made to go &quot;mobile first&quot; on basically everything because it was understood that mobile was the future; this involved a number of product changes that significantly degraded the experience on desktop in order to make the mobile experience better; this was generally considered not only a good decision, but the only remotely reasonable decision. Google was not alone in making this shift at the time. How economists studying this market didn't understand this after interviewing folks at Google and other tech companies is mysterious]</li> </ul></li> <li>Switching cost on mobile implied to be very low, &quot;a few taps&quot; [as noted previously, the staggering amount of money spent on being a mobile default and Google's commit linked above indicate this is not true]</li> <li>Even if switching costs were significant, there's no remedy here. 
&quot;Too many choices lead to consumer confusion&quot;</li> <li>Repeat of point that barrier to switching is low because it's &quot;a few taps&quot;</li> <li>&quot;Google does not require Google to be the default search engine in order to license the Android OS&quot; [seems technically correct, but misleading at best when taken as part of the broader argument here]</li> <li>OEMs choose Google search as default for market-based reasons and not because their choice is restricted [this doesn't address the commit linked above that actually prevents users from switching the default away from Google; I wonder what the rebuttal to that would be, perhaps also that user choice is bad and confusing to users?]</li> </ul></li> <li>Opportunities available to Microsoft are larger than indicated by marketshare</li> <li>Summary <ul> <li>Marketshare could change quickly; two years ago, Apple and Google only had 30% share</li> <li>Default of Google search not anticompetitive and mobile a small volume of queries, &quot;although this is changing rapidly&quot;</li> <li>Basically no barrier to user switching, &quot;a few taps and downloading other search apps can be achieved in a few seconds. These are trivial switching costs&quot; [as noted above, this is obviously incorrect to anyone who understands mobile, especially the part about downloading an app not being a barrier; I continue to find it interesting that the economists used market-based reasoning when it supports the idea that the market is perfectly competitive, with no switching costs, etc., but decline to use market-based reasoning, such as noting the staggeringly high sums paid to set default search, when it supports the idea that the market is not a perfectly competitive market with no switching costs, etc.]</li> </ul></li> </ul> <h3 id="3-3-are-rival-search-engines-being-excluded-from-the-market">3.3 Are rival search engines being excluded from the market?</h3> <ul> <li>Prior section found that Google's distribution agreements don't impair opportunity of rivals to reach users. But could it have happened? We'll look at market shares and growth trends to determine this</li> <li>&quot;We note that the evidence of Microsoft and Yahoo's share and growth cannot, even in theory, tell us whether Google's conduct has had a significant impact. Nonetheless, if we find that rival shares have grown or not diminished, this fact can be informative. 
Additionally, assuming that Microsoft would have grown dramatically in the counterfactual, despite the fact that Google itself is improving its product, requires a level of proof that must move beyond speculation.&quot; [as an extension of the above, the economists are happy to speculate or even 'move beyond speculation' when it comes to applying speculative reasoning on user switching costs, but apparently not when it comes to inferences that can be made about marketshare; why the drastic difference in the standard of proof?]</li> <li>Microsoft and Yahoo's share shows no sign of being excluded, steady 30% for 4 years [as noted in a previous section, the writing was on the wall for Bing and Yahoo at this time, but apparently this would &quot;move beyond speculation&quot; and is not noted here]</li> <li>Since announcement of MS / Yahoo alliance, MS query volume has grown faster than Google [this is based on comScore qSearch data and the more detailed quoted claim is that MS query volume increased 134% while Google volume increased 54%; as noted above, this seems like an inaccurate metric, so it's not clear why this would be used to support this point, and it's also misleading at best]</li> <li>MS-Yahoo have the same number of search engine users as Google in a given month [again, as noted above, this appears to come from incorrect data and is also misleading at best because it counts a single use in a month as equivalent to using something many times a day]</li> </ul> <h3 id="3-4-does-microsoft-have-sufficient-scale-to-be-competitive">3.4 Does Microsoft have sufficient scale to be competitive?</h3> <ul> <li>In a meeting with Susan Athey, Microsoft could not demonstrate that they had data definitively showing how the cost curve changes as click data changes, &quot;thus, there is basis for suggesting Microsoft is below some threshold point&quot; [the use of the phrase &quot;threshold point&quot; demonstrates either a use of sleight of hand or a lack of understanding of how it works; the BE memo seems to prefer the idea that it's about some threshold since this could be supported by the argument that, if such a threshold were to be demonstrated, Microsoft's growth would have or will carry it past the threshold, but it doesn't make any sense that there would be a threshold; also, even if this were important, having a single meeting where Microsoft wasn't able to answer this immediately would be weak evidence]</li> <li>[many more incorrect comments in the same vein as the above omitted for brevity]</li> <li>&quot;Finally, Microsoft's public statements are not consistent with statements made to antitrust regulators. Microsoft CEO Steve Ballmer stated in a press release announcing the search agreement with Yahoo: 'This agreement with Yahoo! will provide the scale we need to deliver even more rapid advances in relevancy and usefulness. Microsoft and Yahoo! know there's so much more that search could be.
This agreement gives us the scale and resources to create the future of search.'&quot; <ul> <li>[it's quite bizarre to use a press release, which is generally understood to be a meaningless puff piece, as evidence that a strongly supported claim isn't true; again, BE staff seem to be extremely selective about what evidence they look at to a degree that is striking; for example, from conversations I had with credible senior engineers who worked on search at both Google and Bing, engineers who understand the domain would agree that having more search volume and more data is a major advantage; instead of using evidence like that, BE staff find a press release that, in the tradition of press releases, has some meaningless and incorrect bragging, and bring that in as evidence; why would they do this?]</li> </ul></li> <li>[more examples of above incorrect reasoning, omitted for brevity]</li> </ul> <h3 id="3-5-theory-based-on-raising-rivals-costs">3.5 Theory based on raising rivals' costs</h3> <ul> <li>Despite the above, it could be that distribution agreements deny rivals enough data that &quot;feedback effects&quot; are triggered</li> <li>Possible feedback effects <ul> <li>Scale effect: cost per unit of quality or ad matching decreases</li> <li>Indirect network effect: more advertisers increases number of users</li> <li>Congestion effect</li> <li>Cash flow effect</li> </ul></li> <li>Scale effect was determined to not be applicable [as noted there, the argument for this is completely wrong]</li> <li>Indirect network effect has weak evidence, evidence exists that it doesn't apply, and even if it did apply, low click-through rate of ads shows that most consumers don't like ads anyway [what? This doesn't seem relevant?], and also, having a greater number of advertisers leads to congestion and reduction in the value of the platform to advertisers [this is a reach; there is a sense in which this is technically true, but we could see then and now that platforms with few advertisers are extremely undesirable to advertisers because advertisers generally don't want to advertise on a platform that's full of low quality ads (and this also impacts the desire of users to use the platform)]</li> <li>Cash flow effect not relevant because Microsoft isn't cash flow constrained, so cost isn't relevant [a funny comment to make because, not too long after this, Microsoft severely cut back investment in Bing because the returns weren't deemed to be worth it; it seems odd for economists to argue that, if you have a lot of money, the cost of things doesn't matter and ROI is irrelevant. Shouldn't they think about marginal cost and marginal revenue?]</li> </ul> <p>[I stopped taking detailed notes at this point because taking notes that are legible to other people (as opposed to just for myself) takes about an order of magnitude longer, and I didn't think that there was much of interest here. I generally find comments of the form &quot;I stopped reading at X&quot; to be quite poor, in that people making such comments generally seem to pick some trivial thing that's unimportant and then declare an entire document to be worthless based on that. This pattern is also common when it comes to engineers, institutions, sports players, etc. and I generally find it counterproductive in those cases as well. However, in this case, there isn't really a single, non-representative issue. The majority of the reasoning seems not just wrong, but highly disconnected from the on-the-ground situation.
More notes indicating that the authors are making further misleading or incorrect arguments in the same style don't seem very useful. I did read the rest of the document and I also continue to summarize a few bits, below. I don't want to call them &quot;highlights&quot; because that would imply that I pulled out particularly interesting or compelling or incorrect bits and it's more of a smattering of miscellaneous parts with no particular theme]</p> <ul> <li>There's a claim that removing restrictions on API interoperability may not cause short-term problems, but may cause long-term harm due to how this shifts incentives and reduces innovation and this needs to be accounted for, not just the short-term benefit [in form, this is analogous to the argument Tyler Cowen recently made that banning non-competes reduces the incentives for firms to innovate and will reduce innovation]</li> <li>The authors seem to like to refer to advertisements and PR that any reasonable engineer (and I would guess reasonable person) would know are not meant to be factual or accurate. Similar to the PR argument above, they argue that advertising for Microsoft adCenter claims that it's easy to import data from AdWords and that, therefore, the data portability issue is incorrect, and they specifically say that these advertising statements are &quot;more credible than&quot; other evidence <ul> <li>They also relied on some kind of SEO blogspam that restates the above as further evidence of this</li> </ul></li> <li>The authors do not believe that Google Search and Google Local are complements or that taking data from Yelp or TripAdvisor and displaying it above search results has any negative impact on Yelp or TripAdvisor, or at least that &quot;the burden of proof would be extremely difficult&quot;</li> </ul> <h1 id="other-memos">Other memos</h1> <p>[for these, I continued writing high-level summaries, not detailed summaries]</p> <ul> <li>After the BE memo, there's a memo from Laura M. Sullivan, Division of Advertising Practices, which makes a fairly narrow case in a few dimensions, including &quot;we continue to believe that Google has not deceived consumers by integrating its own specialized search results into its organic results&quot; and, as a result, they suggest not pursuing further action. <ul> <li>There are some recommendations, such as &quot;based on what we have observed of these new paid search results [referring to Local Search, etc.], we believe Google can strengthen the prominence and clarity of its disclosure&quot; [in practice, the opposite has happened!]</li> <li>[overall, the specific points presented here seem like ones a reasonable person could agree with, though whether or not these points are strong enough that they should prevent anti-trust action could be debated]</li> <li>&quot;Updating the 2002 Search Engine Letter is Warranted&quot; <ul> <li>&quot;The concerns we have regarding Google's disclosure of paid search results also apply to other search engines. Studies since the 2002 Search Engine letter was issued indicate that the standard methods search engines, including Google, Bing, and Yahoo!, have used to disclose their paid results may not be noticeable or clear enough for consumers. For example, many consumers do not recognize the top ads as paid results ... Documents also indicate Google itself believed that many consumers generally do not recognize top ads as paid.
For example, in June 2010, a leading team member of Google's in-house research group, commenting on general search research over time, stated: 'I don't think the research is inconclusive at all - there's definitely a (large) group of users who don't distinguish between sponsored and organic results. If we ask these users why they think the top results are sometimes displayed with a different background color, they will come up with an explanation that can range from &quot;because they are more relevant&quot; to &quot;I have no idea&quot; to &quot;because Google is sponsoring them.&quot;' [this could've seemed reasonable at the time, but in retrospect we can see that the opposite of this has happened and ads are less distinguishable from search results than they were in 2012, likely meaning that even fewer consumers can distinguish ads from search results]</li> </ul></li> <li>On the topic of whether or not Google should be liable for <a href="diseconomies-scale/">fraudulent ads</a> such as ones for fake weight-loss products or fake mortgage relief services, &quot;there is no indication so far that Google has played any role in developing or creating the search ads we are investigating&quot; and Google is expending some effort to prevent these ads and Google can claim CDA immunity, so further investigation here isn't worthwhile</li> </ul></li> <li>There's another memo from the same author on whether or not using other consumer data in conjunction with its search advertising business is unfair; the case is generally that this is not unfair and consumers should expect that their data is used to improve search queries</li> <li>There's a memo from Ken Heyer (at the time, a Director of the Agency's Bureau of Economics) <ul> <li>Suggests having a remedy that seems &quot;quite likely to do more good than harm&quot; before &quot;even considering serious filing a Complaint&quot;</li> <li>Seems to generally be in agreement with BE memo <ul> <li>On distribution, agrees with economist memo on unimportance of mobile and that Microsoft has good distribution on desktop (due to IE being default on 70% of PCs sold)</li> <li>On API restrictions, mixed opinion</li> <li>On mobile, mostly agrees with BE memo, but suggests getting an idea of how much Google pays for the right to be the default &quot;since if default status is not much of an advantage we would not expect to see large payments being made&quot; and also suggests it would be interesting to know how much switching from the default occurs <ul> <li>Further notes that mobile is only 8% of the market, too small to be significant [8% appears to have been factually incorrect at the time. By late 2012, when this was written, mobile should've been 20% or more of queries; not sure why the economists are so wrong on so many of the numbers]</li> </ul></li> </ul></li> <li>On vertical sites, agreement with data analysis from BE memo and generally agrees with BE memo</li> </ul></li> <li>Another Ken Heyer memo <ul> <li>More strongly recommends that no action be taken than the previous memo; recommends against a consent decree as well as litigation</li> </ul></li> <li>Follow-up memo from BC staff (Barbara R. Blank et al.), recommending that staff negotiate a consent order with Google on mobile <ul> <li>Google has exclusive agreements with the 4 major U.S.
wireless carriers and Apple to pre-install Google Search; the Apple agreement requires exclusivity <ul> <li>Google default on 86% of devices</li> </ul></li> <li>BC Staff recommends consent agreement to eliminate these exclusive agreements</li> <li>According to Google documents mobile was 9.5% of Google queries in 2010, 17.3% in 2011 [note that this strongly contradicts the claim from the BE memo that mobile is only 8% of the market here] <ul> <li>Rapid growth shows that mobile distribution channel is significant, and both Microsoft and Google internal documents recognize that mobile will likely surpass desktop in the near future</li> </ul></li> <li>In contradiction to their claims, the Sprint and T-mobile agreements appear to mandate exclusivity, the AT&amp;T agreement is de facto exclusive due to a tiered revenue sharing arrangement, and the Verizon agreement is exclusive</li> <li>Google business development manager Chris Barton: &quot;So we know with 100% certainty due to contractual terms that: All Android phones on T-Mobile will come with Google as the only search engine out-of-the-box. All Android phones on Verizon will come with Google as the only search engine out-of-the-box. All Android phones on Sprint will come with Google as the only search engine out-of-the-box. I think this approach is really important otherwise Bing or Yahoo can come and steal away our Android search distribution at any time, thus removing the value of entering into contracts with them. Our philosophy is that we are paying revenue share&quot;</li> <li>Andy Rubin laid out a plan to reduce the revenue share of partners as Google gained search dominance, and Google has done this over time</li> <li>Carriers would not switch even without exclusive agreement due to better monetization and/or bad PR</li> <li>When wrapping up the Verizon deal, Andy Rubin said &quot;[i]f we can pull this off ... we will own the US market&quot;</li> </ul></li> <li>Memo from Willard K. Tom, General Counsel <ul> <li>&quot;In sum, this <i>may</i> be a good case. But it would be a novel one, and as in all such cases, the Commission should think through carefully what it means.&quot;</li> </ul></li> <li>Memo from Howard Shelanski, Director of the Bureau of Economics <ul> <li>Mostly supports the BE memo and the memo from Ken Heyer, except on scraping, where there's support for the BC memo</li> </ul></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:K"><p>By analogy to a case that many people in tech are familiar with, consider this exchange between Oracle counsel <a href="https://en.wikipedia.org/wiki/David_Boies">David Boies</a> and <a href="https://en.wikipedia.org/wiki/William_Alsup">Judge William Alsup</a> on the <code>rangeCheck</code> function, which checks whether a range is a valid array access given the length of an array and throws an exception if the access is out of range:</p> <ul> <li><b>Boies</b>: [argument that Google copied the rangeCheck function in order to accelerate development]</li> <li><b>Alsup</b>: All right. I have — I was not good — I couldn't have told you the first thing about Java before this trial. But, I have done and still do a lot of programming myself in other languages. I have written blocks of code like rangeCheck a hundred times or more. I could do it. You could do it. It is so simple. The idea that somebody copied that in order to get to market faster, when it would be just as fast to write it out, it was an accident that that thing got in there.
There was no way that you could say that that was speeding them along to the marketplace. That is not a good argument.</li> <li><b>Boies</b>: Your Honor</li> <li><b>Alsup</b>: [cutting off Boies] You're one of the best lawyers in America. How can you even make that argument? You know, maybe the answer is because you are so good it sounds legit. But it is not legit. That is not a good argument.</li> <li><b>Boies</b>: Your Honor, let me approach it this way, first, okay. I want to come back to rangeCheck. All right.</li> <li><b>Alsup</b>: RangeCheck. All it does is it makes sure that the numbers you're inputting are within a range. And if they're not, they give it some kind of exceptional treatment. It is so — that witness, when he said a high school student would do this, is absolutely right.</li> <li><b>Boies</b>: He didn't say a high school student would do it in an hour, all right.</li> <li><b>Alsup</b>: Less than — in five minutes, Mr. Boies.</li> </ul> <p>Boies previously brought up this function as a non-trivial piece of work and then argued that, in their haste, a Google engineer copied this function from Oracle. As Alsup points out, the function is trivial, so trivial that it <abbr title="perhaps one might copy it if one were bulk copying a large amount of code, but copying this single function to make haste is implausible">wouldn't be worth looking up to copy</abbr> and that even a high school student could easily produce the function from scratch. Boies then objects that, sure, maybe a high school student could write the function, but it might take an hour or more and Alsup correctly responds that an hour is implausible and that it might take five minutes.</p> <p>Although nearly anyone who could pass a high school programming class would find Boies's argument not just wrong but absurd<sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">3</a></sup>, more like a joke than something that someone might say seriously, it seems reasonable for Boies to make the argument because people presiding over these decisions in court, in regulatory agencies, and in the legislature sometimes demonstrate a lack of basic understanding of tech. Since my background is in tech and not law or economics, I have no doubt that this analysis will miss some basics about law and economics in the same way that most analyses I've read seem to miss basics about tech, but since there's been extensive commentary on this case from people with strong law and economics backgrounds, I don't see a need to cover those issues in depth here because anyone who's interested can read another analysis instead of or in addition to this one.</p> <a class="footnote-return" href="#fnref:K"><sup>[return]</sup></a></li> <li id="fn:B"><p>Although this document is focused on tech, the lack of hands-on industry expertise in regulatory bodies, legislatures, and the courts appears to cause problems in other industries as well. An example that's relatively well known due to <a href="https://archive.is/sN1s7#selection-1795.691-1795.695">a NY Times article</a> that was turned into a movie is DuPont's involvement in the popularization of PFAS and, in particular, PFOA. Scientists at 3M and DuPont had evidence of the harms of PFAS going back at least to the 60s, and possibly even as far back as the 50s.
Given the severe harms that PFOA caused to people who were exposed to it in significant concentrations, it would've been difficult to set up a production process for PFOA without seeing the harm it caused, but this knowledge, which must've been apparent to senior scientists and decision makers in 3M and DuPont, wasn't understood by regulatory agencies for almost four decades after it was apparent to chemical companies.</p> <p>By the way, the NY Times article is titled &quot;The Lawyer Who Became DuPont's Worst Nightmare&quot; and it describes how DuPont made $1B/yr in profit for years while hiding the harms of PFOA, which was used in the manufacturing process for Teflon. This lawyer brought cases against DuPont that were settled for hundreds of millions of dollars; according to the article and movie, the litigation didn't even cost DuPont a single year's worth of PFOA profit. Also, DuPont managed to drag out the litigation for many years, continuing to reap the profit from PFOA. Now that enough evidence has mounted against PFOA, Teflon is manufactured using PFO2OA or FRD-903, which are newer and have a less well understood safety profile than PFOA. Perhaps the article could be titled &quot;The Lawyer Who Became DuPont's Largest Mild Annoyance&quot;.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:J"><p>In the media, I've sometimes seen this framed as a conflict between tech vs. non-tech folks, but we can see analogous comments from people outside of tech. For example, in a panel discussion with Yale <abbr title="business school">SOM</abbr> professor Fiona Scott Morton and DoJ Antitrust Principal Deputy <a href="https://en.wikipedia.org/wiki/United_States_Assistant_Attorney_General">AAG</a> Doha Mekki, Scott Morton noted that the judge presiding over the Sprint/T-mobile merger proceedings, a case she was an expert witness for, had comically wrong misunderstandings about the market, and that it's common for decisions to be made which are disconnected from &quot;market realities&quot;. Mekki seconded this sentiment, saying &quot;what's so fascinating about some of the bad opinions that Fiona identified, and there are many, there's AT&amp;T Time Warner, Sabre Farelogix, T-mobile Sprint, they're everywhere, there's Amex, you know ...&quot;</p> <p>If you're seeing this or the other footnote in mouseover text and/or tied to a broken link, this is an issue with Hugo. At this point, <a href="https://x.com/danluu/status/1604922967074697216">I've spent more than an entire blog post's worth of effort working around Hugo breakage</a> and am trying to avoid spending more time working around issues in a tool that makes breaking changes at a high rate. If you have a suggestion to fix this, I'll try it, otherwise I'll try to fix it when I switch away from Hugo.</p> <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> </ol> </div> How web bloat impacts users with slow devices slow-device/ Sat, 16 Mar 2024 00:00:00 +0000 slow-device/ <p><meta property="og:image" content="/slow-device-performance.png"/></p> <p>In 2017, <a href="web-bloat/">we looked at how web bloat affects users with slow connections</a>. Even in the U.S., <a href="https://twitter.com/danluu/status/1116565029791260672">many users didn't have broadband speeds</a>, making much of the web difficult to use. It's still the case that many users don't have broadband speeds, both inside and outside of the U.S.
and that much of the modern web isn't usable for people with slow internet, but the exponential increase in bandwidth (Nielsen suggests <abbr title="Unfortunately, I don't know of a public source for low-end data, say 10%-ile or 1%-ile; let me know if you have numbers on this">this is 50% per year for high-end connections</abbr>) has outpaced web bloat for typical sites, making this less of a problem than it was in 2017, although it's still a serious problem for people with poor connections.</p> <p>CPU performance for web apps hasn't scaled nearly as quickly as bandwidth so, while more of the web is becoming accessible to people with low-end connections, more of the web is becoming inaccessible to people with low-end devices even if they have high-end connections. For example, if I try browsing a &quot;modern&quot; Discourse-powered forum on a <code>Tecno Spark 8C</code>, it sometimes crashes the browser. Between crashes, on measuring the performance, the responsiveness is significantly worse than browsing a BBS with an <code>8 MHz 286</code> and a <code>1200 baud</code> modem. On my <code>1Gbps</code> home internet connection, the <code>2.6 MB</code> compressed payload size &quot;necessary&quot; to load message titles is relatively light. The over-the-wire payload size has &quot;only&quot; increased by <code>1000x</code>, which is dwarfed by the increase in internet speeds. But the opposite is true when it comes to CPU speeds — for web browsing and forum loading performance, the <code>8-core (2 1.6 GHz Cortex-A75 / 6 1.6 GHz Cortex-A55)</code> CPU can't handle Discourse. The CPU is something like <code>100000x</code> faster than our <code>286</code>. Perhaps a <code>1000000x</code> faster device would be sufficient.</p> <p>For anyone not familiar with the <code>Tecno Spark 8C</code>: a quick search indicates that, today, a new <code>Tecno Spark 8C</code> can be had for <code>USD 50-60</code> in Nigeria and perhaps <code>USD 100-110</code> in India. As <abbr title="The estimates for Nigerian median income that I looked at seem good enough, but the Indian estimate I found was a bit iffier; if you have a good source for Indian income distribution, please pass it along.">a fraction of median household income</abbr>, that's substantially more than a current generation iPhone in the U.S. today.</p> <p>By worldwide standards, the <code>Tecno Spark 8C</code> isn't even close to being a low-end device, so we'll also look at performance on an <code>Itel P32</code>, which is a lower end device (though still far from the lowest-end device people are using today). Additionally, we'll look at performance with an <code>M3 Max Macbook (14-core)</code>, an <code>M1 Pro Macbook (8-core)</code>, and the <code>M3 Max</code> set to <code>10x</code> throttling in Chrome dev tools. In order to give these devices every advantage, we'll be on fairly high-speed internet (1Gbps, with a WiFi router that's benchmarked as having lower latency under load than most of its peers). We'll look at some blogging platforms and micro-blogging platforms (this blog, Substack, Medium, Ghost, Hugo, Tumblr, Mastodon, Twitter, Threads, Bluesky, Patreon), forum platforms (Discourse, Reddit, Quora, vBulletin, XenForo, phpBB, and myBB), and platforms commonly used by small businesses (Wix, Squarespace, Shopify, and WordPress again).</p> <p>In the table below, every row represents a website and every non-label column is a metric.
After the website name column, we have the compressed size transferred over the wire (<code>wire</code>) and the raw, uncompressed size (<code>raw</code>). Then we have, for each device, Largest Contentful Paint* (<code>LCP*</code>) and CPU usage on the main thread (<code>CPU</code>). Google's docs explain <code>LCP</code> as</p> <blockquote> <p>Largest Contentful Paint (LCP) measures when a user perceives that the largest content of a page is visible. The metric value for LCP represents the time duration between the user initiating the page load and the page rendering its primary content</p> </blockquote> <p><code>LCP</code> is a common optimization target because it's presented as one of the primary metrics in Google PageSpeed Insights, a &quot;Core Web Vital&quot; metric. There's an asterisk next to <code>LCP</code> as used in this document because <code>LCP</code> as measured by Chrome is about painting a large fraction of the screen, as opposed to the definition above, which is about content. As sites have optimized for <code>LCP</code>, it's not uncommon to have a large paint (update) that's completely useless to the user, with the actual content of the page appearing well after the <code>LCP</code>. In cases where that happens, I've used the timestamp when useful content appears, not the <code>LCP</code> as defined by when a large but useless update occurs. The full details of the tests and why these metrics were chosen are discussed in an appendix.</p> <p>Although CPU time isn't a &quot;Core Web Vital&quot;, it's presented here because it's a simple metric that's highly correlated with my and other users' perception of usability on slow devices. See appendix for more detailed discussion on this. One reason CPU time works as a metric is that, if a page has great numbers for all other metrics but uses a ton of CPU time, the page is not going to be usable on a slow device. If it takes 100% CPU for 30 seconds, the page will be completely unusable for 30 seconds, and if it takes 50% CPU for 60 seconds, the page will be barely usable for 60 seconds, etc. Another reason it works is that, relative to commonly used metrics, it's hard to cheat on CPU time and make optimizations that significantly move the number without impacting user experience.</p>
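<p>To make the <code>CPU</code> column concrete, here's a rough sketch of how one could pull main-thread CPU time for a page load using Puppeteer and Chrome's CPU throttling. This is for illustration only, not the harness used for the numbers below (the actual test setup is discussed in the appendix); the URL and throttling factor here are placeholder values.</p> <pre><code>// Rough sketch: load a page with CPU throttling and report how much
// main-thread time Chrome spent on it. TaskDuration is total main-thread
// task time; ScriptDuration and LayoutDuration are the JS and layout parts.
import puppeteer from 'puppeteer';

async function measure(url: string, throttleFactor: number) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // e.g. a factor of 10 very roughly approximates the M3/10 column
  await page.emulateCPUThrottling(throttleFactor);
  await page.goto(url, { waitUntil: 'networkidle0', timeout: 120000 });
  const metrics = await page.metrics();
  console.log(url, {
    mainThreadSeconds: metrics.TaskDuration,
    scriptSeconds: metrics.ScriptDuration,
    layoutSeconds: metrics.LayoutDuration,
  });
  await browser.close();
}

measure('https://example.com/', 10).catch(console.error);</code></pre> <p>The color scheme in the table below is that, for sizes, more green = smaller / faster and more red = larger / slower.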
Extreme values are in black.</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}td.l {text-align:left;}</style> <table> <tr> <th rowspan="2">Site</th><th colspan="2">Size</th><th colspan="2">M3 Max</th><th colspan="2">M1 Pro</th><th colspan="2">M3/10</th><th colspan="2">Tecno S8C</th><th colspan="2">Itel P32</th></tr> <tr> <th>wire</th><th>raw</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th></tr> <tr> <td class="l">danluu.com</td><td bgcolor=#00441b><font color=white>6kB</font></td><td bgcolor=#00441b><font color=white>18kB</font></td><td bgcolor=#00441b><font color=white>50ms</font></td><td bgcolor=#00441b><font color=white>20ms</font></td><td bgcolor=#00441b><font color=white>50ms</font></td><td bgcolor=#00441b><font color=white>30ms</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td></tr> <tr> <td class="l">HN</td><td bgcolor=#00441b><font color=white>11kB</font></td><td bgcolor=#00441b><font color=white>50kB</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>30ms</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>30ms</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td></tr> <tr> <td class="l">MyBB</td><td bgcolor=#00441b><font color=white>0.1MB</font></td><td bgcolor=#00441b><font color=white>0.3MB</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td><td bgcolor=#238b45><font color=black>0.8s</font></td><td bgcolor=#238b45><font color=black>0.8s</font></td><td bgcolor=#e5f5e0><font color=black>2.1s</font></td><td bgcolor=#c7e9c0><font color=black>1.9s</font></td></tr> <tr> <td class="l">phpBB</td><td bgcolor=#006d2c><font color=black>0.4MB</font></td><td bgcolor=#006d2c><font color=black>0.9MB</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#a1d99b><font color=black>1.7s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=fee8de><font color=black>4.1s</font></td><td bgcolor=feeae1><font color=black>3.9s</font></td></tr> <tr> <td class="l">WordPress</td><td bgcolor=#a1d99b><font color=black>1.4MB</font></td><td bgcolor=#238b45><font color=black>1.7MB</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>60ms</font></td><td 
bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>80ms</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>1s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#f7fcf5><font color=black>2.5s</font></td></tr> <tr> <td class="l">WordPress (old)</td><td bgcolor=#006d2c><font color=black>0.3MB</font></td><td bgcolor=#006d2c><font color=black>1.0MB</font></td><td bgcolor=#00441b><font color=white>80ms</font></td><td bgcolor=#00441b><font color=white>70ms</font></td><td bgcolor=#00441b><font color=white>90ms</font></td><td bgcolor=#00441b><font color=white>90ms</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#a1d99b><font color=black>1.7s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#c7e9c0><font color=black>1.9s</font></td></tr> <tr> <td class="l">XenForo</td><td bgcolor=#006d2c><font color=black>0.3MB</font></td><td bgcolor=#006d2c><font color=black>1.0MB</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#74c476><font color=black>1.4s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#c7e9c0><font color=black>1.8s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Ghost</td><td bgcolor=#238b45><font color=black>0.7MB</font></td><td bgcolor=#41ab5d><font color=black>2.4MB</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#e5f5e0><font color=black>2.2s</font></td><td bgcolor=#41ab5d><font color=black>1s</font></td><td bgcolor=#f7fcf5><font color=black>2.4s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=feede4><font color=black>3.5s</font></td></tr> <tr> <td class="l">vBulletin</td><td bgcolor=#74c476><font color=black>1.2MB</font></td><td bgcolor=#a1d99b><font color=black>3.4MB</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=fef2ec><font color=black>2.9s</font></td><td bgcolor=fee5da><font color=black>4.4s</font></td><td bgcolor=#fee0d2><font color=black>4.8s</font></td><td bgcolor=f44c38><font color=black>13s</font></td><td bgcolor=#cb181d><font color=black>16s</font></td></tr> <tr> <td class="l">Squarespace</td><td bgcolor=#e5f5e0><font color=black>1.9MB</font></td><td bgcolor=#fee0d2><font color=black>7.1MB</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td 
bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=feede4><font color=black>3.6s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=fedccc><font color=black>5.1s</font></td><td bgcolor=d01c1e><font color=black>16s</font></td><td bgcolor=a50f15><font color=black>19s</font></td></tr> <tr> <td class="l">Mastodon</td><td bgcolor=fca68a><font color=black>3.8MB</font></td><td bgcolor=#f7fcf5><font color=black>5.3MB</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#c7e9c0><font color=black>1.8s</font></td><td bgcolor=fee2d6><font color=black>4.7s</font></td><td bgcolor=#c7e9c0><font color=black>2.0s</font></td><td bgcolor=fcb096><font color=black>7.6s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Tumblr</td><td bgcolor=#fcbba1><font color=black>3.5MB</font></td><td bgcolor=#fee0d2><font color=black>7.1MB</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=fcc0a8><font color=black>7.0s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=fcab90><font color=black>7.9s</font></td><td bgcolor=fca184><font color=black>8.7s</font></td><td bgcolor=fca184><font color=black>8.7s</font></td></tr> <tr> <td class="l">Quora</td><td bgcolor=#238b45><font color=black>0.6MB</font></td><td bgcolor=#f7fcf5><font color=black>4.9MB</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#238b45><font color=black>0.8s</font></td><td bgcolor=#74c476><font color=black>1.3s</font></td><td bgcolor=#fff5f0><font color=black>2.6s</font></td><td bgcolor=fca184><font color=black>8.7s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=a50f15><font color=black>19s</font></td><td bgcolor=880812><font color=white>29s</font></td></tr> <tr> <td class="l">Bluesky</td><td bgcolor=f5523b><font color=black>4.8MB</font></td><td bgcolor=fc7e5e><font color=black>10MB</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=fedccc><font color=black>5.1s</font></td><td bgcolor=fdceba><font color=black>6.0s</font></td><td bgcolor=fcab90><font color=black>8.1s</font></td><td bgcolor=fca68a><font color=black>8.3s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Wix</td><td bgcolor=a50f15><font color=black>7.0MB</font></td><td bgcolor=#71030E><font color=black>21MB</font></td><td bgcolor=#f7fcf5><font color=black>2.4s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#f7fcf5><font color=black>2.5s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=aa1016><font color=black>18s</font></td><td bgcolor=fc7454><font color=black>11s</font></td><td bgcolor=fed7c6><font color=black>5.6s</font></td><td 
bgcolor=fc8868><font color=black>10s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Substack</td><td bgcolor=#74c476><font color=black>1.3MB</font></td><td bgcolor=#e5f5e0><font color=black>4.3MB</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#fee0d2><font color=black>4.9s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Threads</td><td bgcolor=#71030E><font color=black>9.3MB</font></td><td bgcolor=#cb181d><font color=black>13MB</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#a1d99b><font color=black>1.6s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=fedccc><font color=black>5.1s</font></td><td bgcolor=fdceba><font color=black>6.1s</font></td><td bgcolor=fcc9b4><font color=black>6.4s</font></td><td bgcolor=#cb181d><font color=black>16s</font></td><td bgcolor=8e0a12><font color=white>28s</font></td><td bgcolor=black><font color=white>66s</font></td></tr> <tr> <td class="l">Twitter</td><td bgcolor=#fb6a4a><font color=black>4.7MB</font></td><td bgcolor=f5523b><font color=black>11MB</font></td><td bgcolor=#fff5f0><font color=black>2.6s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#fff5f0><font color=black>2.7s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=fed7c6><font color=black>5.6s</font></td><td bgcolor=fcc4ae><font color=black>6.6s</font></td><td bgcolor=fa6446><font color=black>12s</font></td><td bgcolor=a50f15><font color=black>19s</font></td><td bgcolor=a00e14><font color=black>24s</font></td><td bgcolor=black><font color=white>43s</font></td></tr> <tr> <td class="l">Shopify</td><td bgcolor=#fee0d2><font color=black>3.0MB</font></td><td bgcolor=#fff5f0><font color=black>5.5MB</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#f7fcf5><font color=black>2.3s</font></td><td bgcolor=fc8868><font color=black>10s</font></td><td bgcolor=970c14><font color=white>26s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Discourse</td><td bgcolor=#fff5f0><font color=black>2.6MB</font></td><td bgcolor=fc7e5e><font color=black>10MB</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td><td bgcolor=fcc4ae><font color=black>6.5s</font></td><td bgcolor=fed2c0><font color=black>5.9s</font></td><td bgcolor=dd2a24><font color=black>15s</font></td><td bgcolor=970c14><font color=white>26s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> 
<td class="l">Patreon</td><td bgcolor=#fc9272><font color=black>4.0MB</font></td><td bgcolor=#cb181d><font color=black>13MB</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=#a1d99b><font color=black>1.7s</font></td><td bgcolor=black><font color=white>31s</font></td><td bgcolor=fc9778><font color=black>9.1s</font></td><td bgcolor=black><font color=white>45s</font></td></tr> <tr> <td class="l">Medium</td><td bgcolor=#74c476><font color=black>1.2MB</font></td><td bgcolor=#a1d99b><font color=black>3.3MB</font></td><td bgcolor=#74c476><font color=black>1.4s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#74c476><font color=black>1.4s</font></td><td bgcolor=#41ab5d><font color=black>1s</font></td><td bgcolor=#c7e9c0><font color=black>2s</font></td><td bgcolor=fc7454><font color=black>11s</font></td><td bgcolor=#fff5f0><font color=black>2.8s</font></td><td bgcolor=black><font color=white>33s</font></td><td bgcolor=fef0e8><font color=black>3.2s</font></td><td bgcolor=black><font color=white>63s</font></td></tr> <tr> <td class="l">Reddit</td><td bgcolor=#c7e9c0><font color=black>1.7MB</font></td><td bgcolor=#f7fcf5><font color=black>5.4MB</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=fdceba><font color=black>6.2s</font></td><td bgcolor=fa6446><font color=black>12s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=black><font color=white>∞</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> </table></p> <p>At a first glance, the table seems about right, in that the sites that feel slow unless you have a super fast device show up as slow in the table (as in, <code>max(LCP*,CPU))</code> is high on lower-end devices). When I polled folks about what platforms they thought would be fastest and slowest on our slow devices (<a href="https://mastodon.social/@danluu/111994437263038931">Mastodon</a>, <a href="https://twitter.com/danluu/status/1761875263359537652">Twitter</a>, <a href="https://www.threads.net/@danluu.danluu/post/C3yVpfKS-RP">Threads</a>), they generally correctly predicted that Wordpress and Ghost would be faster than Substack and Medium, and that Discourse would be much slower than old PHP forums like phpBB, XenForo, and vBulletin. I also pulled Google PageSpeed Insights (PSI) scores for pages (not shown) and the correlation isn't as strong with those numbers <abbr title="For the 'real world' numbers, this is also because users with slow devices can't really use some of these sites, so their devices aren't counted in the distribution and PSI doesn't normalize for this.">because</abbr> a handful of sites have managed to optimize their PSI scores without actually speeding up their pages for users.</p> <p>If you've never used a low-end device like this, the general experience is that many sites are unusable on the device and loading anything resource intensive (an app or a huge website) can cause crashes. Doing something too intense in a resource intensive app can also cause crashes. 
While <a href="https://www.youtube.com/watch?v=U1JMRFQWK70">reviews note</a> that <a href="https://www.youtube.com/watch?v=McawfNlydqk">you can run PUBG and other 3D games with decent performance</a> on a <code>Tecno Spark 8C</code>, this doesn't mean that the device is fast enough to read posts on modern text-centric social media platforms or modern text-centric web forums. While <code>40fps</code> is achievable in PUBG, we can easily see less than <code>0.4fps</code> when scrolling on these sites.</p> <p>We can see from the table how many of the sites are unusable if you have a slow device. All of the pages with <code>10s+ CPU</code> are a fairly bad experience even after the page loads. Scrolling is very jerky, frequently dropping to a few frames per second and sometimes well below. When we tap on any link, the delay is so long that we can't be sure if our tap actually worked. If we tap again, we can get the dreaded situation where the first tap registers, which then causes the second tap to do the wrong thing, but if we wait, we often end up waiting too long because the original tap didn't actually register (or it registered, but not where we thought it did). Although MyBB doesn't serve up a mobile site and is penalized by Google for not having a mobile friendly page, it's actually much more usable on these slow mobiles than all but the fastest sites because scrolling and tapping actually work.</p> <p>Another thing we can see is how much variance there is in the relative performance on different devices. For example, comparing an <code>M3/10</code> and a <code>Tecno Spark 8C</code>, for danluu.com and Ghost, an <code>M3/10</code> gives a halfway decent approximation of the <code>Tecno Spark 8C</code> (although danluu.com loads much too quickly), but the <code>Tecno Spark 8C</code> is about three times slower (<code>CPU</code>) for Medium, Substack, and Twitter, roughly four times slower for Reddit and Discourse, and over an order of magnitude slower for Shopify. For Wix, the <code>CPU</code> approximation is about accurate, but our <code>Tecno Spark 8C</code> is more than 3 times faster on <code>LCP*</code>. It's great that Chrome lets you simulate a slower device from the convenience of your computer, but just enabling Chrome's CPU throttling (or using any combination of out-of-the-box options that are available) gives fairly different results than we get on many real devices. The full reasons for this are beyond the scope of the post; for the purposes of this post, it's sufficient to note that slow pages are often super-linearly slow as devices get slower and that slowness on one page doesn't strongly predict slowness on another page.</p> <p>If we take a site-centric view instead of a device-centric view, another way to look at it is that sites like Discourse, Medium, and Reddit don't use all that much CPU on our fast <code>M3</code> and <code>M1</code> computers, but they're among the slowest on our <code>Tecno Spark 8C</code> (Reddit's CPU is shown as <code>∞</code> because, no matter how long we wait with no interaction, Reddit uses <code>~90% CPU</code>). Discourse also sometimes crashed the browser after interacting a bit or just waiting a while. For example, one time, the browser crashed after loading Discourse, scrolling twice, and then leaving the device still for a minute or two.
For consistency's sake, this wasn't marked as <code>FAIL</code> in the table since the page did load but, realistically, having a page so resource intensive that the browser crashes is a significantly worse user experience than any of the <code>FAIL</code> cases in the table. When we looked at how <a href="web-bloat/">web bloat impacts users with slow connections</a>, we found that <abbr title="One thing to keep in mind here is that having a slow device and a slow connection have multiplicative impacts.">much of the web was unusable for people with slow connections, and the situation for slow devices is no different</abbr>.</p> <p>Another pattern we can see is how the older sites are, in general, faster than the newer ones, with sites that (visually) look like they haven't been updated in a decade or two tending to be among the fastest. For example, MyBB, the least modernized and oldest looking forum, is <code>3.6x / 5x faster (LCP* / CPU)</code> than Discourse on the <code>M3</code>, but on the <code>Tecno Spark 8C</code>, the difference is <code>19x / 33x</code> and, given the overall scaling, it seems safe to guess that the difference would be even larger on the Itel P32 if Discourse worked on such a cheap device.</p> <p>Another example is Wordpress (old) vs. newer, trendier blogging platforms like Medium and Substack. Wordpress (old) is <code>17.5x / 10x faster (LCP* / CPU)</code> than Medium and <code>5x / 7x faster (LCP* / CPU)</code> than Substack on our <code>M3 Max</code>, and <code>4x / 19x</code> and <code>20x / 8x</code> faster, respectively, on our <code>Tecno Spark 8C</code>. Ghost is a notable exception to this, being a modern platform (launched a year after Medium) that's competitive with older platforms (modern Wordpress is also arguably an exception, but many folks would probably still consider that to be an old platform). Among forums, NodeBB also seems to be a bit of an exception (see appendix for details).</p> <p>Sites that use modern techniques like partially loading the page and then dynamically loading the rest of it, such as Discourse, Reddit, and Substack, tend to be less usable than the scores in the table indicate. In principle, you could build such a site in a simple way that works well with cheap devices but, in practice, sites that use dynamic loading tend to be complex enough that they're extremely janky on low-end devices. It's generally difficult or impossible to scroll a predictable distance, which means that users will sometimes accidentally trigger more loading by scrolling too far, causing the page to lock up. Many pages actually remove the parts of the page you scrolled past as you scroll; all such pages are essentially unusable. Other basic web features, like page search, also generally stop working. Pages with this kind of dynamic loading can't rely on the simple and fast ctrl/command+F search and have to build their own search.</p>
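<p>To make the &quot;removing parts of the page as you scroll&quot; behavior concrete, here's a rough sketch of the list virtualization pattern many of these sites use (this is illustrative TypeScript, not code from any of the sites tested): only the rows near the viewport exist in the DOM at any moment, so the browser's built-in find-in-page has nothing to search and predictable scrolling becomes hard.</p> <pre><code>// Illustrative virtualized list: only rows near the viewport are in the DOM.
const rowHeight = 80;                 // assume fixed-height rows
const items: string[] = [];           // imagine thousands of posts here

function render(container: HTMLElement) {
  const first = Math.floor(container.scrollTop / rowHeight);
  const visibleCount = Math.ceil(container.clientHeight / rowHeight) + 1;
  container.innerHTML = '';           // rows outside the window are simply gone
  for (const [offset, text] of items.slice(first, first + visibleCount).entries()) {
    const row = document.createElement('div');
    row.style.position = 'absolute';
    row.style.top = ((first + offset) * rowHeight) + 'px';
    row.textContent = text;
    container.appendChild(row);
  }
  // (a real implementation also sets the container height so the scrollbar
  // reflects all items, not just the rendered window)
}

// typically wired up so that every scroll event re-renders the window:
// container.addEventListener('scroll', function () { render(container); });</code></pre>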
<p>How well this custom search works varies (this used to work quite well in Google docs, but for the past few months or maybe a year, it takes so long to load that I have to deliberately wait after opening a doc to avoid triggering the browser's useless built-in search; Discourse search has never really worked on slow devices or even on devices that are not very fast but not particularly slow).</p> <p>In principle, these modern pages that burn a ton of CPU when loading could be doing pre-work that means that later interactions on the page are faster and cheaper than on the pages that do less up-front work (this is a common argument in favor of these kinds of pages), but that's not the case for the pages tested, which are slower to load initially, slower on subsequent loads, and slower after they've loaded.</p> <p>To understand why, in practice, doing all this work up-front doesn't generally result in a faster experience later, this exchange between a distinguished engineer at Google and one of the founders of Discourse (and CEO at the time) is <abbr title="the founder has made similar comments elsewhere as well, so this isn't a one-off analogy for him, nor do I find it to be an unusual line of thinking in general">illustrative</abbr>, in <a href="jeff-atwood-trashes-qualcomm-engineering.png">a discussion where the founder of Discourse says that you should test mobile sites on laptops with throttled bandwidth but not throttled CPU</a>:</p> <ul> <li><b>Google</b>: *you* also don't have slow 3G. These two settings go together. Empathy needs to extend beyond iPhone XS users in a tunnel.</li> <li><b>Discourse</b>: Literally any phone of vintage iPhone 6 or greater is basically as fast as the &quot;average&quot; laptop. You have to understand how brutally bad Qualcomm is at their job. Look it up if you don't believe me.</li> <li><b>Google</b>: I don't need to believe you. I know. This is well known by people who care. My point was that just like not everyone has a fast connection not everyone has a fast phone. Certainly the iPhone 6 is frequently very CPU bound on real world websites. But that isn't the point.</li> <li><b>Discourse</b>: we've been trending towards infinite CPU speed for decades now (and we've been asymptotically there for ~5 years on desktop), what we are not and will never trend towards is infinite bandwidth. Optimize for the things that matter. and I have zero empathy for @qualcomm. Fuck Qualcomm, they're terrible at their jobs. I hope they go out of business and the ground their company existed on is plowed with salt so nothing can ever grow there again.</li> <li><b>Google</b>: Mobile devices are not at all bandwidth constraint in most circumstances. They are latency constraint. Even the latest iPhone is CPU constraint before it is bandwidth constraint. If you do well on 4x slow down on a MBP things are pretty alright</li> <li>...</li> <li><b>Google</b>: Are 100% of users on iOS?</li> <li><b>Discourse</b>: The influential users who spend money tend to be, I’ll tell you that ...
Pointless to worry about cpu, it is effectively infinite already on iOS, and even with Qualcomm’s incompetence, will be within 4 more years on their embarrassing SoCs as well</li> </ul> <p>When someone asks the founder of Discourse, &quot;just wondering why you hate them&quot;, he responds with a link that cites the Kraken and Octane benchmarks from <a href="https://www.anandtech.com/show/9146/the-samsung-galaxy-s6-and-s6-edge-review/5">this Anandtech review</a>, which have the Qualcomm chip at 74% and 85% of the performance of the then-current Apple chip, respectively.</p> <p>The founder and then-CEO of Discourse considers Qualcomm's mobile performance embarrassing and finds this so offensive that he thinks Qualcomm engineers should all lose their jobs for delivering <abbr title="I think it could be reasonable to cite a lower number, but I'm using the number he cited, not what I would cite">74% to 85% of the performance of Apple</abbr>. Apple has what I consider to be an all-time great performance team. Reasonable people could disagree on that, but one has to at least think of them as a world-class team. So, producing a product with <abbr title="recall that, on a Tecno Spark 8, Discourse is 33 times slower than MyBB, which isn't particularly optimized for performance">74% to 85% of the performance of an all-time-great team is considered an embarrassment worthy of losing your job</abbr>.</p> <p>There are two attitudes on display here which I see in a lot of software folks. First, that CPU speed is infinite and one shouldn't worry about CPU optimization. And second, that gigantic speedups from hardware should be expected and the only reason hardware engineers wouldn't achieve them is due to spectacular incompetence, so the slow software should be blamed on hardware engineers, not software engineers. Donald Knuth expressed a similar sentiment:</p> <blockquote> <p>I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the &quot;Itanium&quot; approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write. Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX ... I know that important applications for parallelism exist—rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years. Even if I knew enough about such methods to write about them in TAOCP, my time would be largely wasted, because soon there would be little reason for anybody to read those parts ... The machine I use today has dual processors.
I get to use them both only when I’m running two independent jobs at the same time; that’s nice, but it happens only a few minutes every week.</p> </blockquote> <p>In the case of Discourse, a hardware engineer is an embarrassment not deserving of a job if they can't hit 90% of the performance of an all-time-great performance team but, as a software engineer, delivering 3% of the performance of a non-highly-optimized application like MyBB is no problem. In Knuth's case, <abbr title="using this term loosely, to include materials scientists, etc., which is consistent with Knuth's comments">hardware engineers</abbr> gave programmers a 100x performance increase every decade for decades with little to no work on the part of programmers. The moment this slowed down and programmers had to adapt to take advantage of new hardware, hardware engineers were &quot;all out of ideas&quot;, but learning a few &quot;new&quot; (1970s and 1980s era) ideas to take advantage of current hardware would be a waste of time. And <a href="https://www.patreon.com/posts/54329188">we've previously discussed Alan Kay's claim that hardware engineers are &quot;unsophisticated&quot; and &quot;uneducated&quot; and aren't doing &quot;real engineering&quot; and how we'd get a 1000x speedup if we listened to Alan Kay's &quot;sophisticated&quot; ideas</a>.</p> <p>It's fairly common for programmers to expect that hardware will solve all their problems, and then, when that doesn't happen, pass the issue onto the user, explaining why the programmer needn't do anything to help the user. A question one might ask is how much performance improvement programmers have given us. There are cases of algorithmic improvements that result in massive speedups but, as we noted above, Discourse, the fastest growing forum software today, seems to have given us an approximately <code>1000000x</code> slowdown in performance.</p> <p>Another common attitude on display above is the idea that users who aren't wealthy don't matter. When asked if 100% of users are on iOS, the founder of Discourse says &quot;The influential users who spend money tend to be, I’ll tell you that&quot;. We see the same attitude all over comments on <a href="https://tonsky.me/blog/js-bloat/">Tonsky's JavaScript Bloat post</a>, with people expressing <a href="cocktail-ideas/">cocktail-party sentiments</a> like &quot;Phone apps are hundreds of megs, why are we obsessing over web apps that are a few megs? Starving children in Africa can download Android apps but not web apps? Come on&quot; and &quot;surely no user of gitlab would be poor enough to have a slow device, let's be serious&quot; (paraphrased for length).</p> <p>But when we look at the size of apps that are downloaded in Africa, we see that people who aren't on high-end devices use apps like Facebook Lite (a couple megs) and commonly use apps that are a single digit to low double digit number of megabytes. There are multiple reasons app makers care about their app size. One is just the total storage available on the phone; if you watch real users install apps, they often have to delete and uninstall things to put a new app on, so the smaller size is both easier to install and has a lower chance of being uninstalled when the user is looking for more space.
Another is that, if you look at data on app size and usage (I don't know of any public data on this; please pass it along if you have something public I can reference), when large apps increase the size and memory usage, they get more crashes, which drives down user retention, growth, and engagement and, conversely, when they optimize their size and memory usage, they get fewer crashes and better user retention, growth, and engagement.</p> <p><a href="https://infrequently.org/2024/01/performance-inequality-gap-2024/">Alex Russell points out that iOS has 7% market share in India (a 1.4B person market) and 6% market share in Latin America (a 600M person market)</a>. Although the founder of Discourse says that these aren't &quot;influential users&quot; who matter, these are still real human beings. Alex further points out that, according to Windows telemetry, which covers the vast majority of desktop users, most laptop/desktop users are on low-end machines which are likely slower than a modern iPhone.</p> <p>On the bit about no programmers having slow devices, I know plenty of people who are using hand-me-down devices that are old and slow. Many of them aren't even really poor; they just don't see why (for example) their kid needs a super fast device, and they don't understand how much of the modern web works poorly on slow devices. After all, the &quot;slow&quot; device can play 3d games and (with the right OS) compile codebases like Linux or Chromium, so why shouldn't the device be able to interact with a site like gitlab?</p> <p>Contrary to the claim from the founder of Discourse that, within years, every Android user will be on some kind of super fast Android device, it's been six years since his comment and it's going to be at least a decade before almost everyone in the world who's using a phone has a high-speed device and this could easily take two decades or more. If you look up marketshare stats for Discourse, it's extremely successful; it appears to be the fastest growing forum software in the world by a large margin. The impact of having the fastest growing forum software in the world created by an organization whose then-leader was willing to state that he doesn't really care about users who aren't &quot;influential users who spend money&quot;, who don't have access to &quot;infinite CPU speed&quot;, is that a lot of forums are now inaccessible to people who don't have enough wealth to buy a device with effectively infinite CPU.</p> <p>If the founder of Discourse were an anomaly, this wouldn't be too much of a problem, but he's just verbalizing the implicit assumptions a lot of programmers have, which is why we see that so many modern websites are unusable if you buy the income-adjusted equivalent of a new, current generation, iPhone in a low-income country.</p> <p><i>Thanks to Yossi Kreinen, Fabian Giesen, John O'Nolan, Joseph Scott, Loren McIntyre, Daniel Filan, @acidshill, Alex Russell, Chris Adams, Tobias Marschner, Matt Stuchlik, @gekitsu@toot.cat, Justin Blank, Andy Kelley, Julian Lam, Matthew Thomas, avarcat, @eamon@social.coop, William Ehlhardt, Philip R. Boulain, and David Turner for comments/corrections/discussion.</i></p> <h3 id="appendix-gaming-lcp">Appendix: gaming LCP</h3> <p>We noted above that we used <code>LCP*</code> and not <code>LCP</code>. This is because <code>LCP</code> basically measures when the largest change happens. 
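</p> <p>As a rough illustration of what the metric sees, the sketch below uses the standard <code>PerformanceObserver</code> API to log the candidate elements the browser reports; the last candidate before user input is what gets reported as <code>LCP</code>. This is just an illustration of how the metric behaves, not the methodology used for the measurements in this post:</p> <pre><code>// Log each largest-contentful-paint candidate the browser reports.
// The final candidate (before user input) is what gets reported as LCP, so a
// large element painted early, e.g. a full-viewport splash screen, ends up as
// the reported LCP element if nothing painted later is larger than it.
new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    console.log(
      'LCP candidate at', Math.round(entry.startTime), 'ms;',
      'area', entry.size, 'px^2;',
      'element', entry.element ? entry.element.tagName : '(removed)'
    );
  }
}).observe({ type: 'largest-contentful-paint', buffered: true });</code></pre> <p>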
Before this metric was widely gamed in ways that don't benefit the user, it was a great metric, but it has become less representative of the actual user experience as more people have gamed it. In the less blatant cases, people do small optimizations that improve <code>LCP</code> but barely improve or don't improve the actual user experience.</p> <p>In the more blatant cases, developers will deliberately flash a very large change on the page as soon as possible, generally a loading screen that has no value to the user (actually negative value because doing this increases the total amount of work done and the total time it takes to load the page) and then they carefully avoid making any change large enough that any later change would get marked as the <code>LCP</code>.</p> <p>For the same reason <a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal">that VW didn't publicly discuss how it was gaming its emissions numbers</a>, developers tend to shy away from discussing this kind of <code>LCP</code> optimization in public. An exception to this is Discourse, where <a href="https://meta.discourse.org/t/introducing-discourse-splash-a-visual-preloader-displayed-while-site-assets-load/232003" rel="nofollow">they publicly announced this kind of <code>LCP</code> optimization, with comments from their devs and the then-CTO (now CEO)</a>, noting that their new &quot;Discourse Splash&quot; feature hugely reduced <code>LCP</code> for sites after they deployed it. And when developers ask why their <code>LCP</code> is high, the standard advice from Discourse developers is to keep elements smaller than the &quot;Discourse Splash&quot;, so that the <code>LCP</code> timestamp is computed from this useless element that's thrown up to optimize <code>LCP</code>, as opposed to having the timestamp be computed from any actual element that's relevant to the user. <a href="https://meta.discourse.org/t/theme-components-and-largest-contentful-paint-lcp/258680" rel="nofollow">Here's a typical, official, comment from Discourse</a>:</p> <blockquote> <p>If your banner is larger than the element we use for the &quot;Introducing Discourse Splash - A visual preloader displayed while site assets load&quot; you gonna have a bad time for LCP.</p> </blockquote> <p>The official response from Discourse is that you should make sure that your content doesn't trigger the <code>LCP</code> measurement so that, instead, their loading animation timestamp is what's used to compute <code>LCP</code>.</p> <p>The sites with the most extreme ratio of <code>LCP</code> of useful content vs.
Chrome's measured <code>LCP</code> were:</p> <ul> <li>Wix <ul> <li><code>M3</code>: <code>6</code></li> <li><code>M1</code>: <code>12</code></li> <li><code>Tecno Spark 8C</code>: <code>3</code></li> <li><code>Itel P32</code>: <code>N/A</code> <code>(FAIL)</code></li> </ul></li> <li>Discourse: <ul> <li><code>M3</code>: <code>10</code></li> <li><code>M1</code>: <code>12</code></li> <li><code>Tecno Spark 8C</code>: <code>4</code></li> <li><code>Itel P32</code>: <code>N/A</code> <code>(FAIL)</code></li> </ul></li> </ul> <p>Although we haven't discussed the gaming of other metrics, it appears that some websites also game other metrics and &quot;optimize&quot; them even when this has no benefit to users.</p> <h3 id="appendix-the-selfish-argument-for-optimizing-sites">Appendix: the selfish argument for optimizing sites</h3> <p>This will depend on the scale of the site as well as its performance, but when I've looked at this data for large companies I've worked for, improving site and app performance is worth a mind boggling amount of money. It's measurable in A/B tests and it's also among the interventions that has, in <abbr title="where you keep a fraction of users on the old arm of the A/B test for a long duration, sometimes a year or more, in order to see the long-term impact of a change">long-term holdbacks</abbr>, a relatively large impact on growth and retention (many interventions test well but don't look as good long term, whereas performance improvements tend to look better long term).</p> <p>Of course you can see this from the direct numbers, but you can also implicitly see this in a lot of ways when looking at the data. One angle is that (just for example), at Twitter, user-observed p99 latency was about <code>60s</code> in India as well as a number of African countries (even excluding relatively wealthy ones like Egypt and South Africa) and also about <code>60s</code> in the United States. Of course, across the entire population, people have faster devices and connections in the United States, but in every country, there are enough users that have slow devices or connections that the limiting factor is really user patience and not the underlying population-level distribution of devices and connections. Even if you don't care about users in Nigeria or India and only care about U.S. ad revenue, improving performance for low-end devices and connections has enough of impact that we could easily see the impact in global as well as U.S. revenue in A/B tests, especially in long-term holdbacks. And you also see the impact among users who have fast devices since a change that improves the latency for a user with a &quot;low-end&quot; device from <code>60s</code> to <code>50s</code> might improve the latency for a user with a high-end device from <code>5s</code> to <code>4.5s</code>, which has an impact on revenue, growth, and retention numbers as well.</p> <p>For <a href="bad-decisions/">a variety of reasons that are beyond the scope of this doc</a>, this kind of boring, quantifiable, growth and revenue driving work has been difficult to get funded at most large companies I've worked for relative to flash product work that ends up showing little to no impact in long-term holdbacks.</p> <h3 id="appendix-designing-for-low-performance-devices">Appendix: designing for low performance devices</h3> <p>When using slow devices or any device with low bandwidth and/or poor connectivity, the best experiences, by far, are generally the ones that load a lot of content at once into a static page. 
If the images have proper width and height attributes and alt text, that's very helpful. Progressive images (as in progressive jpeg) aren't particularly helpful.</p> <p>On a slow device with high bandwidth, any lightweight, static, page works well, and lightweight dynamic pages can work well if designed for performance. Heavy, dynamic, pages are doomed unless the page weight doesn't make the page complex.</p> <p>With low bandwidth and/or poor connectivity, lightweight pages are fine. With heavy pages, the best experience I've had is when I trigger a page load, go do something else, and then come back when it's done (or at least the HTML and CSS are done). I can then open each link I might want to read in a new tab, and then do something else while I wait for those to load.</p> <p>A lot of the optimizations that modern websites do, such as partial loading that causes more loading when you scroll down the page, and the concomitant hijacking of search (because the browser's built in search is useless if the page isn't fully loaded), cause this interaction model to stop working and make pages very painful to interact with.</p> <p>Just for example, a number of people have noted that Substack performs poorly for them because it does partial page loads. <a href="substack.mp4">Here's a video by @acidshill of what it looks like to load a Substack article and then scroll on an iPhone 8</a>, where the post has a fairly fast <code>LCP</code>, but if you want to scroll past the header, you have to wait <code>6s</code> for the next page to load, and then on scrolling again, you have to wait maybe another <code>1s</code> to <code>2s</code>:</p> <p>As an example of the opposite approach, I tried loading some fairly large plain HTML pages, such as <a href="diseconomies-scale/">diseconomies-scale/</a> (<code>0.1 MB wire</code> / <code>0.4 MB raw</code>) and <a href="threads-faq/">threads-faq/</a> (<code>0.4 MB wire</code> / <code>1.1 MB raw</code>) and these were still quite usable for me even on slow devices. <code>1.1 MB</code> seems to be larger than optimal and breaking that into a few different pages would be better on low-end devices, but a single page with <code>1.1 MB</code> of text works much better than most modern sites on a slow device. While you can get into trouble with HTML pages that are so large that browsers can't really handle them, for pages with a normal amount of content, it generally isn't until you have <a href="https://nolanlawson.com/2023/01/17/my-talk-on-css-runtime-performance/">complex CSS payloads</a> or JS that the pages start causing problems for slow devices. Below, we test pages that are relatively simple, some of which have a fair amount of media (<code>14 MB</code> in one case) and find that these pages work ok, as long as they stay simple.</p> <p>Chris Adams has also noted that blind users, using screen readers, often report that dynamic loading makes the experience much worse for them. Like dynamic loading to improve performance, this can be done well, but it's often either done badly or bundled with so much other complexity that the result is worse than a simple page.</p> <p>@Qingcharles noted another accessibility issue — the (prison) parolees he works with are given &quot;lifeline&quot; phones, which are often very low end devices. From a quick search, in 2024, some people will get an iPhone 6 or an iPhone 8, but there are also plenty of devices that are lower end than an Itel P32, let alone a Tecno Spark 8C.
They also get plans with highly limited data, and then when they run out, some people &quot;can't fill out any forms for jobs, welfare, or navigate anywhere with Maps&quot;.</p> <p>For sites that do up-front work and actually give you a decent experience on low end devices, Andy Kelley pointed out an example of a site that does up front work that seems to work ok on a slow device (although it would struggle on a very slow connection), <a href="https://ziglang.org/documentation/master/std/">the Zig standard library documentation</a>:</p> <blockquote> <p>I made the controversial decision to have it fetch all the source code up front and then do all the content rendering locally. In theory, this is CPU intensive but in practice... even those old phones have really fast CPUs!</p> </blockquote> <p>On the <code>Tecno Spark 8C</code>, this uses <code>4.7s</code> of CPU and, afterwards, is fairly responsive (relative to the device — <a href="input-lag/">of course an iPhone responds much more quickly</a>. Taps cause links to load fairly quickly and scrolling also works fine (it's a little jerky, but almost nothing is really smooth on this device). This seems like the kind of thing people are referring to when they say that you can get better performance if you ship a heavy payload, but there aren't many examples of that which actually improve performance on low-end devices.</p> <h3 id="appendix-articles-on-web-performance-issues">Appendix: articles on web performance issues</h3> <ul> <li>2015: Maciej Cegłowski: <a href="https://idlewords.com/talks/website_obesity.htm">The Website Obesity Crisis</a> <ul> <li>Size: <code>1.0 MB</code> / <code>1.1 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.9s</code> / <code>1.4s</code> <ul> <li>Scrolling a bit jerky, images take a little bit of time to appear if scrolling very quickly (jumping halfway down page from top), but delay is below what almost any user would perceive when scrolling a normal distance.</li> </ul></li> </ul></li> <li>2015: Nate Berkopec: <a href="https://www.speedshop.co/2015/11/05/page-weight-doesnt-matter.html">Page Weight Doesn't Matter</a> <ul> <li>Size: <code>80 kB</code> / <code>0.2 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.8s</code> / <code>0.7s</code> <ul> <li>Does lazy loading, page downloads <code>650 kB</code> / <code>1.8 MB</code> if you scroll through the entire page, but scrolling is only a little jerky and the lazy loading doesn't cause delays. Probably the only page I've tried that does lazy loading in a way that makes the experience better and not worse on a slow device; I didn't test on a slow connection, where this would still make the experience worse.</li> </ul></li> <li><code>Itel P32</code>: <code>1.1s</code> / <code>1s</code> <ul> <li>Scrolling basically unusable; scroll extremely jerky and moves a random distance, often takes over <code>1s</code> for text to render when scrolling to new text; can be much worse with images that are lazy loaded. 
Even though this is the best implementation of lazy loading I've seen in the wild, the <code>Itel P32</code> still can't handle it.</li> </ul></li> </ul></li> <li>2017: Dan Luu: <a href="web-bloat/">How web bloat impacts users with slow connections</a> <ul> <li>Size: <code>14 kB</code> / <code>57 kB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.5s</code> / <code>0.3s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> <li><code>Itel P32</code>: <code>0.7s</code> / <code>0.5s</code></li> </ul></li> <li>2017-2024+: Alex Russell: <a href="https://infrequently.org/series/performance-inequality/">The Performance Inequality Gap (series)</a> <ul> <li>Size: <code>82 kB</code> / <code>0.1 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.5s</code> / <code>0.4s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> <li><code>Itel P32</code>: <code>0.7s</code> / <code>0.4s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> </ul></li> <li>2024: Nikita Prokopov (Tonsky): <a href="https://tonsky.me/blog/js-bloat/">JavaScript Bloat in 2024</a> <ul> <li>Size: <code>14 MB</code> / <code>14 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.8s</code> / <code>1.9s</code> <ul> <li>When scrolling, it takes a while for images to show up (500ms or so) and the scrolling isn't smooth, but it's not jerky enough that it's difficult to scroll to the right place.</li> </ul></li> <li><code>Itel P32</code>: <code>2.5s</code> / <code>3s</code> <ul> <li>Scrolling isn't smooth. Scrolling accurately is a bit difficult, but can generally scroll to where you want if very careful. Generally takes a bit more than <code>1s</code> for new content to appear when you scroll a significant distance.</li> </ul></li> </ul></li> <li>2024: Dan Luu: <a href="slow-device/">This post</a> <ul> <li>Size: <code>25 kB</code> / <code>74 kB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.6s</code> / <code>0.5s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> <li><code>Itel P32</code>: <code>1.3s</code> / <code>1.1s</code> <ul> <li>Scrolling and interaction work fine, although I had to make a change for this to be the case — this doc originally had an embedded video, which the <code>Itel P32</code> couldn't really handle. <ul> <li>Note that, while these numbers are worse than the numbers for &quot;Page Weight Doesn't Matter&quot;, this page is usable after load, which that other page isn't because it executes some kind of lazy loading that's too complex for this phone to handle in a reasonable timeframe. <br /></li> </ul></li> </ul></li> </ul></li> </ul> <h3 id="appendix-empathy-for-non-rich-users">Appendix: empathy for non-rich users</h3> <p>Something I've observed over time, as programming has become more prestigious and more lucrative, is <a href="https://mastodon.social/@danluu/109901711437753852">that people have tended to come from wealthier backgrounds</a> and have less exposure to people with different income levels. An example we've discussed before is from a well-known, prestigious, startup with a very left-leaning employee base, where everyone got rich: in a slack discussion about the covid stimulus checks, a well-meaning progressive employee said that it was pointless because people would just use their stimulus checks to buy stock. This person had, apparently, never talked to any middle-class (let alone poor) person about where their money goes or looked at the data on who owns equity.
And that's just looking at American wealth. When we look at world-wide wealth, the general level of understanding is much lower. People seem to really underestimate the dynamic range in wealth and income across the world. From having talked to quite a few people about this, a lot of people seem to have mental buckets for &quot;poor by American standards&quot; (buys stock with stimulus checks) and &quot;poor by worldwide standards&quot; (maybe doesn't even buy stock), but the range of poverty in the world dwarfs the range of poverty in America to an extent that not many wealthy programmers seem to realize.</p> <p>Just for example, <a href="https://mastodon.social/@danluu/109537302116865694">in this discussion how lucky I was (in terms of financial opportunities) that my parents made it to America</a>, someone mentioned that it's not that big a deal because they had great financial opportunities in Poland. For one thing, with respect to the topic of the discussion, the probability that someone will end up with a high-paying programming job (senior staff eng at a high-paying tech company) or equivalent, I suspect that, when I was born, being born poor in the U.S. gives you better odds than being fairly well off in Poland, but I could believe the other case as well if presented with data. But if we're comparing Poland v. U.S. to Vietnam v. U.S., if I spend <abbr title="so, these probably aren't the optimal numbers one would use for a comparison, but I think they're good enough for this purpose">15 seconds looking up rough wealth numbers for these countries</abbr> in the year I was born, the GDP/capita ratio of U.S. : Poland was ~8:1, whereas it was ~50 : 1 for Poland : Vietnam. The difference in wealth between Poland and Vietnam was roughly the square of the difference between the U.S. and Poland, so Poland to Vietnam is roughly equivalent to Poland vs. some hypothetical country that's richer than the U.S. by the amount that the U.S. is richer than Poland. These aren't even remotely comparable, but a lot of people seem to have this mental model that there's &quot;rich countries&quot; and &quot;not rich countries&quot; and &quot;not rich countries&quot; are all roughly in the same bucket. GDP/capita isn't ideal, but it's easier to find than percentile income statistics; the quick search I did also turned up that annual income in Vietnam then was something like $200-$300 a year. Vietnam was also going through the tail end of a famine whose impacts are a bit difficult to determine because statistics here seem to be gamed, but if you believe the mortality rate statistics, the famine caused total overall mortality rate to jump to double the normal baseline<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup>.</p> <p>Of course, at the time, the median person in a low-income country wouldn't have had a computer, let alone internet access. But, today it's fairly common for people in low-income countries to have devices. Many people either don't seem to realize this or don't understand what sorts of devices a lot of these folks use.</p> <h3 id="appendix-comments-from-fabian-giesen">Appendix: comments from Fabian Giesen</h3> <p>On the Discourse founder's comments on iOS vs. Android marketshare, Fabian notes</p> <blockquote> <p>In the US, according to the most recent data I could find (for 2023), iPhones have around 60% marketshare. In the EU, it's around 33%. This has knock-on effects. 
Not only do iOS users skew towards the wealthier end, they also skew towards the US.</p> <p>There's some secondary effects from this too. For example, in the US, iMessage is very popular for group chats etc. and infamous for interoperating very poorly with Android devices in a way that makes the experience for Android users very annoying (almost certainly intentionally so).</p> <p>In the EU, not least because Android is so much more prominent, iMessage is way less popular and anecdotally, even iPhone users among my acquaintances who would probably use iMessage in the US tend to use WhatsApp instead.</p> <p>Point being, globally speaking, recent iOS + fast Internet is even more skewed towards a particular demographic than many app devs in the US seem to be aware.</p> </blockquote> <p>And on the comment about mobile app vs. web app sizes, Fabian said:</p> <blockquote> <p>One more note from experience: apps you install when you install them, and generally have some opportunity to hold off on updates while you're on a slow or metered connection (or just don't have data at all).</p> <p>Back when I originally got my US phone, I had no US credit history and thus had to use prepaid plans. I still do because it's fine for what I actually use my phone for most of the time, but it does mean that when I travel to Germany once a year, I don't get data roaming at all. (Also, phone calls in Germany cost me $1.50 apiece, even though T-Mobile is the biggest mobile provider in Germany - though, of course, not T-Mobile US.)</p> <p>Point being, I do get access to free and fast Wi-Fi at T-Mobile hotspots (e.g. major train stations, airports etc.) and on inter-city trains that have them, but I effectively don't have any data plan when in Germany at all.</p> <p>This is completely fine with mobile phone apps that work offline and sync their data when they have a connection. But web apps are unusable while I'm not near a public Wi-Fi.</p> <p>Likewise I'm fine sending an email over a slow metered connection via the Gmail app, but I for sure wouldn't use any web-mail client that needs to download a few MBs worth of zipped JS to do anything on a metered connection.</p> <p>At least with native app downloads, I can prepare in advance and download them while I'm somewhere with good internet!</p> </blockquote> <p>Another comment from Fabian (this time paraphrased since this was from a conversation), is that people will often justify being quantitatively hugely slower because there's a qualitative reason something should be slow. One example he gave was that screens often take a long time to sync their connection and this is justified because there are operations that have to be done that take time. For a long time, these operations would often take seconds. Recently, a lot of displays sync much more quickly because Nvidia specifies how long this can take for something to be &quot;G-Sync&quot; certified, so display makers actually do this in a reasonable amount of time now. While it's true that there are operations that have to be done that take time, there's no fundamental reason they should take as much time as they often used to. 
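</p> <p>The file-reading example that comes next is the kind of claim where the arithmetic is easy to sanity check. A minimal sketch of that back-of-the-envelope estimate, where the file count and per-syscall cost are assumptions picked for illustration rather than numbers from the case Fabian described:</p> <pre><code>// Rough estimate of how much time syscall overhead can plausibly account for
// when reading thousands of small files. All of the inputs are assumptions.
const files = 10000;             // assumed number of files
const syscallsPerFile = 4;       // e.g. open, fstat, read, close
const secondsPerSyscall = 2e-6;  // assume ~2 microseconds per syscall, a generous figure

const syscallOverheadSeconds = files * syscallsPerFile * secondsPerSyscall;
console.log('estimated syscall overhead:', syscallOverheadSeconds, 'seconds'); // 0.08

// Even with these generous assumptions, syscall overhead is on the order of
// tens of milliseconds, so it can't explain an operation that takes many
// seconds, even though 'syscalls are slow' is qualitatively true.</code></pre> <p>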
Another example he gave was how someone justified how long it took to read thousands of files by saying that the operation required a lot of syscalls and &quot;syscalls are slow&quot;, which is a qualitatively true statement, but if you look at the actual cost of a syscall, in the case under discussion, the cost of a syscall was many orders of magnitude away from being costly enough to be a reasonable explanation for why it took so long to read thousands of files.</p> <p>On this topic, when people point out that a modern website is slow, someone will generally respond with the qualitative defense that the modern website has these great features, which the older website is lacking. And while it's true that (for example) Discourse has features that MyBB doesn't, it's hard to argue that its feature set justifies being <code>33x</code> slower.</p> <h3 id="appendix-experimental-details">Appendix: experimental details</h3> <p>With the exception of danluu.com and, arguably, HN, for each site, I tried to find the &quot;most default&quot; experience. For example, for WordPress, this meant a demo blog with the current default theme, twentytwentyfour. In some cases, this may not be the most likely thing someone uses today, e.g., for Shopify, I looked at the first theme they give you when you browse their themes, but I didn't attempt to find theme data to see what the most commonly used theme is. For this post, I wanted to do all of the data collection and analysis as a short project, something that takes less than a day, so there were a number of shortcuts like this, which will be described below. I don't think it's wrong to use the first-presented Shopify theme since a decent fraction of users will probably use the first-presented theme, but that is, of course, less representative than grabbing whatever the most common theme is and then also testing many different sites that use that theme to see how real-world performance varies when people modify the theme for their own use. If I worked for Shopify or wanted to do competitive analysis on behalf of a competitor, I would do that, but for a one-day project on how large websites impact users on low-end devices, the performance of Shopify demonstrated here seems ok. I actually <a href="https://mastodon.social/@danluu/111994372051118539">did the initial work for this around when I ran these polls</a>, back in February; I just didn't have time to really write this stuff up for a month.</p> <p>For the tests on laptops, I tried to have the laptop at ~60% battery, not plugged in, and the laptop was idle for enough time to return to thermal equilibrium in a room at 20°C, so pages shouldn't be impacted by prior page loads or other prior work that was happening on the machine.</p> <p>For the mobile tests, the phones were at ~100% charge and plugged in, and also previously at 100% charge so the phones didn't have any heating effect you can get from rapidly charging. As noted above, these tests were performed with <code>1Gbps</code> WiFi. No other apps were running, the browser had no other tabs open, and no extra apps were installed on the device, so no additional background tasks should've been running other than whatever users are normally subject to by the device by default. A real user with the same device is going to see worse performance than we measured here in almost every circumstance except if running Chrome Dev Tools on a phone significantly degrades performance.
I noticed that, on the Itel P32, scrolling was somewhat jerkier with Dev Tools running than when running normally but, since this was a one-day project, I didn't attempt to quantify this or determine whether it impacts some sites much more than others. In absolute terms, the overhead can't be all that large because the fastest sites are still fairly fast with Dev Tools running, but if there's some kind of overhead that's super-linear in the amount of work the site does (possibly indirectly, if it causes some kind of resource exhaustion), then that could be a problem in measurements of some sites.</p> <p>Sizes were all measured on mobile, so in cases where different assets are loaded on mobile vs. desktop, we measured the mobile asset sizes. <code>CPU</code> was measured as CPU time on the main thread (I did also record time on other threads for sites that used other threads, but didn't use this number; if <code>CPU</code> were a metric people wanted to game, time on other threads would have to be accounted for to prevent sites from trying to offload as much work as possible to other threads, but this isn't currently an issue and time on main thread is more directly correlated to usability than sum of time across all threads, and the metric that would be robust to gaming is less legible, with no upside for now).</p> <p>For WiFi speeds, speed tests had the following numbers:</p> <ul> <li><code>M3 Max</code> <ul> <li>Netflix (fast.com) <ul> <li>Download: <code>850 Mbps</code></li> <li>Upload: <code>840 Mbps</code></li> <li>Latency (unloaded / loaded): <code>3ms</code> / <code>8ms</code></li> </ul></li> <li>Ookla <ul> <li>Download: <code>900 Mbps</code></li> <li>Upload: <code>840 Mbps</code></li> <li>Latency (unloaded / download / upload): <code>3ms</code> / <code>8ms</code> / <code>13ms</code></li> </ul></li> </ul></li> <li><code>Tecno Spark 8C</code> <ul> <li>Netflix (fast.com) <ul> <li>Download: <code>390 Mbps</code></li> <li>Upload: <code>210 Mbps</code></li> <li>Latency (unloaded / loaded): <code>2ms</code> / <code>30ms</code></li> </ul></li> <li>Ookla <ul> <li>Ookla web app fails, can't see results</li> </ul></li> </ul></li> <li><code>Itel P32</code> <ul> <li>Netflix <ul> <li>Download: <code>44 Mbps</code></li> <li>Upload: test fails to work (sends one chunk of data and then hangs, sending no more data)</li> <li>Latency (unloaded / loaded): <code>4ms</code> / <code>400ms</code></li> </ul></li> <li>Ookla <ul> <li>Download: <code>45 Mbps</code></li> <li>Upload: test fails to work</li> <li>Latency: test fails to display latency</li> </ul></li> </ul></li> </ul> <p>One thing to note is that the <code>Itel P32</code> doesn't really have the ability to use the bandwidth that it nominally has. Looking at the top Google reviews, none of them mention this. <a href="https://www.nairaland.com/4628841/itel-p32-review-great-those" rel="nofollow">The first review reads</a></p> <blockquote> <p>Performance-wise, the phone doesn’t lag. It is powered by the latest Android 8.1 (GO Edition) ... we have 8GB+1GB ROM and RAM, to run on a power horse of 1.3GHz quad-core processor for easy multi-tasking ... I’m impressed with the features on the P32, especially because of the price. I would recommend it for those who are always on the move.
And for those who take battery life in smartphones has their number one priority, then P32 is your best bet.</p> </blockquote> <p><a href="https://techjaja.com/itel-p32-review-dual-camera-smartphone-alarming-price-tag/" rel="nofollow">The second review reads</a></p> <blockquote> <p>Itel mobile is one of the leading Africa distributors ranking 3rd on a continental scale ... the light operating system acted up to our expectations with no sluggish performance on a 1GB RAM device ... fairly fast processing speeds ... the Itel P32 smartphone delivers the best performance beyond its capabilities ... at a whooping UGX 330,000 price tag, the Itel P32 is one of those amazing low-range like smartphones that deserve a mid-range flag for amazing features embedded in a single package.</p> </blockquote> <p><a href="https://pctechmag.com/2018/08/itel-p32-full-review-much-more-than-just-a-budget-entry-level-smartphone/" rel="nofollow">The third review reads</a></p> <blockquote> <p>&quot;Much More Than Just a Budget Entry-Level Smartphone ... Our full review after 2 weeks of usage ... While switching between apps, and browsing through heavy web pages, the performance was optimal. There were few lags when multiple apps were running in the background, while playing games. However, the overall performance is average for maximum phone users, and is best for average users [screenshot of game] Even though the game was skipping some frames, and automatically dropped graphical details it was much faster if no other app was running on the phone.</p> </blockquote> <p>Notes on sites:</p> <ul> <li>Wix <ul> <li>www.wix.com/website-template/view/html/3173?originUrl=https%3A%2F%2Fwww.wix.com%2Fwebsite%2Ftemplates%2Fhtml%2Fmost-popular&amp;tpClick=view_button&amp;esi=a30e7086-28db-4e2e-ba22-9d1ecfbb1250: this was the first entry when I clicked to get a theme</li> <li><code>LCP</code> was misleading on every device</li> <li>On the <code>Tecno Spark 8C</code>, scrolling never really works. It's very jerky and this never settles down</li> <li>On the <code>Itel P32</code>, the page fails non-deterministically (different errors on different loads); it can take quite a while to error out; it was <code>23s</code> on the first run, with the CPU pegged for <code>28s</code></li> </ul></li> <li>Patreon <ul> <li>www.patreon.com/danluu: used my profile where possible</li> <li>Scrolling on Patreon and finding old posts is so painful that I maintain <a href="#pt">my own index of my Patreon posts</a> so that I can find my old posts without having to use Patreon. Although Patreon's numbers in the table don't look that bad in the table when you're on a fast laptop, that's just for the initial load. The performance as you scroll is bad enough that I don't think that, today, there exists a computer and internet connection that browse Patreon with decent performance.</li> </ul></li> <li>Threads <ul> <li>threads.net/danluu.danluu: used my profile where possible</li> <li>On the <code>Itel P32</code>, this technically doesn't load correctly and could be marked as <code>FAIL</code>, but it's close enough that I counted it. 
The thing that's incorrect is that profile photos have a square box around them <ul> <li>However, as with the other heavy pages, interacting with the page doesn't really work and the page is unusable, but this appears to be for the standard performance reasons and not because the page failed to render</li> </ul></li> </ul></li> <li>Twitter <ul> <li>twitter.com/danluu: used my profile where possible</li> </ul></li> <li>Discourse <ul> <li>meta.discourse.org: this is what turned up when I searched for an official forum.</li> <li>As discussed above, the <code>LCP</code> is highly gamed and basically meaningless. We linked to a post where the Discourse folks note that, on slow loads, they put a giant splash screen up at <code>2s</code> to cap the <code>LCP</code> at <code>2s</code>. Also notable is that, on loads that are faster than 2s, the <code>LCP</code> is also highly gamed. For example, on the <code>M3 Max</code> with low-latency <code>1Gbps</code> internet, the <code>LCP</code> was reported as <code>115ms</code>, but the page loads actual content at <code>1.1s</code>. This appears to use the same fundamental trick as &quot;Discourse Splash&quot;, in that it paints a huge change onto the screen and then carefully loads smaller elements to avoid having the actual page content detected as the <code>LCP</code>.</li> <li>On the <code>Tecno Spark 8C</code>, scrolling is unpredictable and can jump too far, triggering loading from infinite scroll, which hangs the page for <code>3s-10s</code>. Also, the entire browser sometimes crashes if you just let the browser sit on this page for a while.</li> <li>On the <code>Itel P32</code>, an error message is displayed after <code>7.5s</code></li> </ul></li> <li>Bluesky <ul> <li>bsky.app/profile/danluu.com</li> <li>Displays a blank screen on the <code>Itel P32</code></li> </ul></li> <li>Squarespace <ul> <li>cedar-fluid-demo.squarespace.com: this was the second theme that showed up when I clicked themes to get a theme; the first was one called &quot;Bogart&quot;, but that was basically a &quot;coming soon&quot; single page screen with no content, so I used the second theme instead of the first one.</li> <li>A lot of errors and warnings in the console with the <code>Itel P32</code>, but the page appears to load and work, although interacting with it is fairly slow and painful</li> <li><code>LCP</code> on the <code>Tecno Spark 8C</code> was significantly before the page content actually loaded</li> </ul></li> <li>Tumblr <ul> <li>www.tumblr.com/slatestarscratchpad: used this because I know this tumblr exists. I don't read a lot of tumblrs (maybe three or four), and this one seemed like the closest thing to my blog that I know of on tumblr.</li> <li>This page fails on the <code>Itel P32</code>, but doesn't <code>FAIL</code>. The console shows that the JavaScript errors out, but the page still works fine (I tried scrolling, clicking links, etc., and these all worked), so you can actually go to the post you want and read it.
The JS error appears to have made this page load much more quickly than it otherwise would have and also made interacting with the page after it loaded fairly zippy.</li> </ul></li> <li>Shopify <ul> <li>themes.shopify.com/themes/motion/styles/classic/preview?surface_detail=listing&amp;surface_inter_position=1&amp;surface_intra_position=1&amp;surface_type=all: this was the first theme that showed up when I looked for themes</li> <li>On the first <code>M3/10</code> run, Chrome dev tools reported a nonsensical <code>697s</code> of CPU time (the run completed in a normal amount of time, well under <code>697s</code> or even <code>697/10s</code>). This run was ignored when computing results.</li> <li>On the <code>Itel P32</code>, the page load never completes and it just shows a flashing cursor-like image, which is deliberately loaded by the theme. On devices that load properly, the flashing cursor image is immediately covered up by another image, but that never happens here.</li> <li>I wondered if it wasn't fair to use this example theme because there's some stuff on the page that lets you switch theme styles, so I checked out actual uses of the theme (the page that advertises the theme lists users of the theme). I tried the first two listed real examples and they were both much slower than this demo page.</li> </ul></li> <li>Reddit <ul> <li>reddit.com</li> <li>Has an unusually low <code>LCP*</code> compared to how long it takes for the page to become usable. Although not measured in this test, I generally find the page slow and sort of unusable on Intel Macbooks which are, by historical standards, extremely fast computers (unless I use old.reddit.com)</li> </ul></li> <li>Mastodon <ul> <li>mastodon.social/@danluu: used my profile where possible</li> <li>Fails to load on <code>Itel P32</code>, just gives you a blank screen. Due to how long things generally take on the <code>Itel P32</code>, it's not obvious for a while if the page is failing or if it's just slow</li> </ul></li> <li>Quora <ul> <li>www.quora.com/Ever-felt-like-giving-up-on-your-dreams-How-did-you-come-out-of-it: I tried googling for quora + the username of a metafilter user who I've heard is now prolific on Quora. Rather than giving their profile page, Google returned this page, which appears to have nothing to do with the user I searched for. So, this isn't comparable to the social media profiles, but getting a random irrelevant Quora result from Google is how I tend to interact with Quora, so I guess this is representative of my Quora usage.</li> </ul></li> <li>Substack <ul> <li>Used thezvi.substack.com because I know Zvi has a substack and writes about similar topics.</li> </ul></li> <li>vBulletin: <ul> <li>forum.vbulletin.com: this is what turned up when I searched for an official forum.</li> </ul></li> <li>Medium <ul> <li>medium.com/swlh: I don't read anything on Medium, so I googled for programming blogs on Medium and this was the top hit. From looking at the theme, it doesn't appear to be unusually heavy or particularly customized for a Medium blog. Since it appears to be widely read and popular, it's more likely to be served from a CDN than some of the other blogs here.</li> <li>On a run that wasn't a benchmark reference run, on the <code>Itel P32</code>, I tried scrolling starting 35s after loading the page.
The delay to scroll was <code>5s-8s</code> and scrolling moved an unpredictable amount, making the page completely unusable. This wasn't marked as a <code>FAIL</code> in the table, but one could argue that this should be a <code>FAIL</code> since the page is unusable.</li> </ul></li> <li>Ghost <ul> <li>source.ghost.io because this is the current default Ghost theme and it was the first example I found</li> </ul></li> <li>Wordpress <ul> <li>2024.wordpress.net because this is the current default wordpress theme and this was the first example of it I found</li> </ul></li> <li>XenForo <ul> <li>xenforo.com/community/: this is what turned up when I searched for an official forum</li> <li>On the <code>Itel P32</code>, the layout is badly wrong and page content overlaps itself. There's no reasonable way to interact with the element you want because of this, and reading the text requires reading text that's been overprinted multiple times.</li> </ul></li> <li>Wordpress (old) <ul> <li>Used thezvi.wordpress.com because it has the same content as Zvi's substack, and happens to be on some old wordpress theme that used to be a very common choice</li> </ul></li> <li>phpBB <ul> <li>www.phpbb.com/community/index.php: this is what turned up when I searched for an official forum.</li> </ul></li> <li>MyBB <ul> <li>community.mybb.com: this is what turned up when I searched for an official forum.</li> <li>Site doesn't serve up a mobile version. In general, I find the desktop version of sites to be significantly better than the mobile version when on a slow device, so this works quite well, although they're likely penalized by Google for this.</li> </ul></li> <li>HN <ul> <li>news.ycombinator.com</li> <li>In principle, HN should be the slowest social media site or link aggregator because it's written in a custom Lisp that isn't highly optimized and the code was originally written with brevity and cleverness in mind, which generally gives you fairly poor performance. However, that's only poor relative to what you'd get if you were writing high-performance code, which is not a relevant point of comparison here.</li> </ul></li> <li>danluu.com <ul> <li>Self explanatory</li> <li>This currently uses a bit less CPU than HN, but I expect this to eventually use more CPU as the main page keeps growing. At the moment, this page has 176 links to 168 articles vs. HN's 199 links to 30 articles but, barring an untimely demise, this page should eventually have more links than HN. <ul> <li>As noted above, I find that pagination for such small pages makes the browsing experience much worse on slow devices or with bad connections, so I don't want to &quot;optimize&quot; this by paginating it or, even worse, doing some kind of dynamic content loading on scroll.</li> </ul></li> </ul></li> <li>Woo Commerce <ul> <li>I originally measured Woo Commerce as well but, unlike the pages and platforms tested above, I didn't find that being fast or slow on the initial load was necessarily representative of subsequent performance of other action, so this wasn't included in the table because having this in the table is sort of asking for a comparison against Shopify. 
In particular, while the &quot;most default&quot; Woo theme I could find was significantly faster than the &quot;most default&quot; Shopify theme on initial load on a slow device, performance was multidimensional enough that it was easy to find realistic scenarios where Shopify was faster than Woo and vice versa on a slow device, which is quite different from what I saw with newer blogging platforms like Substack and Medium compared to older platforms like Wordpress, or a modern forum like Discourse versus the older PHP-based forums. A real comparison of shopping sites that have carts, checkout flows, etc., would require a better understanding of real-world usage of these sites than I was going to get in a single day.</li> </ul></li> <li>NodeBB <ul> <li>community.nodebb.org</li> <li>This wasn't in my original tests and I only tried this out because one of the founders of NodeBB suggested it, saying &quot;I am interested in seeing whether @nodebb@fosstodon.org would fare better in your testing. We spent quite a bit of time over the years on making it wicked fast, and I personally feel it is a better representation of modern forum software than Discourse, at least on speed and initial payload.&quot;</li> <li>I didn't do the full set of tests because I don't keep the <code>Itel P32</code> charged (the battery is in rough shape and discharges quite quickly once unplugged, so I'd have to wait quite a while to get it into a charged state)</li> <li>On the tests I did, it got <code>0.3s/0.4s</code> on the <code>M1</code> and <code>3.4s/7.2s</code> on the <code>Tecno Spark 8C</code>. This is moderately slower than vBulletin and significantly slower than the faster php forums, but much faster than Discourse. If you need a &quot;modern&quot; forum for some reason and want to have your forum be usable by people who aren't, by global standards, rich, this seems like it could work.</li> <li>Another notable thing, given that it's a &quot;modern&quot; site, is that interaction works fine after initial load; you can scroll and tap on things and this all basically works, nothing crashed, etc.</li> <li>Sizes were <code>0.9 MB</code> / <code>2.2 MB</code>, so also fairly light for a &quot;modern&quot; site and possibly usable on a slow connection, although slow connections weren't tested here.</li> </ul></li> </ul> <p>Another kind of testing would be to try to configure pages to look as similar as possible. I'd be interested in seeing that results for that if anyone does it, but that test would be much more time consuming. For one thing, it requires customizing each site. And for another, it requires deciding what sites should look like. If you test something danluu.com-like, every platform that lets you serve up something light straight out of a CDN, like Wordpress and Ghost, should score similarly, with the score being dependent on the CDN and the CDN cache hit rate. Sites like Medium and Substack, which have relatively little customizability would score pretty much as they do here. Realistically, from looking at what sites exist, most users will create sites that are slower than the &quot;most default&quot; themes for Wordpress and Ghost, although it's plausible that readers of this blog would, on average, do the opposite, so you'd probably want to test a variety of different site styles.</p> <h3 id="appendix-this-site-vs-sites-that-don-t-work-on-slow-devices-or-slow-connections">Appendix: this site vs. 
sites that don't work on slow devices or slow connections</h3> <p>Just as an aside, something I've found funny for a long time is that I get quite a bit of hate mail about the styling on this page (and a similar volume of appreciation mail). By hate mail, I don't mean polite suggestions to change things, I mean the equivalent of road rage, but for web browsing; web rage. I know people who run sites that are complex enough that they're unusable by a significant fraction of people in the world. How come people are so incensed about the styling of this site and, proportionally, basically don't care at all that the web is unusable for so many people?</p> <p>Another funny thing here is that the people who appreciate the styling generally appreciate that the site doesn't override any kind of default styling, letting you make the width exactly what you want (by setting your window size how you want it) and it also doesn't override any kind of default styling you apply to sites. The people who are really insistent about this want everyone to have some width limit they prefer, some font they prefer, etc., but it's always framed as if they don't want it for themselves; it's really for the benefit of people at large, even though accommodating the preferences of the web ragers would directly oppose the preferences of people who prefer (just for example) to be able to adjust the text width by adjusting their window width.</p> <p>Until I pointed this out tens of times, this iteration would usually start with web ragers telling me that &quot;studies show&quot; that narrower text width is objectively better, but on reading every study that exists on the topic that I could find, I didn't find this to be the case. Moreover, on asking for citations, it's clear that people saying this generally hadn't read any studies on this at all and would sometimes hastily send me a study that they did not seem to have read. When I'd point this out, people would then change their argument to how studies can't really describe the issue (odd that they'd cite studies in the first place), although one person cited a book to me (which I read and they, apparently, had not since it also didn't support their argument) and then move to how this is what everyone wants, even though that's clearly not the case, both from the comments I've gotten as well as the data I have from when I made the change.</p> <p>Web ragers who have this line of reasoning generally can't seem to absorb the information that their preferences are not universal and will insist that they are regardless of what people say they like, which I find fairly interesting. On the data, when I switched from Octopress styling (at the time, the most popular styling for programming bloggers) to the current styling, I got what appeared to be a causal increase in traffic and engagement, so it appears that not only do people who write me appreciation mail about the styling like the styling, the overall feeling of people who don't write to me appears to be that the site is fine and apparently more appealing than standard programmer blog styling.
When I've noted this, people tend to become further invested in the idea that their preferences are universal and that people who think they have other preferences are wrong and reply with total nonsense.</p> <p>For me, two questions I'm curious about are why do people feel the need to fabricate evidence on this topic (referring to studies when they haven't read any, googling for studies and then linking to one that says the opposite of what they claim it says, presumably because they didn't really read it, etc.) in order to claim that there are &quot;objective&quot; reasons their preferences are universal or correct, and why are people so much more incensed by this than by the global accessibility problems caused by typical web design? On the latter, I suspect if you polled people with an abstract survey, they would rate global accessibility to be a larger problem, but by revealed preference both in terms of what people create as well as what irritates them enough to send hate mail, we can see that having fully-adjustable line width and not capping line width at their preferred length is important to do something about whereas global accessibility is not. As noted above, people who run sites that aren't accessible due to performance problems generally get little to no hate mail about this. And when I used a default Octopress install, I got zero hate mail about this. Fewer people read my site at the time, but my traffic volume hasn't increased by a huge amount since then and the amount of hate mail I get about my site design has gone from zero to a fair amount, <abbr title="To find the plausible range of underlying ratios, we can do a simple Bayesian adjustment here and we still find that the ratio of hate mail has increased by much more than the increase in traffic; maybe one can argue that hate mail for slow sites is spread across all slow sites, so a second adjustment needs to be done here?">an infinitely higher ratio</abbr> than the increase in traffic.</p> <p>To be clear, I certainly wouldn't claim that the design on this site is optimal. <a href="octopress-speedup/">I just removed the CSS from the most popular blogging platform for programmers at the time because that CSS seemed objectively bad for people with low-end connections</a> and, as a side effect, got more traffic and engagement overall, not just from locations where people tend to have lower end connections and devices. No doubt a designer who cares about users on low-end connections and devices could do better, but there's something quite odd about both the untruthfulness and the vitriol of comments on this.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:L"><a href="https://www.jstor.org/stable/3004042">This estimate puts backwards-looking life expectancy in the low 60s</a>; that paper also discusses other estimates in the mid 60s and discusses biases in the estimates. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Diseconomies of scale in fraud, spam, support, and moderation diseconomies-scale/ Sun, 18 Feb 2024 00:00:00 +0000 diseconomies-scale/ <p>If I ask myself a question like &quot;I'd like to buy an SD card; who do I trust to sell me a real SD card and not some fake, Amazon or my local Best Buy?&quot;, of course the answer is that I trust my local Best Buy<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup> more than Amazon, which is notorious for selling counterfeit SD cards.
And if I ask who I trust more, Best Buy or my local reputable electronics shop (Memory Express, B&amp;H Photo, etc.), I trust my local reputable electronics shop more. Not only are they <a href="https://www.pentaxforums.com/forums/22-pentax-camera-field-accessories/53884-counterfeit-sandisk-extreme-iii-sdhc-best-buy.html">less likely to</a> <a href="https://www.dpreview.com/forums/thread/2308411">sell me a counterfeit than Best Buy</a>, but, in the event that they do sell me a counterfeit, the service is likely to be better.</p> <p>Similarly, let's say I ask myself a question like, &quot;on which platform do I get a higher rate of scams, spam, fraudulent content, etc., [smaller platform] or [larger platform]&quot;? Generally the answer is [larger platform]. Of course, there are more total small platforms out there and they're higher variance, so I could deliberately use a smaller platform that's worse, but if I'm choosing good options instead of bad options in every size class, the smaller platform is generally better. For example, with Signal vs. WhatsApp, I've literally never received a spam Signal message, whereas I get spam WhatsApp messages somewhat regularly. Or if I compare places I might read tech content on, if I compare tiny forums no one's heard of to lobste.rs, lobste.rs has a very slightly higher rate (rate as in fraction of messages I see, not absolute message volume) of bad content because it's zero on the private forums and very low but non-zero on lobste.rs. And then if I compare lobste.rs to a somewhat larger platform, like Hacker News or mastodon.social, those have (again very slightly) higher rates of scam/spam/fraudulent content. And then if I compare that to mid-sized social media platforms, like reddit, reddit has a significantly higher and noticeable rate of bad content. And then if I compare reddit to the huge platforms like YouTube, Facebook, <a href="seo-spam/">Google search results</a>, these larger platforms have an even higher rate of scams/spam/fraudulent content. And, as with the SD card example, the odds of getting decent support go down as the platform size goes up as well. In the event of an incorrect suspension or ban from the platform, the odds of an account getting reinstated get worse as the platform gets larger.</p> <p>I don't think it's controversial to say that in general, a lot of things get worse as platforms get bigger. For example, when I ran <a href="https://twitter.com/danluu/status/1570604630350106624">a Twitter poll to see what people I'm loosely connected to think</a>, only 2.6% thought that huge company platforms have the best moderation and spam/fraud filtering. <a href="https://carsey.unh.edu/publication/conspiracy-vs-science-a-survey-of-us-public-beliefs">For reference, in one poll, 9% of Americans said that vaccines implant a microchip and 12% said the moon landing was fake</a>. These are different populations but it seems random Americans are more likely to say that the moon landing was faked than tech people are likely to say that the largest companies have the best anti-fraud/anti-spam/moderation.</p> <p>However, over the past five years, I've noticed an increasingly large number of people make the opposite claim, that only large companies can do decent moderation, spam filtering, fraud (and counterfeit) detection, etc. 
We looked at one example of this <a href="seo-spam/">when we examined search results</a>, where a Google engineer said</p> <blockquote> <p>Somebody tried argue that if the search space were more competitive, with lots of little providers instead of like three big ones, then somehow it would be *more* resistant to ML-based SEO abuse.</p> <p>And... look, if *google* can't currently keep up with it, how will Little Mr. 5% Market Share do it?</p> </blockquote> <p>And a thought leader responded</p> <blockquote> <p>like 95% of the time, when someone claims that some small, independent company can do something hard better than the market leader can, it’s just cope. economies of scale work pretty well!</p> </blockquote> <p><a href="seo-spam/">But when we looked at the actual results, it turned out that, of the search engines we looked at, Mr 0.0001% Market Share was the most resistant to SEO abuse (and fairly good), Mr 0.001% was a bit resistant to SEO abuse, and Google and Bing were just flooded with SEO abuse, frequently funneling people directly to various kinds of scams</a>. Something similar happens with email, where I commonly hear that it's impossible to manage your own email due to the spam burden, <a href="https://mastodon.social/@danluu/111736581992007404">but people do it all the time and often have similar or better results than Gmail</a>, with the main problem being interacting with big company mail servers which incorrectly ban their little email server.</p> <p>I started seeing a lot of comments claiming that you need scale to do moderation, anti-spam, anti-fraud, etc., around the time <a href="https://twitter.com/danluu/status/1179449106877431808">Zuckerberg, in response to Elizabeth Warren calling for the breakup of big tech companies, claimed that breaking up tech companies would make content moderation issues substantially worse, saying</a>:</p> <blockquote> <p>It’s just that breaking up these companies, whether it’s Facebook or Google or Amazon, is not actually going to solve the issues,” Zuckerberg said “And, you know, it doesn’t make election interference less likely. It makes it more likely because now the companies can’t coordinate and work together. It doesn’t make any of the hate speech or issues like that less likely. It makes it more likely because now ... all the processes that we’re putting in place and investing in, now we’re more fragmented</p> <p>It’s why Twitter can’t do as good of a job as we can. I mean, they face, qualitatively, the same types of issues. But they can’t put in the investment. Our investment on safety is bigger than the whole revenue of their company. [laughter] And yeah, we’re operating on a bigger scale, but it’s not like they face qualitatively different questions. They have all the same types of issues that we do.&quot;</p> </blockquote> <p>The argument is that you need a lot of resources to do good moderation and smaller companies, Twitter sized companies (worth ~$30B at the time), can't marshal the necessary resources to do good moderation. I found this statement quite funny at the time because, pre-Twitter acquisition, I saw a much higher rate of obvious scam content on Facebook than on Twitter. <a href="https://twitter.com/danluu/status/1584615878800576512">For example, when I clicked through Facebook ads during holiday shopping season, most were scams</a> and, while Twitter had its share of scam ads, it wasn't really in the same league as Facebook. 
And it's not just me — Arturo Bejar, who designed an early version of Facebook's reporting system and headed up some major trust and safety efforts, noticed something similar (see footnote for details)<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>.</p> <p>Zuckerberg seems to like the line of reasoning mentioned above, though, as he's made similar arguments elsewhere, <a href="https://stratechery.com/2021/an-interview-with-mark-zuckerberg-about-the-metaverse/">such as here</a>, in a statement the same year that Meta's internal docs made the case that they were exposing 100k minors a day to sexual abuse imagery:</p> <blockquote> <p>To some degree when I was getting started in my dorm room, we obviously couldn’t have had 10,000 people or 40,000 people doing content moderation then and the AI capacity at that point just didn’t exist to go proactively find a lot of harmful content. At some point along the way, it started to become possible to do more of that as we became a bigger business</p> </blockquote> <p>The rhetorical sleight of hand here is the assumption that Facebook needed 10k or 40k people doing content moderation when Facebook was getting started in Zuckerberg's dorm room. Services that are larger than dorm-room-Facebook can and do have better moderation than Facebook today with a single moderator, often one who works part time. But as people talk more about pursuing real antitrust action against big tech companies, big tech founders and execs have ramped up the anti-antitrust rhetoric, making claims about all sorts of disasters that will befall humanity if the biggest companies are broken up into companies the size of the biggest tech companies of 2015 or 2010. This kind of reasoning seems to be catching on a bit, as I've seen more and more big company employees repeat very similar arguments. We've come a long way since the 1979 IBM training manual which read</p> <p><font size="+1"><b></p> <blockquote> <p>A COMPUTER CAN NEVER BE HELD ACCOUNTABLE</p> <p>THEREFORE A COMPUTER MUST NEVER MAKE A MANAGEMENT DECISION</p> </blockquote> <p></font></b></p> <p>The argument now is that, for many critical decisions, only computers can make the decisions, and the lack of accountability seems to ultimately be a feature, not a bug.</p> <p>But unfortunately for Zuckerberg's argument<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">3</a></sup>, there are at least three major issues in play here where diseconomies of scale dominate. One is that, given material that nearly everyone can agree is bad (such as bitcoin scams, spam for fake pharmaceutical products, <a href="seo-spam/">fake weather forecasts</a>, or adults sending photos of their genitals to children), large platforms do worse than small ones. The second is that, for the user, errors are much more costly and less fixable as companies get bigger because support generally becomes worse. 
<a href="impossible-agree/">The third is that, as platforms scale up, a larger fraction of users will strongly disagree about what should be allowed on the platform</a>.</p> <p>With respect to the first, while it's true that big companies have more resources, the <a href="cocktail-ideas/">cocktail party idea</a> that they'll have the best moderation because they have the most resources is countered by the equally simplistic idea that they'll have the worst moderation because they're the juiciest targets or that they'll have the worst moderation because they'll have the worst fragmentation due to the standard diseconomies of scale that occur when you scale up organizations and problem domains. Whether the advantage of having more resources or these other factors dominates is too complex to resolve theoretically, but we can observe the result empirically. At least at <a href="https://mastodon.social/@danluu/109919312448235105">the level of resources that big companies choose to devote to moderation, spam, etc.</a>, having the larger target and other problems associated with scale dominate.</p> <p>While it's true that these companies are wildly profitable and could devote enough resources to significantly reduce this problem, they have chosen not to do this. For example, in the year before I wrote this sentence (through December 2023), Meta's profit before tax was $47B. If Meta had a version of the internal vision statement of a power company a friend of mine worked for (&quot;Reliable energy, at low cost, for generations.&quot;) and operated like that power company did, trying to create a good experience for the user instead of maximizing profit plus creating the metaverse, they could've spent the $50B they spent on the metaverse on moderation platforms and technology and then spent <abbr title="if you think this is too low, feel free to adjust this number up and the number of employees down">$30k/yr (which would result in a very good income in most countries where moderators are hired today, allowing them to have their pick of who to hire) on 1.6 million additional full-time staffers</abbr> for things like escalations and support, on the order of one additional moderator or support staffer per few thousand users (and of course diseconomies of scale apply to managing this many people). I'm not saying that Meta or Google should do this, just that whenever someone at a big tech company says <a href="https://news.ycombinator.com/item?id=38614042">something like</a> &quot;these systems have to be fully automated because no one could afford to operate manual systems at our scale&quot;, what's really being said is more along the lines of &quot;we would not be able to generate as many billions a year in profit if we hired enough competent people to manually review cases our system should flag as ambiguous, so we settle for what we can get without compromising profits&quot;.<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup> One can defend that choice, but it is a choice.</p> 
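<p>Since that claim leans on quick arithmetic, here's a minimal back-of-the-envelope sketch of it. The metaverse spend and the moderator pay are the figures from the paragraph above; the user count is a rough assumption on my part, so treat the output as an order-of-magnitude estimate rather than a precise figure:</p> <pre><code># Back-of-the-envelope check of the figures above. The metaverse spend and
# moderator pay are the numbers from this post; the user count is a rough
# assumption (Meta reported roughly 3 billion daily users across its apps
# around this time), so treat the results as order-of-magnitude only.
metaverse_spend = 50e9        # dollars
moderator_pay = 30e3          # dollars per staffer per year
users = 3e9                   # assumed user count

staffers = metaverse_spend / moderator_pay   # about 1.7 million, rounded down to 1.6 million in the post
users_per_staffer = users / staffers         # about 1,800, i.e., on the order of a few thousand
print(round(staffers), round(users_per_staffer))
</code></pre> <p>And likewise for claims about advantages of economies of scale. There are areas where economies of scale legitimately make the experience better for users. 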
For example, when we looked at <a href="nothing-works/">why it's so hard to buy things that work well</a>, we noted that Amazon's economies of scale have enabled them to build out their own package delivery service that is, while flawed, still more reliable than is otherwise available (and this has only improved since they added the ability for users to rate each delivery, which no other major package delivery service has). Similarly, Apple's scale and vertical integration have allowed them to <a href="https://mastodon.social/@danluu/111064942895424216">build one of the all-time great performance teams</a> (as measured by <abbr title="normalizing for both average performance as well as performance variance among competitors">normalized performance</abbr> relative to competitors of the same era), not only wiping the floor with the competition on benchmarks, but also providing a better experience in ways that no one really measured until recently, like <a href="input-lag/">device latency</a>. For a more mundane example of economies of scale, crackers and other foods that ship well are cheaper on Amazon than in my local grocery store. It's easy to name ways in which economies of scale benefit the user, but this doesn't mean that we should assume that economies of scale dominate diseconomies of scale in all areas. Although it's beyond the scope of this post, if we're going to talk about whether or not users are better off if companies are larger or smaller, we should look at what gets better when companies get bigger and what gets worse, not just assume that everything will get better just because some things get better (or vice versa).</p> <p>Coming back to the argument that huge companies have the most resources to spend on moderation, spam, anti-fraud, etc., vs. the reality that they choose to spend those resources elsewhere, like dropping $50B on the Metaverse and not hiring 1.6 million moderators and support staff that they could afford to hire, it makes sense to look at how much effort is being expended. Meta's involvement in Myanmar makes for a nice case study because Erin Kissane wrote up a fairly detailed <a href="https://erinkissane.com/meta-in-myanmar-full-series">40,000 word account of what happened</a>. The entirety of what happened is a large and complicated issue (<a href="#appendix-erin-kissane-on-meta-in-myanmar">see appendix for more discussion</a>) but, for the main topic of this post, the key components are that there was an issue that most people can generally agree should be among the highest priority moderation and support issues and that, despite repeated, extremely severe and urgent warnings to Meta staff at various levels (engineers, directors, VPs, execs, etc.), almost no resources were dedicated to the issue while internal documents indicate that only a small fraction of agreed-upon bad content was caught by their systems (on the order of a few percent). I don't think this is unique to Meta and this matches my experience with other large tech companies, both as a user of their products and as an employee.</p> <p>To pick a smaller scale example, an acquaintance of mine had their Facebook account compromised and it's now being used for bitcoin scams. The person's name is Samantha K. and some scammer is doing enough scamming that they didn't even bother reading her name properly and have been generating very obviously faked photos where someone holds up a sign and explains how &quot;Kamantha&quot; has helped them make tens or hundreds of thousands of dollars. 
This is a fairly common move for &quot;hackers&quot; to make and someone else I'm connected to on FB reported that this happened to their account and they haven't been able to recover the old account or even get it banned despite the constant stream of obvious scams being posted by the account.</p> <p>By comparison, on lobste.rs, I've never seen a scam like this and Peter Bhat Harkins, the head mod, says that they've never had one that he knows of. On Mastodon, I think I might've seen one once in my feed, replies, or mentions. Of course, Mastodon is big enough that you can find some scams if you go looking for them, but the per-message and per-user rates are low enough that you shouldn't encounter them as a normal user. On Twitter (before the acquisition) or reddit, I'd see them moderately frequently, perhaps an average of once every few weeks in my normal feed. On Facebook, I see things like this <a href="https://twitter.com/danluu/status/1584615878800576512">all the time; I get obvious scam consumer goods sites every shopping season, and the bitcoin scams, both from ads as well as account takeovers, are year-round</a>. <a href="https://news.ycombinator.com/item?id=38613594">Many people have noted that they don't bother reporting these kinds of scams anymore because they've observed that Facebook doesn't take action on their reports</a>. <a href="https://lerner.co.il/2023/10/19/im-banned-for-life-from-advertising-on-meta-because-i-teach-python/">Meanwhile, Reuven Lerner was banned from running Facebook ads on their courses about Python and Pandas</a>, seemingly because Facebook systems &quot;thought&quot; that Reuven was advertising <a href="https://news.ycombinator.com/item?id=37941905">something to do with animal trading</a> (as opposed to programming). This is the fidelity of moderation and spam control that Zuckerberg says cannot be matched by any smaller company. By the way, I don't mean to pick on Meta in particular; if you'd like examples with a slightly different flavor, you can <a href="#google">see the appendix of Google examples</a> for a hundred examples of automated systems going awry at Google.</p> <p>A reason this comes back to being an empirical question is that all of this talk about how economies of scale allow huge companies to bring more resources to bear on the problem only matters if the company chooses to deploy those resources. There's no theoretical force that makes companies deploy resources in these areas, so we can't reason theoretically. But we can observe that the resources deployed aren't sufficient to match the problems, even in cases where people would generally agree that the problem should very obviously be high priority, such as with Meta in Myanmar. Of course, when it comes to issues where the priority is less obvious, resources are also not deployed there.</p> <p>On the second issue, support, it's a meme among tech folks that the only way to get support as a user of one of the big platforms is to make a viral social media post or know someone on the inside. This compounds the issue of bad moderation, scam detection, anti-fraud, etc., since those issues could be mitigated if support was good.</p> <p>Normal support channels are a joke, where you either get a generic form letter rejection, or a kafkaesque nightmare followed by a form letter rejection. 
For example, when <a href="https://twitter.com/craig1black/status/1645649300167495681">Adrian Black was banned from YouTube for impersonating Adrian Black</a> (to be clear, he was banned for impersonating himself, not someone else with the same name), after appealing, he got a response that read</p> <blockquote> <p>unfortunately, there's not more we can do on our end. your account suspension &amp; appeal were very carefully reviewed &amp; the decision is final</p> </blockquote> <p><a href="https://twitter.com/danluu/status/1308215389344600066">In another Google support story, Simon Weber got the runaround from Google support when he was trying to get information he needed to pay his taxes</a></p> <blockquote> <p>accounting data exports for extensions have been broken for me (and I think all extension merchants?) since April 2018 [this was written on Sept 2020]. I had to get the NY attorney general to write them a letter before they would actually respond to my support requests so that I could properly file my taxes</p> </blockquote> <p>There was also the time <a href="https://twitter.com/PointCrow/status/1587084876741689345">YouTube kept demonetizing PointCrow's video of eating water with chopsticks (he repeatedly dips chopsticks into water and then drinks the water, very slowly eating a bowl of water)</a>.</p> <p>Despite responding with things like</p> <blockquote> <p>we're so sorry about that mistake &amp; the back and fourth [sic], we've talked to the team to ensure it doesn't happen again</p> </blockquote> <p>He would get demonetized again and appeals would start with the standard support response strategy of saying that they took great care in examining the violation under discussion but, unfortunately, the user clearly violated the policy and therefore nothing can be done:</p> <blockquote> <p>We have reviewed your appeal ... We reviewed your content carefully, and have confirmed that it violates our violent or graphic content policy ... it's our job to make sure that YouTube is a safe place for all</p> </blockquote> <p>These are high-profile examples, but of course having a low profile doesn't stop you from getting banned and getting basically the same canned response, like <a href="https://news.ycombinator.com/item?id=38882891">this HN user who was banned for selling a vacuum in FB marketplace</a>. After a number of appeals, he was told</p> <blockquote> <p>Unfortunately, your account cannot be reinstated due to violating community guidelines. The review is final</p> </blockquote> <p>When paid support is optional, people often say you won't have these problems if you pay for support, but <a href="https://www.youtube.com/watch?v=SMFCLpUuhqY">people who use Google One paid support or Facebook and Instagram's paid creator support generally report that the paid support is no better than the free support</a>. Products that effectively have paid support built-in aren't necessarily better, either. I know people who've gotten the same kind of runaround you get from free Google support with Google Cloud, even when they're working for companies that have an 8 or 9 figure a year Google Cloud spend. In one of many examples, the user could see that Google must've been dropping packets, and Google support kept insisting that the drops were happening in the customer's datacenter despite packet traces showing that this could not possibly be the case. 
The last I heard, they gave up on that one, but sometimes when an issue is a total showstopper, someone will call up a buddy of theirs at Google to get support because the standard support is often completely ineffective. And this isn't unique to Google — at another cloud vendor, a former colleague of mine was in the room for a conversation where a very senior engineer was asked to look into an issue where a customer was complaining that they were seeing 100% of packets get dropped for a few seconds at a time, multiple times an hour. The engineer responded with something like &quot;it's the cloud, they should deal with it&quot;, before being told they couldn't ignore the issue as usual because the issue was coming from [VIP customer] and it was interrupting [one of the world's largest televised sporting events]. That one got fixed, but, odds are, you aren't that important, even if you're paying hundreds of millions a year.</p> <p>And of course this kind of support isn't unique to cloud vendors. For example, there <a href="https://news.ycombinator.com/item?id=34233011">was this time Stripe held $400k from a customer for over a month without explanation</a>, and every request to support got a response that was as ridiculous as the ones we just looked at. The user availed themself of the only reliable Stripe support mechanism, posting to HN and hoping to hit #1 on the front page, which worked, although many commenters made the usual comments like &quot;Flagged because we are seeing a lot of these on HN, and they seem to be attempts to fraudulently manipulate customer support, rather than genuine stories&quot;, with multiple people suggesting or insinuating that the user was doing something illicit or fraudulent, but it turned out that it was an error on Stripe's end, compounded by Stripe's big company support. At one point, the user notes</p> <blockquote> <p>While I was writing my HN post I was also on chat with Stripe for over an hour. No new information. They were basically trying to shut down the chat with me until I sent them the HN story and showed that it was getting some traction. Then they started working on my issue again and trying to communicate with more people</p> </blockquote> <p>And then the issue was fixed the next day.</p> <p>Although, in principle, companies could leverage their economies of scale to deliver more efficient support as they become larger, in practice they tend to use their economies of scale to deliver worse, but cheaper and more profitable, support. For example, on Google Play store approval support, a Google employee notes:</p> <blockquote> <p>a lot of that was outsourced to overseas which resulted in much slower response time. Here stateside we had a lot of metrics in place to fast response. Typically your app would get reviewed the same day. Not sure what it's like now but the managers were incompetent back then even so</p> </blockquote> <p><a href="https://twitter.com/RMac18/status/1382366931307565057">And a former FB support person notes</a>:</p> <blockquote> <p>The big problem here is the division of labor. Those who spend the most time in the queues have the least input as to policy. Analysts are able to raise issues to QAs who can then raise them to Facebook FTEs. It can take months for issues to be addressed, if they are addressed at all. The worst part is that doing the common sense thing and implementing the spirit of the policy, rather than the letter, can have a negative effect on your quality score. 
I often think about how there were several months during my tenure when most photographs of mutilated animals were allowed on a platform without a warning screen due to a carelessly worded policy &quot;clarification&quot; and there was nothing we could do about it.</p> </blockquote> <p>If you've ever wondered why your support person is responding nonsensically, sometimes it's the obvious reason that support has been outsourced to someone making $1/hr (when I looked up the standard rates for one country that a lot of support is outsourced to, a fairly standard rate works out to about $1/hr) who doesn't really speak your language and is reading from a flowchart without understanding anything about the system they're giving support for, but another, less obvious, reason is that the support person may be penalized and eventually fired if they take actions that make sense instead of following the nonsensical flowchart that's in front of them.</p> <p>Coming back to the &quot;they seem to be attempts to fraudulently manipulate customer support, rather than genuine stories&quot; comment, this is a sentiment I've commonly seen expressed by engineers at companies that mete out arbitrary and capricious bans. I'm sympathetic to how people get here. <a href="https://twitter.com/danluu/status/964562384558927872">As I noted before I joined Twitter, commenting on public information</a></p> <blockquote> <p>Turns out twitter is removing ~1M bots/day. Twitter only has ~300M MAU, making the error tolerance v. low. This seems like a really hard problem ... Gmail's spam filter gives me maybe 1 false positive per 1k correctly classified ham ... Regularly wiping the same fraction of real users in a service would be [bad].</p> </blockquote> <p>It is actually true that, if you, an engineer, dig into the support queue at some giant company and look at people appealing bans, almost all of the appeals should be denied. But, my experience from having talked to engineers working on things like anti-fraud systems is that many, and perhaps most, round &quot;almost all&quot; to &quot;all&quot;, which is both quantitatively and qualitatively different. Having engineers who work on these systems believe that &quot;all&quot; and not &quot;almost all&quot; of their decisions are correct results in bad experiences for users.</p> <p>For example, there's a social media company that's famous for incorrectly banning users (at least 10% of people I know have lost an account due to incorrect bans and, if I search for a random person I don't know, there's a good chance I get multiple accounts for them, with some recent one that has a profile that reads &quot;used to be @[some old account]&quot;, with no forward from the old account to the new one because they're now banned). When I ran into a senior engineer from the team that works on this stuff, I asked him why so many legitimate users get banned and he told me something like &quot;that's not a problem, the real problem is that we don't ban enough accounts. Everyone who's banned deserves it, it's not worth listening to appeals or thinking about them&quot;. Of course it's true that <a href="https://twitter.com/danluu/status/964562384558927872">most content on every public platform is bad content, spam, etc.</a>, so if you have any sort of signal at all on whether or not something is bad content, when you look at it, it's likely to be bad content. But this doesn't mean the converse, that almost no users are banned incorrectly, is true. 
And if senior people on the team that classifies which content is bad have the attitude that we shouldn't worry about false positives because almost all flagged content is bad, we'll end up with a system that has a large number of false positives. I later asked around to see what had ever been done to reduce false positives in the fraud detection systems and found out that there was no systematic attempt at tracking false positives at all, no way to count cases where employees filed internal tickets to override bad bans, etc. At the meta level, there was some mechanism to decrease the false negative rate (e.g., someone sees bad content that isn't being caught then adds something to catch more bad content) but, without any sort of tracking of false positives, there was effectively no mechanism to decrease the false positive rate. It's no surprise that this meta system resulted in over 10% of people I know getting incorrect suspensions or bans. And, as Patrick McKenzie says, the optimal rate of false positives isn't zero. But when you have engineers who have the attitude that they've done enough legwork that false positives are impossible, it's basically guaranteed that the false positive rate is higher than optimal. When you combine this with normal big company levels of support, it's a recipe for kafkaesque user experiences.</p> <p>Another time, I commented on how an announced change in Uber's moderation policy seemed likely to result in false positive bans. An Uber TL immediately took me to task, saying that I was making unwarranted assumptions on how banning works, that Uber engineers go to great lengths to make sure that there are no false positive bans, there's extensive review to make sure that bans are valid and, in fact, the false positive banning I was concerned about could never happen. And then I got effectively banned due to a false positive in a fraud detection system. <a href="https://www.theguardian.com/technology/2023/apr/16/stop-or-ill-fire-you-the-driver-who-defied-ubers-automated-hr">I was reminded of that incident when Uber incorrectly banned a driver who had to take them to court to even get information on why he was banned, at which point Uber finally actually looked into it (instead of just responding to appeals with fake messages claiming they'd looked into it)</a>. Afterwards, Uber responded to a press inquiry with</p> <blockquote> <p>We are disappointed that the court did not recognize the robust processes we have in place, including meaningful human review, when making a decision to deactivate a driver’s account due to suspected fraud</p> </blockquote> <p>Of course, in that driver's case, there was no robust process for review, nor was there a robust appeals process for my case. When I contacted support, they didn't really read my message and made some change that broke my account even worse than before. Luckily, I have enough Twitter followers that some Uber engineers saw my tweet about the issue and got me unbanned, but that's not an option that's available to most people, leading to weird stuff like <a href="customer-service/#dentist">this Facebook ad targeted at Google employees, from someone desperately seeking help with their Google account</a>.</p> <p>And even when you know someone on the inside, it's not always easy to get the issue fixed because even if the company's effectiveness doesn't increase as the company gets bigger, the complexity of the systems does increase. 
A nice example of this is <a href="https://twitter.com/GergelyOrosz/status/1469968831372312578">Gergely Orosz's story about when the manager of the payments team left Uber and then got banned from Uber due to an inscrutable ML anti-fraud algorithm deciding that the former manager of the payments team was committing payments fraud</a>. It took six months of trying to get the problem fixed before the issue was even mitigated. And, by the way, they never managed to understand what happened and fix the underlying issue; instead, they added the former manager of the payments team to a special whitelist, not fixing the issue for any other user and, presumably, severely reducing or perhaps even entirely removing payment fraud protections for the former manager's account.</p> <p>No doubt they would've fixed the underlying issue if it were easy to, but as companies scale up, they produce both technical and non-technical bureaucracy that makes systems opaque even to employees.</p> <p>Another example of that is how, at a company that has a ranked social feed, the idea that you could eliminate stuff you didn't want in your ranked feed by adding filters for things like <code>timeline_injection:false</code>, <code>interstitial_ad_op_out</code>, etc., would go viral. The first time this happened, a number of engineers looked into it and thought that the viral tricks didn't work. They weren't 100% sure and were relying on ideas like &quot;no one can recall a system that would do something like this ever being implemented&quot; and &quot;if you search the codebase for these strings, they don't appear&quot;, and &quot;we looked at the systems we think might do this and they don't appear to do this&quot;. There was moderate confidence that this trick didn't work, but no one would state with certainty that the trick didn't work because, as at all large companies, the aggregate behavior of the system is beyond human understanding and even parts that could be understood often aren't because there are other priorities.</p> <p>A few months later, the trick went viral again and people were generally referred to the last investigation when they asked if it was real, except that one person actually tried the trick and reported that it worked. They wrote a slack message about how the trick did work for them, but almost no one noticed that the one person who tried reproducing the trick found that it worked. Later, when the trick would go viral again, people would point to the discussions about how people thought the trick didn't work, and the message noting that it appeared to work (almost certainly not by the mechanism that users think, and instead just because having a long list of filters causes something to time out, or something similar) basically got lost because there's too much information to read all of it.</p> <p>In my social circles, many people have read James Scott's Seeing Like a State, which is subtitled How Certain Schemes to Improve the Human Condition Have Failed. A key concept from the book is &quot;legibility&quot;, what a state can see, and how this distorts what states do. One could easily write a highly analogous book, Seeing Like a Tech Company, about what's illegible to companies that scale up, at least as companies are run today. A simple example of this is that, in many video games, including ones made by game studios that are part of a $3T company, it's easy to get someone suspended or banned by having a bunch of people report the account for bad behavior. 
What's legible to the game company is the rate of reports and what's not legible is the player's actual behavior (it could be legible, but the company chooses not to have enough people or skilled enough people examine actual behavior); <a href="https://news.ycombinator.com/item?id=31581510">and many people have reported similar bannings with social media companies</a>. When it comes to things like anti-fraud systems, what's legible to the company tends to be fairly illegible to humans, even humans working on the anti-fraud systems themselves.</p> <p>Although he wasn't specifically talking about an anti-fraud system, in a Special Master's hearing, Eugene Zarashaw, a director at Facebook, made this comment, which illustrates the illegibility of Facebook's own systems:</p> <blockquote> <p>It would take multiple teams on the ad side to track down exactly the — where the data flows. I would be surprised if there’s even a single person that can answer that narrow question conclusively</p> </blockquote> <p>Facebook was unfairly and mostly ignorantly raked over the coals for this statement (<a href="#appendix-how-much-should-we-trust-journalists-summaries-of-leaked-documents">we'll discuss that in an appendix</a>), but it is generally true that it's <abbr title="depending on what you mean by understand, it could be fair to say that it's impossible">difficult to understand</abbr> how a system the size of Facebook works.</p> <p>In principle, companies could augment the legibility of their inscrutable systems by having decently paid support people look into things that might be edge-case issues with severe consequences, where the system is &quot;misunderstanding&quot; what's happening, but, in practice, companies pay these support people extremely poorly and hire people who really don't understand what's going on, and then give them instructions which ensure that they generally do not succeed at resolving legibility issues.</p> <p>One thing that helps the forces of illegibility win at scale is that, as a highly-paid employee of one of these huge companies, it's easy to look at the millions or billions of people (and bots) out there and think of them all as numbers. As the saying goes, &quot;the death of one man is a tragedy. The death of a million is a statistic&quot; and, as we noted, engineers often turn thoughts like &quot;almost all X is fraud&quot; to &quot;all X is fraud, so we might as well just ban everyone who does X and not look at appeals&quot;. The culture that modern tech companies have, of looking for scalable solutions at all costs, makes this worse than in other industries even at the same scale, and tech companies also have unprecedented scale.</p> <p>For example, in response to someone noting that FB Ad Manager claims you can run an ad with a potential reach of 101M people in the U.S. aged 18-34 when the U.S. census had the total population of people aged 18-34 as 76M, the former PM of the ads targeting team responded with</p> <blockquote> <p>Think at FB scale</p> </blockquote> <p>And explained that you can't expect slice &amp; dice queries to work for something like the 18-34 demographic in the U.S. at &quot;FB scale&quot;. There's a meme at Google that's used ironically in cases like this, where people will say &quot;I can't count that low&quot;. Here's the former PM of FB ads saying, non-ironically, &quot;FB can't count that low&quot; for numbers like 100M. 
Not only does FB not care about any individual user (unless they're famous), this PM claims they can't be bothered to care that groups of 100M people are tracked accurately.</p> <p>Coming back to the consequences of poor support, a common response to hearing about people getting incorrectly banned from one of these huge services is &quot;Good! Why would you want to use Uber/Amazon/whatever anyway? They're terrible and no one should use them&quot;. I disagree with this line of reasoning. For one thing, <a href="https://twitter.com/danluu/status/891508449414197248">why should you decide for that person whether or not they should use a service or what's good for them?</a> For another (and this is a large enough topic that it should be its own post, so I'll just mention it briefly <a href="https://mastodon.social/@whitequark/111280549888138665">and link to this lengthier comment from @whitequark</a>) most services that people write off as unnecessary conveniences that you should just do without are actually serious accessibility issues for quite a few people (in absolute, if not necessarily percentage, terms). When we're talking about small businesses, those people can often switch to another business, but with things like Uber and Amazon, there are sometimes zero or one alternatives that offer similar convenience and when there's one, getting banned due to some random system misfiring can happen with the other service as well. <a href="https://www.reddit.com/r/mildlyinfuriating/comments/186redy/doordash_denied_refund_for_wrong_order_delivered/">For example, in response to many people commenting on how you should just issue a chargeback and accept getting banned from DoorDash when they don't deliver, a disabled user responds</a>:</p> <blockquote> <p>I'm disabled. Don't have a driver's license or a car. There isn't a bus stop near my apartment, I actually take paratransit to get to work, but I have to plan that a day ahead. Uber pulls the same shit, so I have to cycle through Uber, Door dash, and GrubHub based on who has coupons and hasn't stolen my money lately. Not everyone can just go pick something up.</p> </blockquote> <p>Also, when talking about this class of issue, involvement is often not voluntary, <a href="https://news.ycombinator.com/item?id=39059307">such as in the case of this Fujitsu bug that incorrectly put people in prison</a>.</p> <p>On the third issue, the impossibility of getting people to agree on what constitutes spam, fraud, and other disallowed content, <a href="impossible-agree/">we discussed that in detail here</a>. We saw that, even in a trivial case with a single, uncontroversial, simple rule, people can't agree on what's allowed. And, as you add more rules or add topics that are controversial or scale up the number of people, it becomes even harder to agree on what should be allowed.</p> <p>To recap, we looked at three areas where diseconomies of scale make moderation, support, anti-fraud, and anti-spam worse as companies get bigger. The first was that, even in cases where there's broad agreement that something is bad, such as fraud/scam/phishing websites in search results, <a href="seo-spam/">the largest companies with the most sophisticated machine learning can't actually keep up with a single (albeit very skilled) person working on a small search engine</a>. The returns to scammers are much higher if they take on the biggest platforms, resulting in the anti-spam/anti-fraud/etc. 
problem being extremely non-linearly hard.</p> <p>To get an idea of the difference in scale, HN &quot;hellbans&quot; spammers and people who post some kinds of vitriolic comments. Most spammers don't seem to realize they're hellbanned and will keep posting for a while, so if you browse the &quot;newest&quot; (submissions) page while logged in, you'll see a steady stream of automatically killed stories from these hellbanned users. While there are quite a few of them, the percentage is generally well under half. When we looked at a &quot;mid-sized&quot; big tech company like Twitter circa 2017, based on the public numbers, if spam bots were hellbanned instead of removed, spam is so much more prevalent that it would be nearly all you'd see if you were able to see it. And, as big companies go, 2017-Twitter isn't that big. As we also noted, the former PM of FB ads targeting explained that numbers as low as 100M are in the &quot;I can't count that low&quot; range, too small to care about; to him, basically a rounding error. The non-linear difference in difficulty is much worse for a company like FB or Google. The non-linearity of the difficulty of these problems is, apparently, more than a match for whatever ML or AI techniques Zuckerberg and other tech execs want to brag about.</p> <p>In testimony in front of Congress, you'll see execs defend the effectiveness of these systems at scale with comments like &quot;we can identify X with 95% accuracy&quot;, a statement that may technically be correct, but seems designed to deliberately mislead an audience that's presumed to be innumerate. If you use, as a frame of reference, things at a personal scale, 95% might sound quite good. Even for something like HN's scale, 95% accurate spam detection that results in an immediate ban might be sort of alright. Anyway, even if it's not great, people who get incorrectly banned can just email Dan Gackle, who will unban them. As we noted when we looked at the numbers, 95% accurate detection at Twitter's scale would be horrible (and, indeed, the majority of DMs I get are obvious spam). Either you have to back off and only ban users in cases where you're extremely confident, or you ban all your users after not too long and, given how companies like to handle support, appealing means that you'll get a response saying that &quot;your case was carefully reviewed and we have determined that you've violated our policies. This is final&quot;, even for cases where any sort of cursory review would cause a reversal of the ban, like when you ban a user for impersonating themselves. And then at FB's scale, it's even worse and you'll ban all of your users even more quickly, so then you back off and we end up with things like 100k minors a day being exposed to &quot;photos of adult genitalia or other sexually abusive content&quot;.</p> 
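<p>To make that concrete, here's a minimal sketch of the arithmetic, using the rough figures quoted earlier (on the order of 1M bot removals a day and ~300M MAU for 2017-era Twitter) and assuming, since these claims are rarely specific enough to pin down, that 95% accuracy means 95% of removal actions are correct:</p> <pre><code># Illustrative arithmetic only; the inputs are rough figures quoted earlier
# in the post and the interpretation of the accuracy claim is an assumption.
removals_per_day = 1e6      # rough public figure for 2017-era Twitter
accuracy = 0.95             # assume 95% of removal actions are correct
mau = 300e6                 # rough monthly active users at the time

false_positives_per_day = removals_per_day * (1 - accuracy)   # 50,000 a day
false_positives_per_year = false_positives_per_day * 365      # about 18 million
share_of_user_base = false_positives_per_year / mau           # roughly 6%, naively
print(round(false_positives_per_day), round(false_positives_per_year), share_of_user_base)
</code></pre> <p>Even if these figures are off by a factor of a few in either direction (many removed accounts are bots that never counted toward MAU, the same account can be hit repeatedly, and so on), the point stands: an accuracy number that sounds fine at a personal or HN scale still translates into tens of thousands of wrongly affected real users a day at a mid-sized platform's scale.</p> <p>The second area we looked at was support, which tends to get worse as companies get larger. At a high level, it's fair to say that companies don't care to provide decent support (with Amazon being somewhat of an exception here, especially with AWS, but even on the consumer side). Inside the system, there are individuals who care, but if you look at the fraction of resources expended on support vs. growth or even fun/prestige projects, support is an afterthought. 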
Back when DeepMind was training a StarCraft AI, it's plausible that Alphabet was spending more money playing StarCraft than on support agents (and, if not, just throw in one or two more big AI training projects and you'll be there, especially if you include the amortized cost of developing custom hardware, etc.).</p> <p>It's easy to see how little big companies care. All you have to do is contact support and get connected to someone who's paid $1/hr to respond to you in a language they barely know, attempting to help solve a problem they don't understand by walking through some flowchart, or appeal an issue and get told &quot;after careful review, we have determined that you have [done the opposite of what you actually did]&quot;. In some cases, you don't even need to get that far, like when <a href="https://mastodon.social/@alexjsp/109760284680279691">following Instagram's support instructions results in an infinite loop that takes you back where you started</a> and the <a href="https://news.ycombinator.com/item?id=34549132">&quot;click here if this wasn't you&quot; link returns a 404</a>. I've run into an infinite loop like this once, with Verizon, and it persisted for at least six months. I didn't check after that, but I'd bet on it persisting for years. If you had an onboarding or sign-up page that had an issue like this, that would be considered a serious bug that people should prioritize because that impacts growth. But for something like account loss due to scammers taking over accounts, that might get fixed after months or years. Or maybe not.</p> <p>If you ever talk to people who work in support at a company that really cares about support, it's immediately obvious that they operate completely differently from typical big tech company support, in terms of process as well as culture. Another way you can tell that big companies don't care about support is how often big company employees and execs who've never looked into how support is done or could be done will tell you that it's impossible to do better.</p> <p>When you talk to people who work on support at companies that do actually care about this, it's apparent that it can be done much better. While I was writing this post, I actually <abbr title="to be clear, this was first-line support where I talked to normal, non-technical, users, not a role like 'Field Applications Engineer', where the typical user you interact with is an engineer">did support</abbr> at a company that does support decently well (for a tech company, adjusted for size, I'd say they're well above <a href="p95-skill/">99%-ile</a>), including going through the training and onboarding process for support folks. <a href="sounds-easy/">Executing anything well at scale is non-trivial</a>, so I don't mean to downplay how good their support org is, but the most striking thing to me was how much of the effectiveness of the org naturally followed from caring about providing a good support experience for the user. A full discussion of what that means is too long to include here, so we'll look at this in more detail another time, but one example is that, when we look at how big company support responds, it's often designed to discourage the user from responding (&quot;this review is final&quot;) or to justify, putatively to the user, that the company is doing an adequate job (&quot;this was not a purely automated process and each appeal was reviewed by humans in a robust process that ... &quot;). 
This company's training instructs you to do the opposite of the standard big company &quot;please go away&quot;-style and &quot;we did a great job and have a robust process, therefore complaints are invalid&quot;-style responses. For every anti-pattern you commonly see in support, the training tells you to do the opposite and discusses why the anti-pattern results in a bad user experience. Moreover, the culture has deeply absorbed these ideas (or rather, these ideas come out of the culture) and there are processes for <abbr title="this word is used colloquially and not rigorously since you can't really guarantee this at scale">ensuring</abbr> that people really know what it means to provide good support and follow through on it, support folks have ways to directly talk to the developers who are implementing the product, etc.</p> <p>If those people cared about doing good support, they could talk to people who work in support orgs that are good at helping users or even try working in one before explaining how it's impossible to do better, but this generally isn't done. Their company's support org leadership could do this as well, or do what I did and actually directly work in a support role in an effective support org, but this doesn't happen. If you're a cynic, this all makes sense. In the same way that cynics advise junior employees &quot;big company HR isn't there to help you; their job is to protect the company&quot;, a cynic can credibly argue &quot;big company support isn't there to help the user; their job is to <abbr title="for different values of protect the company than HR; to me, this feels more analogous to insurance, where there are people whose job it is to keep costs down. I've had insurance companies deny coverage on things that, by their written policy, should clearly be covered. In one case, I could call a company rep on the phone, who would explain to me why their company was wrong to deny the claim and how I should appeal, but written appeals were handled in writing and always denied. Luckily, when working for a big tech company, you can tell your employer what's happening, who will then tell the insurance company to stop messing with you, much like our big sporting event cloud support story, but for most users of insurance, this isn't an option and their only recourse is to sue, which they generally won't do or will settle for peanuts even if they do sue. Insurance companies know this and routinely deny claims without even looking at them (this has come out in discovery in lawsuits); accounting for the cost of lawsuits, this kind of claim denial is much cheaper than handling claims correctly. Similarly, providing actual support costs much more than not doing so and getting the user to stop pestering the company about how their account is broken saves money, hence standard responses claiming that the review is final, nothing can be done, etc.; anything to reduce the support cost to the company (except actually providing support)">protect the company</abbr>&quot;, so of course big companies don't try to understand how companies that are good at supporting users do support because that's not what big company support is for.</p> <p>The third area we looked at was how it's impossible for people to agree on how a platform should operate and how people's biases mean that people don't understand how difficult a problem this is. 
For Americans, prominent cases of this are the left and right wing conspiracy theories that pop up every time some bug pseudo-randomly causes any kind of service disruption or banning.</p> <p><a href="https://twitter.com/greenberg/status/872115047769788418">In a tweet, Ryan Greenberg joked</a>:</p> <blockquote> <p>Come work at Twitter, where your bugs TODAY can become conspiracy theories of TOMORROW!</p> </blockquote> <p>In my social circles, people like to make fun of all of the absurd right-wing conspiracy theories that get passed around after some bug causes people to incorrectly get banned, causes the site not to load, etc., or even when some new ML feature correctly takes down a huge network of scam/spam bots, which also happens to reduce the follower count of some users. But of course this isn't unique to the right, and left-wing thought leaders and <a href="https://twitter.com/altluu/status/1588573465359319041/">politicians come up with their own conspiracy theories as well</a>.</p> <p>Putting all three of these together, worse detection of issues, worse support, and a harder time reaching agreement on policies, we end up with the situation we noted at the start where, in a poll of my Twitter followers, people who mostly work in tech and are generally fairly technically savvy, only <a href="https://twitter.com/danluu/status/1570604630350106624">2.6% of people thought that the biggest companies were the best at moderation and spam/fraud filtering</a>, so it might seem a bit silly to spend so much time belaboring the point. When you sample the U.S. population at large, a larger fraction of people say they believe in conspiracy theories like vaccines putting a microchip in you or that we never landed on the moon, and I don't spend my time explaining why vaccines do not actually put a microchip in you or why it's reasonable to think that we landed on the moon. One reason it's perhaps not so silly is that I've been watching the &quot;only big companies can handle these issues&quot; rhetoric with concern as it catches on among non-technical people, like regulators, lawmakers, and <a href="https://twitter.com/altluu/status/1484718002046070784">high-ranking government advisors, who often listen to and then regurgitate nonsense</a>. Maybe next time you run into a lay person who tells you that only the largest companies could possibly handle these issues, you can politely point out that there's very strong consensus the other way among tech folks<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">5</a></sup>.</p> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <b><a rel="nofollow" href="https://www.propelauth.com/?utm_source=danluu.com"> If you're a founder or early-stage startup looking for an auth solution, PropelAuth is targeting your use case</b></a>. Although they can handle other use cases, they're currently specifically trying to make life easier for pre-launch startups that haven't invested in an auth solution yet. 
Disclaimer: I'm an investor</p> <p><i>Thanks to Gary Bernhardt, Peter Bhat Harkins, Laurence Tratt, Dan Gackle, Sophia Wisdom, David Turner, Yossi Kreinin, Justin Blank, Ben Cox, Horace He, @borzhemsky, Kevin Burke, Bert Muthalaly, Sasuke, anonymous, Zach Manson, Joachim Schipper, Tony D'Souza, and @GL1zdA for comments/corrections/discussion.</i></p> <h2 id="appendix-techniques-that-only-work-at-small-scale">Appendix: techniques that only work at small scale</h2> <p>This post has focused on the disadvantages of bigness, but we can also flip this around and look at the advantages of smallness.</p> <p>As mentioned, the best experiences I've had on platforms are a side effect of doing things that don't scale. One thing that can work well is to have a single person, with a single vision, handling the entire site or, when that's too big, a key feature of the site.</p> <p>I'm on a number of small discords that have good discussion and essentially zero scams, spam, etc. The strategy for this is simple: the owner of the channel reads every message and bans any scammers or spammers who show up. When you get to a bigger site, like lobste.rs, or even bigger like HN, that's too large for someone to read every message (well, this could be done for lobste.rs, but considering that it's a spare-time pursuit for the owner and the volume of messages, it's not reasonable to expect them to read every message in a short timeframe), but there's still a single person who provides the vision for what should happen, even if the sites are large enough that it's not reasonable to literally read every message. The &quot;no vehicles in the park&quot; problem doesn't apply here because a person decides what the policies should be. You might not like those policies, but you're welcome to find another small forum or start your own (and this is actually how lobste.rs got started — under <a href="https://news.ycombinator.com/item?id=4108008">HN's previous moderation regime, which was known for banning people who disagreed</a> with them, <a href="https://jcs.org/2012/06/13/hellbanned_from_hacker_news">Joshua Stein was banned for publicly disagreeing with an HN policy</a>, so Joshua created lobsters, which he eventually handed off to Peter Bhat Harkins).</p> <p><a href="https://metatalk.metafilter.com/20206/WWIC">There's also this story about craigslist in the early days, as it was just getting big enough to have a serious scam and spam problem</a></p> <blockquote> <p>... we were stuck at SFO for something like four hours and getting to spend half a workday sitting next to Craig Newmark was pretty awesome.</p> <p>I'd heard Craig say in interviews that he was basically just &quot;head of customer service&quot; for Craigslist but I always thought that was a throwaway self-deprecating joke. Like if you ran into Larry Page at Google and he claimed to just be the janitor or guy that picks out the free cereal at Google instead of the cofounder. But sitting next to him, I got a whole new appreciation for what he does. He was going through emails in his inbox, then responding to questions in the craigslist forums, and hopping onto his cellphone about once every ten minutes. Calls were quick and to the point &quot;Hi, this is Craig Newmark from craigslist.org. We are having problems with a customer of your ISP and would like to discuss how we can remedy their bad behavior in our real estate forums&quot;. 
He was literally chasing down forum spammers one by one, sometimes taking five minutes per problem, sometimes it seemed to take half an hour to get spammers dealt with. He was totally engrossed in his work, looking up IP addresses, answering questions best he could, and doing the kind of thankless work I'd never seen anyone else do with so much enthusiasm. By the time we got on our flight he had to shut down and it felt like his giant pile of work got slightly smaller but he was looking forward to attacking it again when we landed.</p> </blockquote> <p>At some point, if sites grow, they get big enough that a person can't really own every feature and every moderation action on the site, but sites can still get significant value out of having a single person own something that people would normally think is automated. A famous example of this is how <a href="https://news.ycombinator.com/item?id=36217321">the Digg &quot;algorithm&quot; was basically one person</a>:</p> <blockquote> <p>What made Digg work really was one guy who was a machine. He would vet all the stories, infiltrate all the SEO networks, and basically keep subverting them to keep the Digg front-page usable. Digg had an algorithm, but it was basically just a simple algorithm that helped this one dude 10x his productivity and keep the quality up.</p> <p>Google came to buy Digg, but figured out that really it's just a dude who works 22 hours a day that keeps the quality up, and all that talk of an algorithm was smoke and mirrors to trick the SEO guys into thinking it was something they could game (they could not, which is why front page was so high quality for so many years). Google walked.</p> <p>Then the founders realised if they ever wanted to get any serious money out of this thing, they had to fix that. So they developed &quot;real algorithms&quot; that independently attempted to do what this one dude was doing, to surface good/interesting content.</p> <p>...</p> <p>It was a total shit-show ... The algorithm to figure out what's cool and what isn't wasn't as good as the dude who worked 22 hours a day, and without his very heavy input, it just basically rehashed all the shit that was popular somewhere else a few days earlier ... Instead of taking this massive slap to the face constructively, the founders doubled-down. And now here we are.</p> <p>...</p> <p>Who I am referring to was named Amar (his name is common enough I don't think I'm outing him). He was the SEO whisperer and &quot;algorithm.&quot; He was literally like a spy. He would infiltrate the awful groups trying to game the front page and trick them into giving him enough info that he could identify their campaigns early, and kill them. All the while pretending to be an SEO loser like them.</p> </blockquote> <p><a href="https://mastodon.social/@ajroach42@retro.social/110499388858867441">Etsy supposedly used the same strategy as well</a>.</p> <p>Another class of advantage that small sites have over large ones is that the small site usually doesn't care about being large and can do things that you wouldn't do if you wanted to grow. For example, consider these two comments made in <a href="https://news.ycombinator.com/item?id=39223766">the midst of a large flamewar on HN</a></p> <blockquote> <p>My wife spent years on Twitter embroiled in a very long running and bitter political / rights issue. She was always thoughtful, insightful etc. 
She'd spend 10 minutes rewording a single tweet to make sure it got the real point across in a way that wasn't inflammatory, and that had a good chance of being persuasive. With 5k followers, I think her most popular tweets might get a few hundred likes. The one time she got drunk and angry, she got thousands of supportive reactions, and her followers increased by a large % overnight. And that scared her. She saw the way &quot;the crowd&quot; was pushing her. Rewarding her for the smell of blood in the water.</p> <p>I've turned off both the flags and flamewar detector on this article now, in keeping with the first rule of HN moderation, which is (I'm repeating myself but it's probably worth repeating) that we moderate HN less, not more, when YC or a YC-funded startup is part of a story ... Normally we would never late a ragestorm like this stay on the front page—there's zero intellectual curiosity here, as the comments demonstrate. This kind of thing is obviously off topic for HN: <a href="https://news.ycombinator.com/newsguidelines.html">https://news.ycombinator.com/newsguidelines.html</a>. If it weren't, the site would consist of little else. Equally obvious is that this is why HN users are flagging the story. They're not doing anything different than they normally would.</p> </blockquote> <p>For a social media site, low-quality high-engagement flamebait is one of the main pillars that drive growth. HN, which cares more about discussion quality than growth, tries to detect and suppress it (with exceptions like criticism of HN itself, of YC companies like Stripe, etc., to ensure a lack of bias). Any social media site that aims to grow does the opposite: it implements a ranked feed that puts the content that is most enraging and most engaging in front of the people its algorithms predict will be the most enraged and engaged by it. For example, let's say you're in a country with very high racial/religious/factional tensions, with regular calls for violence, etc. What's the most engaging content? Well, that would be content calling for the death of your enemies, so you get things like a livestream of someone calling for the death of the other faction and then grabbing someone and beating them, shown to a lot of people. After all, what's more engaging than a beatdown of your sworn enemy? A theme of Broken Code is that someone will find some harmful content they want to suppress, but then get overruled because that would reduce engagement and growth. HN has no such goal, so it has no problem suppressing or eliminating content that HN deems to be harmful.</p> <p>Another thing you can do if growth isn't your primary goal is to deliberately make user-signups high friction. HN does a little bit of this by having a &quot;login&quot; link but not a &quot;sign up&quot; link, and sites like lobste.rs and metafilter do even more of this.</p> <h2 id="appendix-theory-vs-practice">Appendix: Theory vs. practice</h2> <p>In the main doc, we noted that big company employees often say that it's impossible to provide better support for theoretical reason X, without ever actually looking into how one provides support or what companies that provide good support do. When the now-$1T companies were the size at which many companies do provide good support, they also did not provide good support, so this doesn't seem to come from size; these huge companies didn't even attempt to provide good support, then or now. 
This theoretical, plausible-sounding reason doesn't really hold up in practice.</p> <p>This is generally the case for theoretical discussions on diseconomies of scale of large tech companies. Another example is an idea mentioned at the start of this doc, that being a larger target has a larger impact than having more sophisticated ML. A standard extension of this idea that I frequently hear is that big companies actually do have the best anti-spam and anti-fraud, but they're also subject to the most sophisticated attacks. I've seen this used as a justification for why big companies seem to have worse anti-spam and anti-fraud than a forum like HN. While it's likely true that big companies are subject to the most sophisticated attacks, if this whole idea held and it were the case that their systems were really good, it would be harder, in absolute terms, to spam or scam people on reddit and Facebook than on HN, but that's not the case at all.</p> <p>If you actually try to spam, it's extremely easy to do so on large platforms and the most obvious things you might try will often work. As an experiment, I made a new reddit account and tried to get nonsense onto the front page and found this completely trivial. Similarly, it's completely trivial to take over someone's Facebook account and post obvious scams for months to years, with extremely obvious markers that they're scams, many people replying in concern that the account has been taken over and is running scams (unlike working in support and spamming reddit, I didn't try taking over people's Facebook accounts, but given people's password practices, it's very easy to take over an account, and given how Facebook responds to these takeovers when a friend's account is taken over, we can see that attacks that do the most naive thing possible, with zero sophistication, are not defeated), etc. In absolute terms, it's actually more difficult to get spammy or scammy content in front of eyeballs on HN than it is on reddit or Facebook.</p> <p>The theoretical reason here is one that would be significant if large companies were even remotely close to doing the kind of job they could do with the resources they have, but we're not even close to being there.</p> <p>To avoid belaboring the point in this already very long document, I've only listed a couple of examples here, but I find this pattern to hold true of almost every counterargument I've heard on this topic. If you actually look into it a bit, these theoretical arguments are classic <a href="cocktail-ideas/">cocktail party ideas that have little to no connection to reality</a>.</p> <p>A meta point here is that you absolutely cannot trust vaguely plausible sounding arguments from people on this since virtually all of them fall apart when examined in practice. It seems quite reasonable to think that a business the size of reddit would have more sophisticated anti-spam systems than HN, which has a single person who both writes the code for the anti-spam systems and does the moderation. But the most naive and simplistic tricks you might use to put content on the front page work on reddit and don't work on HN. I'm not saying you can't defeat HN's system, but doing so would take a little bit of thought, which is not the case for reddit and Facebook. 
And likewise for support, where once you start talking to people about how to run a support org that's good for users, you immediately see that the most obvious things have not been seriously tried by big tech companies.</p> <h2 id="appendix-how-much-should-we-trust-journalists-summaries-of-leaked-documents">Appendix: How much should we trust journalists' summaries of leaked documents?</h2> <p>Overall, very little. As we discussed <a href="cruise-report/">when we looked at the Cruise pedestrian accident report</a>, almost every time I read a journalist's take on something (with rare exceptions like Zeynep), the journalist has a spin they're trying to put on the story and the impression you get from reading the story is quite different from the impression you get if you look at the raw source; <a href="gender-gap/">it's fairly common that there's so much spin that the story says the opposite of what the source docs say</a>. That's one issue.</p> <p>The full topic here is big enough that it deserves its own document, so we'll just look at two examples. The first is one we briefly looked at, when Eugene Zarashaw, a director at Facebook, testified in a Special Master’s Hearing. He said</p> <blockquote> <p>It would take multiple teams on the ad side to track down exactly the — where the data flows. I would be surprised if there’s even a single person that can answer that narrow question conclusively</p> </blockquote> <p>Eugene's testimony resulted in headlines like , &quot;Facebook Has No Idea What Is Going on With Your Data&quot;, &quot;Facebook engineers admit there’s no way to track all the data it collects on you&quot; (with a stock photo of an overwhelmed person in a nest of cables, grabbing their head) and &quot;Facebook Engineers: We Have No Idea Where We Keep All Your Personal Data&quot;, etc.</p> <p>Even without any technical knowledge, any unbiased person can plainly see that these headlines are inaccurate. There's a big difference between it taking work to figure out exactly where all data, direct and derived, for each user exists, and having no idea where the data is. If I Google, logged out with no cookies, <code>Eugene Zarashaw facebook testimony</code>, every single above the fold result I get is misleading, false, clickbait, like the above.</p> <p>For most people with relevant technical knowledge, who understand the kind of systems being discussed, Eugene Zarashaw's quote is not only not egregious, it's mundane, expected, and reasonable.</p> <p>Despite this lengthy disclaimer, there are a few reasons that I feel comfortable citing Jeff Horwitz's Broken Code as well as a few stories that cover similar ground. The first is that, if you delete all of the references to these accounts, the points in this doc don't really change, just like they wouldn't change if you delete 50% of the user stories mentioned here. The second is that, at least for me, the most key part is the attitudes on display and not the specific numbers. I've seen similar attitudes in companies I've worked for and heard about them inside companies where I'm well connected via my friends and I could substitute similar stories from my friends, but it's nice to be able to use already-public sources instead of using anonymized stories from my friends, so the quotes about attitude are really just a stand-in for other stories which I can verify. 
The third reason is a bit too subtle to describe here, so we'll look at that when I expand this disclaimer into a standalone document.</p> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <b><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl"> If you're looking for work, Freshpaint is hiring (US remote) in engineering, sales, and recruiting</a></b>. Disclaimer: I may be biased since I'm an investor, but they seem to have found product-market fit and are rapidly growing.</p> <h2 id="appendix-erin-kissane-on-meta-in-myanmar">Appendix: Erin Kissane on Meta in Myanmar</h2> <p>Erin starts with</p> <blockquote> <p>But once I started to really dig in, what I learned was so much gnarlier and grosser and more devastating than what I’d assumed. The harms Meta passively and actively fueled destroyed or ended hundreds of thousands of lives that might have been yours or mine, but for accidents of birth. I say “hundreds of thousands” because “millions” sounds unbelievable, but by the end of my research I came to believe that the actual number is very, very large.</p> <p>To make sense of it, I had to try to go back, reset my assumptions, and try build up a detailed, factual understanding of what happened in this one tiny slice of the world’s experience with Meta. The risks and harms in Myanmar—and their connection to Meta’s platform—are meticulously documented. And if you’re willing to spend time in the documents, it’s not that hard to piece together what happened. Even if you never read any further, know this: Facebook played what the lead investigator on the UN Human Rights Council’s Independent International Fact-Finding Mission on Myanmar (hereafter just “the UN Mission”) called a “determining role” in the bloody emergence of what would become the genocide of the Rohingya people in Myanmar.2</p> <p>From far away, I think Meta’s role in the Rohingya crisis can feel blurry and debatable—it was content moderation fuckups, right? In a country they weren’t paying much attention to? Unethical and probably negligent, but come on, what tech company isn’t, at some point?</p> </blockquote> <p>As discussed above, I have not looked into the details enough to determine if the claim that Facebook played a &quot;determining role&quot; in genocide is correct, but at a meta-level (no pun intended), it seems plausible. Every comment I've seen that aims to be a direct refutation of Erin's position is actually pre-refuted by Erin in Erin's text, so it appears that very few of the people who are publicly commenting in disagreement with Erin read the articles before commenting (or they've read them and failed to understand what Erin is saying) and, instead, are disagreeing based on something other than the actual content. <a href="https://twitter.com/danluu/status/1605675278822703104">It reminds me a bit of the responses to David Jackson's proof of the four color theorem. Some people thought it was, finally, a proof, and others thought it wasn't.</a> Something I found interesting at the time was that the people who thought it wasn't a proof had read the paper and thought it seemed flawed, whereas the people who thought it was a proof were going off of signals like David's track record or the prestige of his institution. 
At the time, without having read the paper myself, I guessed (with low confidence) that the proof was incorrect based on the meta-heuristic that thoughts from people who read the paper were stronger evidence than things like prestige. Similarly, I would guess that Erin's summary is at least roughly accurate and that Erin's endorsement of the UN HRC fact-finding mission is correct, although I have lower confidence in this than in my guess about the proof because making a positive claim like this is harder than finding a flaw and the area is one where evaluating a claim is significantly trickier.</p> <p>Unlike with Broken Code, the source documents are available here and it would be possible to retrace Erin's steps, but since there's quite a bit of source material, the claims would need additional reading and analysis to really be convincing, and those claims don't play a determining role in the correctness of this document, I'll leave that for somebody else.</p> <p>On the topic itself, Erin noted that some people at Facebook, when presented with evidence that something bad was happening, laughed it off as they simply couldn't believe that Facebook could be instrumental in something that bad. Ironically, this is fairly similar in tone and content to a lot of the &quot;refutations&quot; of Erin's articles, whose authors appear to have not actually read the articles.</p> <p>The most substantive objections I've seen are around the edges, such as</p> <blockquote> <p>The article claims that &quot;Arturo Bejar&quot; was &quot;head of engineering at Facebook&quot;, which is simply false. He appears to have been a Director, which is a manager title overseeing (typically) less than 100 people. That isn't remotely close to &quot;head of engineering&quot;.</p> </blockquote> <p>What Erin actually said was</p> <blockquote> <p>... Arturo Bejar, one of Facebook’s heads of engineering</p> </blockquote> <p>So the objection is technically incorrect in that it was not said that Arturo Bejar was head of engineering. And, if you read the entire set of articles, you'll see references like &quot;Susan Benesch, head of the Dangerous Speech Project&quot; and &quot;the head of Deloitte in Myanmar&quot;, so it appears that the reason Erin said &quot;one of Facebook’s heads of engineering&quot; is that Erin is using the term head colloquially here (and note that it isn't capitalized, as a title might be), to mean that Arturo was in charge of something.</p> <p>There is a form of the above objection that's technically correct — for an engineer at a big tech company, the term Head of Engineering will generally call to mind an executive who all engineers transitively report into (or, in cases where there are large pillars, perhaps one of a few such people). 
Someone who's fluent in internal tech company lingo would probably not use this phrasing, even when writing for lay people, but this isn't strong evidence of factual errors in the article even if, in an ideal world, journalists would be fluent in the domain-specific connotations of every phrase.</p> <p>The person's objection continues with</p> <blockquote> <p>I point this out because I think it calls into question some of the accuracy of how clearly the problem was communicated to <em>relevant</em> people at Facebook.</p> <p>It isn't enough for someone to tell random engineers or Communications VPs about a complex social problem.</p> </blockquote> <p>On the topic of this post, diseconomies of scale, this objection, if correct, actually supports the post. According to Arturo's LinkedIn, he was &quot;the leader for Integrity and Care Facebook&quot;, and the book Broken Code discusses his role at length, which is very closely related to the topic of Meta in Myanmar. Arturo is not, in fact, one of the &quot;random engineers or Communications VPs&quot; the objection refers to.</p> <p>Anyway, Erin documents that Facebook was repeatedly warned about what was happening, for years. These warnings went well beyond the standard reporting of bad content and fake accounts (although those were also done), and included direct conversations with directors, VPs, and other leaders. These warnings were dismissed and it seems that people thought that their existing content moderation systems were good enough, even in the face of fairly strong evidence that this was not the case.</p> <blockquote> <p>Reuters notes that one of the examples Schissler gives Meta was a Burmese Facebook Page called, “We will genocide all of the Muslims and feed them to the dogs.” 48</p> <p>None of this seems to get through to the Meta employees on the line, who are interested in…cyberbullying. Frenkel and Kang write that the Meta employees on the call “believed that the same set of tools they used to stop a high school senior from intimidating an incoming freshman could be used to stop Buddhist monks in Myanmar.”49</p> <p>Aela Callan later tells Wired that hate speech seemed to be a “low priority” for Facebook, and that the situation in Myanmar, “was seen as a connectivity opportunity rather than a big pressing problem.”50</p> </blockquote> <p>The details make this sound even worse than a small excerpt can convey, so I recommend reading the entire thing, but with respect to the discussion about resources, a key issue is that even after Meta decided to take some kind of action, the result was:</p> <blockquote> <p>As the Burmese civil society people in the private Facebook group finally learn, Facebook has a single Burmese-speaking moderator—a contractor based in Dublin—to review everything that comes in. The Burmese-language reporting tool is, as Htaike Htaike Aung and Victoire Rio put it in their timeline, “a road to nowhere.”</p> </blockquote> <p>Since this was 2014, it's not fair to say that Meta could've spent the $50B metaverse dollars and hired 1.6 million moderators, but in 2014, Meta was still the 4th largest tech company in the world, worth $217B, with a net profit of $3B/yr, so it would've &quot;only&quot; been able to afford something like 100k moderators and support staff if paid at a globally very generous loaded cost of $30k/yr (e.g., <a href="https://jacobin.com/2023/06/meta-is-trying-and-failing-to-crush-unions-in-kenya">Jacobin notes that Meta's Kenyan moderators are paid $2/hr</a> and don't get benefits). 
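To spell out the back-of-envelope arithmetic used here and in the next sentence, here's a quick sketch; the $3B/yr profit, $30k/yr loaded cost, and 0.7% population share are the illustrative figures this post uses, not a claim about Meta's actual budgeting:</p> <pre><code># Back-of-envelope: how many generously paid moderators 2014-era
# Facebook profit could fund, and Myanmar's population-weighted share.
annual_profit = 3_000_000_000        # ~$3B/yr net profit in 2014
loaded_cost_per_moderator = 30_000   # generous global loaded cost, $/yr

total_moderators = annual_profit / loaded_cost_per_moderator      # 100,000
myanmar_population_share = 0.007                                   # ~0.7%
myanmar_moderators = total_moderators * myanmar_population_share   # ~700

print(int(total_moderators), int(myanmar_moderators))  # 100000 700
</code></pre> <p>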
Myanmar's share of the global population was 0.7%, so let's say you consider a developing genocide to be low priority, don't think that additional resources should be deployed to prevent or stop it, and want to allocate a standard moderation share; that gives &quot;only&quot; enough capacity for 700 generously paid moderation and support staff for Myanmar.</p> <p>On the other side of the fence, there actually were 700 people:</p> <blockquote> <p>in the years before the coup, it already had an internal adversary in the military that ran a professionalized, Russia-trained online propaganda and deception operation that maxed out at about 700 people, working in shifts to manipulate the online landscape and shout down opposing points of view. It’s hard to imagine that this force has lessened now that the genocidaires are running the country.</p> </blockquote> <p>These folks didn't have the vaunted technology that Zuckerberg says that smaller companies can't match, but it turns out you don't need billions of dollars of technology when it's 700 on 1 and the 1 is using tools that were developed for a different purpose.</p> <p>As you'd expect if you've ever interacted with the reporting system for a huge tech company, from the outside, nothing people tried worked:</p> <blockquote> <p>They report posts and never hear anything. They report posts that clearly call for violence and eventually hear back that they’re not against Facebook’s Community Standards. This is also true of the Rohingya refugees Amnesty International interviews in Bangladesh</p> </blockquote> <p>In the 40,000 word summary, Erin also digs through whistleblower reports to find things like</p> <blockquote> <p>…we’re deleting less than 5% of all of the hate speech posted to Facebook. This is actually an optimistic estimate—previous (and more rigorous) iterations of this estimation exercise have put it closer to 3%, and on V&amp;I [violence and incitement] we’re deleting somewhere around 0.6%…we miss 95% of violating hate speech.</p> </blockquote> <p>and</p> <blockquote> <p>[W]e do not … have a model that captures even a majority of integrity harms, particularly in sensitive areas … We only take action against approximately 2% of the hate speech on the platform. Recent estimates suggest that unless there is a major change in strategy, it will be very difficult to improve this beyond 10-20% in the short-medium term</p> </blockquote> <p>and</p> <blockquote> <p>While Hate Speech is consistently ranked as one of the top abuse categories in the Afghanistan market, the action rate for Hate Speech is worryingly low at 0.23 per cent.</p> </blockquote> <p>To be clear, I'm not saying that Facebook has a significantly worse rate of catching bad content than other platforms of similar or larger size. 
As we noted above, large tech companies often have fairly high false positive and false negative rates and have employees who dismiss concerns about this, saying that things are fine.</p> <h2 id="appendix-elsewhere">Appendix: elsewhere</h2> <ul> <li>Anna Lowenhaupt Tsing's <a href="https://asletaiwan.org/wp-content/uploads/2021/10/On-nonscalability.pdf">On Nonscalability: The Living World Is Not Amenable to Precision-Nested Scales </a></li> <li><a href="https://www.econtalk.org/glen-weyl-on-antitrust-capitalism-and-radical-reform">Glen Weyl on radical solutions to the concentration of corporate power</a></li> <li><a href="https://thezvi.wordpress.com/2019/05/30/quotes-from-moral-mazes/">Zvi's collection of Quotes from Moral Mazes</a></li> </ul> <h2 id="appendix-moderation-and-filtering-fails">Appendix: Moderation and filtering fails</h2> <p>Since I saw Zuck's statement about how only large companies (and the larger the better) can possibly do good moderation, anti-fraud, anti-spam, etc., I've been collecting links to failures by large companies that I run across when doing normal day-to-day browsing. If I deliberately looked for failures, I'd have a lot more. And, for some reason, some companies don't really trigger my radar for this so, for example, even though I see stories about AirBnB issues all the time, it didn't occur to me to collect them until I started writing this post, so there are only a few AirBnB fails here, even though they'd be up there with Uber in failure count if I actually recorded the links I saw.</p> <p>These are so frequent that, out of eight draft readers, at least two ran into an issue while reading the draft of this doc. Peter Bhat Harkins reported:</p> <blockquote> <p>Well, I received a keychron keyboard a few days ago. I ordered a used K1 v5 (Keychron does small, infrequent production runs so it was out of stock everywhere). I placed the order on KeyChron's official Amazon store, fulfilled by Amazon. After some examination, I've received a v4. It's the previous gen mechanical switch instead of the current optical switch. Someone apparently peeled off the sticker with the model and serial number and one key stabilizer is broken from wear, which strongly implies someone bought a v5 and returned a v4 they already owned. Apparently this is a common scam on Amazon now.</p> </blockquote> <p>In the other case, an anonymous reader created a Gmail account to use as a shared account for them and their partner, so they could get shared emails from local services. I know a number of people who've done this and this usually works fine, but in their case, after they used this email to set up a few services, Google decided that their account was suspicious:</p> <blockquote> <p>Verify your identity</p> <p>We’ve detected unusual activity on the account you’re trying to access. To continue, please follow the instructions below.</p> <p>Provide a phone number to continue. We’ll send a verification code you can use to sign in.</p> </blockquote> <p>Providing the phone number they used to sign up for the account resulted in</p> <blockquote> <p>This phone number has already been used too many times for verification.</p> </blockquote> <p>For whatever reason, even though this number was provided at account creation, using this apparently illegal number didn't result in the account being banned until it had been used for a while and the email address had been used to sign up for some services. 
Luckily, these were local services by small companies, so this issue could be fixed by calling them up. I've seen something similar happen with services that don't require you to provide a phone number on sign-up, but then lock and effectively ban the account unless you provide a phone number later, but I've never seen a case where the provided phone number turned out to not work after a day or two. The message above can be read two ways, the other way being that the phone number was allowed but had just recently been used to receive too many verification codes but, in recent history, the phone number had only once been used to receive a code, and that was the verification code necessary to attach a (required) phone number to the account in the first place.</p> <p>I also had a quality control failure from Amazon, when I ordered a 10 pack of Amazon Basics power strips and the first one I pulled out had its cable covered in solder. I wonder what sort of process could leave solder, likely lead-based solder (although I didn't test it) all over the outside of one of these and wonder if I need to wash every Amazon Basics electronics item I get if I don't want lead dust getting all over my apartment. And, of course, since this is constant, I had many spam emails get through Gmail's spam filter and hit my inbox, and multiple ham emails get filtered into spam, including the classic case where I emailed someone and their reply to me went to spam; from having talked to them about it previously, I have no doubt that most of my draft readers who use Gmail also had something similar happen to them and that this is so common they didn't even find it worth remarking on.</p> <p>Anyway, below, in a few cases, I've mentioned when commenters blame the user even though the issue is clearly not the user's fault. I haven't done this even close to exhaustively, so the lack of such a comment from me shouldn't be read as the lack of the standard &quot;the user must be at fault&quot; response from people.</p> <h3 id="google">Google</h3> <ul> <li><a href="https://twitter.com/danluu/status/1308215389344600066">&quot;I had to get the NY attorney general to write them a letter before they would actually respond to my support requests so that I could properly file my taxes&quot;</a></li> <li><a href="https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai">Google photo search for gorilla returns photos of black people</a>, fixed after Twitter thread about this goes viral; 3 years later, there are stories in the press about how Google fixed this by blocking search results for the terms &quot;gorilla&quot;, &quot;chimp&quot;, &quot;chimpanzee&quot;, and &quot;monkey&quot; and has not unblocked the terms <ul> <li>On 2024-01-06, I tried uploading a photo of a gorilla and searching for gorilla, which returned no results both immediately after the upload as well as a few weeks later, so this still appears to be blocked?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35521345">Google suspends a YouTuber for impersonating themselves</a>; on appeal YouTube says &quot;unfortunate, there's not more we can do on our end. your account suspension &amp; appeal were very carefully reviewed &amp; the decision is final ... we really appreciate your understanding&quot;. 
<ul> <li>Channel restored after viral Twitter thread makes it to the front page of HN.</li> </ul></li> <li><a href="https://twitter.com/danluu/status/800777306612449280">Two different users report having their account locked out after moving; no recovery of account</a></li> <li><a href="https://twitter.com/danluu/status/799102941923667968">Google closed the accounts of everyone who bought a phone and then sold it to a particular person who was buying phones, resulting in emails to their email address getting bounced, inability to auth to anything using Google sign-in, etc.</a>; at least one user whose account was a recovery account for someone who bought and sold a phone also had their accounted closed; Dans Deals wrote this up and people's accounts were reinstated after the story went viral</li> <li><a href="https://nitter.net/JustJake/status/1667478906591666176">Google Cloud reduces quota for user, causing an incident, and then won't increase it again</a> <ul> <li>User tries to find out what's going on and has this discussion: <ul> <li><b>GCP support</b>: You exceeded the rate limit</li> <li><b>User</b>: We did 5000/10min. The quota was approved at 18k/min</li> <li><b>GCP support</b>: That's not the rate limit</li> <li><b>User</b>: What's the rate limit</li> <li><b>GCP support</b>: Not sure have to check with that team</li> </ul></li> <li>So it seems like GCP added some kind of internal rate limiting that's stricter than the user's approved quota?</li> <li>A commenter responds with &quot;if you don’t buy support from GCP <em>you have no support</em>.&quot; and other users note that paying for support can also give you no support</li> </ul></li> <li><a href="https://taxpolicy.org.uk/2024/02/17/the-invisible-campaign-to-censor-the-internet/">Google accepts fake DMCA takedown requests even in cases that are very obviously fake</a> <ul> <li>An official Google comment on this is the standard response that there are robust processes for this &quot;We have robust tools and processes in place to fight fraudulent takedown attempts, and we use a combination of automated and human review to detect signals of abuse – including tactics that are well-known to us like backdating. We provide extensive transparency and submit notices to Lumen about removal requests to hold requesters accountable. Sites can file counter notifications for us to re-review if they believe content has been removed from our results in error. We track networks of abuse and apply extra scrutiny to removal requests where appropriate, and we’ve taken legal action to fight bad actors abusing the DMCA&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33737577">Small business app creator has everything shut down pending &quot;verification&quot; of Google Pay</a> <ul> <li>Support did nothing and GCP refused to look into it until this story hit #1 on HN, at which point someone looked into it and fixed it</li> </ul></li> <li><a href="https://grist.org/technology/right-to-repair-new-york-hochul-big-tech-lobbying-law/">Lobbying group representing Google, Apple, etc., is able to insert the language they want directly into a right to repair bill</a>, excluding many devices from the actual right to repair. <ul> <li>&quot;“We had every environmental group walking supporting this bill,” Fahy told Grist. 
“What hurt this bill is Big Tech was opposed to it.”&quot;</li> </ul></li> <li><a href="https://nitter.net/emilyldolson/status/1485434187968614411">File containing a single line with &quot;1&quot; in it restricted on Google Drive due to copyright infringement; appeal denied</a> <ul> <li>HN readers play around and find that files containing just &quot;0&quot; also get flagged for copyright violation</li> <li><a href="https://news.ycombinator.com/item?id=30060405">issue fixed after viral Twitter thread</a></li> </ul></li> <li>In 2016, Fark has ads disabled when a photograph of a clothed adult posted in 2010 is incorrectly flagged as child porn; appeals process takes 5 weeks <ul> <li>Fark notes that they had similar problems in 2013 because an image was flagged as showing too much skin</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32713375">Pixel 6 freezes when calling emergency services</a> <ul> <li><a href="https://www.reddit.com/r/GooglePixel/comments/rfld6m/pixel_prevented_me_from_calling_emergency/">a user notes that they reported the issue almost 4 years before this complaint on an earlier Pixel and the issue was &quot;escalated&quot; but was still an issue ~8 months before the previous complaint</a></li> <li><a href="https://www.reddit.com/r/GooglePixel/comments/r4xz1f/comment/hnrvsr1/?utm_source=share&amp;utm_medium=web2x&amp;context=3">A Google official account responded that the freeze was due to Microsoft Teams, but the user notes they've never used or even installed Microsoft Teams (there was an actual issue where Teams would block emergency calls, but that was not this user's issue)</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32560361">Account locked and information sent to SFPD after father takes images of son's groin to send to doctor, causing an SFPD investigation; SFPD cleared the father of any wrongdoing, but Google &quot;stands by its decision&quot;, doesn't unlock the account</a> <ul> <li>Google spokesperson says &quot;We follow US law in defining what constitutes CSAM and use a combination of hash matching technology and artificial intelligence to identify it and remove it from our platforms,&quot;<br /></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32547912">Google cloud suspends corporate account, causing outage; there was a billing bug and the owner of the account paid and was assured that their account wouldn't be suspended due to the bug, but that was false and the account got suspended anyway</a> <ul> <li><a href="https://news.ycombinator.com/item?id=32551587">HN commenter suggests that &quot;engineers that lack business experience&quot; reach out to their account managers once they have significant spend; multiple people respond and say that they've done this and it didn't help at all</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33737793">Company locked out of their own domain on Google Workspaces; support refused to fix this</a></li> <li><a href="https://news.ycombinator.com/item?id=32548808">Google cloud account suspended because someone stole the CC numbers for the corporate card and made a fraudulent adwords charge</a></li> <li><a href="https://nitter.net/FordFischer/status/1136334778670518273">Journalist's YouTube account incorrectly demonetized</a> <ul> <li>fixed after 7 months of appealing and a viral Twitter thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32237445">Ads account suspended; an educated guess is that some ML fraud signals plus 
using a Brex card led to the suspension</a> <ul> <li>card works when paying for many other Google services</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238158">Person's credit card stops working with Google accounts after using it to pay on multiple accounts</a> <ul> <li>guessed to be due to an incorrect anti-fraud check</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238092">Ads account suspended for &quot;suspicious payments&quot; even though the same card is used for many other Google payments, which are not suspended</a> <ul> <li>after multiple appeals that fail, the former Google employee talks to internal contacts to get escalations, which also fail and the ads account stays suspended</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32239118">Google Play account banned for no known reason</a> <ul> <li>the link Google provides to file the appeal can't be access with a banned account</li> <li>the user had two apps using one API, so it counted as two separate violations at once, so the account was banned for &quot;multiple violations&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238717">Google ads account for a small non-profit banned due to &quot;unpaid balance&quot;</a> <ul> <li>Balance reads $0.00 but appealing ban fails</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238041">Google ads account banned after account automatically switched to Japanese and then payment is made with an American card</a></li> <li><a href="https://news.ycombinator.com/item?id=32084980">Google sheet with public election information incorrectly removed for &quot;phishing&quot;</a> <ul> <li>restored after viral HN thread</li> </ul></li> <li><a href="https://nitter.net/miguelytob/status/1315749803041619981">User account disabled and photos, etc., lost with no information on why and no information for why the appeal was rejected</a> <ul> <li>ex-Google engineer unable to escalate to anyone who can restore account<br /></li> </ul></li> <li><a href="https://nitter.net/GarethEvansYT/status/1308795658162319361">10-year old YouTube channel with 120M views scheduled for deletion due copyright claims (no information provided to channel creator about what the copyright infringement was)</a> <ul> <li>channel eventually saved after Twitter thread went viral</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31432334">FairEmail and Netguard app developer removes apps after giving up on fight with Google over whether or not FairEmail is Spyware</a> <ul> <li>app later restored sometime after viral HN thread</li> </ul></li> <li><a href="https://nitter.net/hermux/status/1371383715381805061">App banned from Play store because a button says &quot;Report User&quot; and not &quot;Report&quot;</a></li> <li><a href="https://news.ycombinator.com/item?id=33739658">User gets banned from GCP for running the code on Google's own GCP tutorials</a></li> <li><a href="https://news.ycombinator.com/item?id=30912187">Youtube comment anti-spam considered insufficient</a>, so a user creates their own YT anti spam</li> <li><a href="https://nitter.net/garybernhardt/status/1336428596265504768">Search for product reviews generally returns SEO linkfarm spam and not useful results</a> <ul> <li>See also, <a href="seo-spam/">my post on the same topic</a></li> </ul></li> <li><a href="https://nitter.net/garybernhardt/status/1359925407819198473">Google account with thousands of dollars of apps banned from Google with no information on 
what happened and appeals rejected</a> <ul> <li>account eventually restored after viral Twitter thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28443244">Linux Experiments Youtube Channel deleted with no reason given</a> <ul> <li><a href="https://news.ycombinator.com/item?id=28448329">channel restored shortly after viral Twitter and HN threads</a></li> </ul></li> <li><a href="https://www.reddit.com/r/GooglePixel/comments/18wtcq6/pixel_7_pro_warranty_replacement_returned_with/">Warranty replacement Google Pixel 7 Pro is carrier locked to the wrong carrier and, even though user is in Australia, the phone is locked to a U.S. carrier</a> <ul> <li>User has gone to Google support 8 times over 1 month and Google support has incorrectly told user 8 times that the phone is unlocked, so user has had no usable phone for 1 month; the carrier the phone is locked to agrees that the phone is incorrectly carrier locked, but they can't do anything about it since the original purchaser of the phone would have to call the carrier, but apparently the warranty replacement is a locked, used, phone</li> <li>Possibly due to the reddit thread, Google support agrees to swap user's phone, but support continues to insist that the phone is not carrier locked</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38806650">Malware uses Google OAuth to hijack accounts</a> <ul> <li>Google claims they've mitigated this for all accounts that were compromised, which could be true</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33737772">GCP account suspended for no discernable reason after years of use</a> <ul> <li>Support was useless, but since the user used to work at Google, they emailed a former co-worker who sent an internal email, which caused the issue to get fixed immediately</li> </ul></li> <li><a href="https://www.reddit.com/r/OutOfTheLoop/comments/18tz779/whats_going_on_with_the_google_reviews_for/">Obviously fake Google reviews for movie not removed for quite some time</a> (obviously fake because many reviews copy+paste the exact same text)</li> <li><a href="https://news.ycombinator.com/item?id=38772187">Google doesn't detect obviously fake restaurant reviews</a> <ul> <li>I've noticed this as well locally — a new restaurant will have 100+ 5 star reviews, almost all of which look extremely fake; these reviews generally don't get removed, even years later</li> </ul></li> <li><a href="https://danfitdegree.hashnode.dev/nothing-has-ever-angered-me-more-than-the-google-play-team">Owner and developer at SaaS app studio 7 out of 100 apps (that use the same code) start getting rejected from app store</a> <ul> <li>The claimed reason is that the apps allow user generated content (UGC) and therefore need a way to block and report the content, but the apps already have this</li> <li>The developer keeps emailing support, explaining that they already have this and support keeps responding with nonsense like &quot;We confirm that your app ... does not contain functionality to report objectionable content ... For more information or a refresher, we strongly recommend that you review our e-learning course on UGC before resubmission.&quot;</li> <li>All attempts to escalate were also rejected, e.g., &quot;Can you escalate this?&quot; was responded to with &quot;Unfortunately, we do not handle this kind of concern. You may continue to communicate with the appropriate team for further assistance in resolving your issue. 
Please understand that I am not part of the review team so I'm not able to give further information about your concern. I again apologize for the inconvenience.&quot; and then &quot;As much as I'd like to help, I'm not able to assist you further. If you don't have any other concerns, I will now have to end our chat to assist other developers. I apologize and thank you for understanding. Have a great day. Bye!&quot;</li> <li><a href="https://news.ycombinator.com/item?id=33632886">Multiple developers suggest that instead of interacting with Google support as if anyone actually pays attention or cares, you should re-submit your app with some token change, such as incrementing an internal build number</a>; because Google's review process is nonsense, even serious concerns can be bypassed this way. The idea is that it's a mistake to think that the content of their messages makes any sense at all and that you're dealing with anything resembling a rational entity (<a href="https://news.ycombinator.com/item?id=33636755">see also</a>.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38649931">Google groups is a massive source of USENET spam</a></li> <li><a href="https://news.ycombinator.com/item?id=38649772">Google groups is a massive source of USENET spam</a></li> <li><a href="https://news.ycombinator.com/item?id=38649630">Google groups is a massive source of USENET spam</a></li> <li><a href="https://news.ycombinator.com/item?id=38650333">Google groups is a massive source of email spam; a Google employee put information about this into a ticket, which did not fix the issue, nor does setting &quot;can't add me to groups&quot;</a></li> <li><a href="https://mas.to/@kissane/111575585854309686">Google locks user out by ignoring authenticated phone number change and only sending auth text to old number</a></li> <li>I had an issue related to the above, where I was once locked out of Google accounts while traveling because I only took my code generator and left my 2FA tokens at home; this was in the relatively early days of 2FA tokens and I added the tokens to reduce the odds that I would be locked out, because the documentation indicated that I would need any of my 2FA methods to be available to not get locked out; in fact, this is false, and Google will sometimes only let you authenticate with specific methods, so adding more methods actually increases the chances you'll get locked out if your concern is that you may lose a method and then lose access to your account</li> <li><a href="https://www.techdirt.com/2023/12/12/google-promises-unlimited-cloud-storage-then-cancels-plan-then-tells-journalist-his-lifes-work-will-be-deleted-without-enough-time-to-transfer-the-data/">Google allows user to pay for plan with unlimited storage, cancels unlimited storage plan, and then deletes user's data</a> <ul> <li>Many HN commenters on the story tell the user they should've had other backups, apparently not reading the story, which notes that the user concurrently had a government agency take all of their hard drives</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38628019">Google closes company's Google Cloud account over 3 cent billing error, plus some other stories</a></li> <li><a href="https://old.reddit.com/r/youtube/comments/18gjiqy/youtube_doesnt_want_to_take_down_scam_ads/">YouTube doesn't take down obvious scam ads when reported, responding with &quot;We decided not to take this ad down. 
We found that the ad doesn’t go against Google’s policies&quot;</a></li> <li><a href="https://old.reddit.com/r/youtube/comments/18gjiqy/youtube_doesnt_want_to_take_down_scam_ads/kd1bb0c/">YouTube doesn't take down obvious scam ads</a></li> <li><a href="https://www.reddit.com/r/ContraPoints/comments/wqzvp1/my_matt_walsh_video_was_taken_down_due_to/">Incorrect YouTube copyright takedown</a></li> <li><a href="https://news.ycombinator.com/item?id=27999813">YouTube copyright claim for sound of typing on keyboard</a>; fixed after Twitter thread goes viral</li> <li><a href="https://news.ycombinator.com/item?id=29816310">Another YouTube copyright claim for sound of typing on keyboard</a>; again fixed after Twitter thread goes viral</li> <li><a href="https://news.ycombinator.com/item?id=29816310">User puts free music they made on YouTube, allowing other people to remix it; someone takes YouTube ownership of the music</a>, fixed after user, one of the biggest YouTubers of all time, creates a video complaining about this</li> <li><a href="https://news.ycombinator.com/item?id=33633339">Developer's app removed from app store for no discernible reason (allegedly for &quot;user privacy&quot;) and then restored for no discernable reason </a></li> <li><a href="https://news.ycombinator.com/item?id=28000791">YouTube copyright claim for white noise</a></li> <li><a href="https://news.ycombinator.com/item?id=38611919">YouTube refuses to take down obvious scam ad</a></li> <li><a href="https://news.ycombinator.com/item?id=38612261">YouTube refuses to take down scam ads for fake medical treatments</a></li> <li><a href="https://news.ycombinator.com/item?id=38611664">YouTube refuses to take down scam ads</a></li> <li><a href="https://news.ycombinator.com/item?id=38614652">Google doesn't take down obvious scam ads with fake download buttons</a> <ul> <li>Mitigated on user's site by hiring a firm to block these ads post-auction?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38612630">YouTube refuses to take down fraudulent ad after reporting</a></li> <li><a href="https://news.ycombinator.com/item?id=38614011">Personally reporting scam ads to an employee at Google who works in the area causes ads to get taken down for a day or two, but they return shortly afterwards</a></li> <li><a href="https://news.ycombinator.com/item?id=38612058">Google refuses to take down obvious scam ads after reporting</a></li> <li><a href="https://news.ycombinator.com/item?id=38614409">Google refuses to take down obvious scam ad, responding with &quot;We decided not to take this ad down. 
We found that the ad doesn’t go against Google’s policies, which prohibit certain content and practices that we believe to be harmful to users and the overall online ecosystem.&quot;</a></li> <li><a href="https://news.ycombinator.com/item?id=38611919">YouTube refuses to take down obvious real estate scam ad using Wayne Gretzky</a>, saying the ad doesn't violate any policy</li> <li><a href="https://mastodon.social/@gamingonlinux/111482420414024520">Straightforward SEO spam clone of competitor's website takes their traffic away</a></li> <li><a href="https://news.ycombinator.com/item?id=36273159">User had a negotiated limit of 300 concurrent BigQuery queries and then Google decided to reduce this to 100 because Google rolled out a feature that Google PMs and/or engineers believed was worth 3x in concurrent queries; user notes that this feature doesn't help them and that their query limit is now too low; talking to support apparently didn't work</a></li> <li><a href="https://news.ycombinator.com/item?id=36273877">User keeps having their tiny GCP instance shut down because Google incorrectly and nonsensically detects crypto mining on their tiny instance</a></li> <li><a href="https://news.ycombinator.com/item?id=36276292">User has limit on IPs imposed on them and the standard process for requesting IPs returned &quot;Based on your service usage history you are not eligible for quota increase at this time&quot;; all attempts to fix this via support failed</a></li> <li><a href="https://www.reddit.com/r/vancouver/comments/17ud65f/hikers_getting_lost_on_the_north_shore_using/">Google Maps gives bad directions to hikers who get lost</a></li> <li><a href="https://www.reddit.com/r/vancouver/comments/17op12f/search_and_rescue_team_warns_against_google_maps/">Search and rescue teams warn people against use of Google Maps</a></li> <li><a href="https://mastodon.social/@danluu/109679379784923986">Google's suggested American and British pronunciations of numpy</a></li> <li><a href="https://twitter.com/danluu/status/1172676081628766208">CEO of Google personally makes sure that a recruiter who accidentally violated Google's wage fixing agreement with Apple is fired and then apologizes to the CEO of Apple for the error</a></li> <li><a href="https://news.ycombinator.com/item?id=33635273">Developer's app rejected from app store and developer given the runaround for months</a> <ul> <li>They keep getting support people telling them that their app doesn't do X, so they send instructions on how to see that the app does do X; their analytics show that support never even attempted to run the instructions and just kept telling them that their app didn't do X</li> </ul></li> <li><a href="https://twitter.com/danluu/status/1147202351398133760">One of many examples of Google not fixing Maps errors when reported, resulting in people putting up a sign telling users to ignore Google Maps directions</a> <ul> <li><a href="https://twitter.com/danluu/status/1147236977885904896">Some more examples here</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38145054">SEO spam of obituaries creates cottage industry of obituary pirates</a></li> <li><a href="https://news.ycombinator.com/item?id=33636097">Malware app stays up in app store for months after app is reported to be malware</a> <ul> <li>The app now seems to be gone, but archive.org indicates that the app was up for at least six months after this person noted that they reported this malware which owned their parents</li> </ul></li> <li><a 
href="https://twitter.com/danluu/status/1285408780688080896">User reports Google's accessible audio captchas only let you do 2-3 before banning you and making you do image-based captchas, making Google sites and services inaccessible to some blind people</a></li> <li><a href="https://mastodon.social/@varx@infosec.exchange/111349057429167617">User gets impossible versions of Google's ReCaptcha, making all sites that use ReCaptcha inaccessible; user is unable to cancel paid services that are behind ReCaptcha and is forced to issue chargebacks to stop payment to now-impossible to access services</a></li> <li><a href="https://news.ycombinator.com/item?id=38038713">User can't download India-specific apps while in India because Google only lets you change region once a year</a></li> <li><a href="https://nitter.net/Plamen__Andonov/status/1687524213786103815">3 year old YouTube channel with 24k subs, 100 videos, and 400 streams deleted, allegedly for saying &quot;Don't hold your own [bitcoin] keys&quot;</a>, which was apparently flagged as promoting illegal activity <ul> <li>YouTube responds with &quot;we've forwarded this info to the relevant team and confirmed that the channel will remain suspended for Harmful or dangerous content policies&quot; and links to a document; the user asks what content of theirs violates the policies and why, if the document says that you get 3 strikes your channel is terminated, the account was banned without getting 3 strikes; this question gets no response</li> </ul></li> <li><a href="https://www.reddit.com/r/tahoe/comments/11b02ic/google_maps_is_basically_telling_you_to_take/">Snow closure of highway causes Google Maps to route people to unplowed forest service road with 10 feet of snow</a></li> <li><a href="https://news.ycombinator.com/item?id=33636404">Google play rejects developer's app for nonsense reasons, so they keep resubmitting it until the app doesn't get rejected</a></li> <li><a href="https://www.reddit.com/r/vancouver/comments/qv7jh3/forest_service_roads_access_between_lillooet_and/">Washed out highways due to flooding causes Google Maps to route people through forest service roads that are in even worse condition</a></li> <li><a href="https://www.reddit.com/r/Kamloops/comments/13tcsyp/google_maps_has_problems_here/">Google routes people onto forest service roads that need an offroad vehicle to drive</a>; users note that they've reported this, which does nothing</li> <li><a href="https://news.ycombinator.com/item?id=38150657">Google captchas assume you know what various American things are regardless of where you are in the world</a></li> <li><a href="https://cofense.com/blog/google-amp-the-newest-of-evasive-phishing-tactic/">Google AMP allows phishing campaigns to show up with trusted URLs</a> <ul> <li>People warned Google engineers that this would happen and that there were other issues with AMP, but <a href="https://twitter.com/danluu/status/891508449414197248">the response from Google was that if you think that AMP is causing you problems, you're wrong and the benefit you've received from AMP is larger than the problems it's causing you</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36733735">User reports that chrome extension, after getting acquired, appears to steal credit card numbers</a> and reviews indicate that it now injects ads and sometimes (often?) 
doesn't work <ul> <li>6 months ago, user tried to get the extension taken down, but this seems to have failed (the Firefox extension is also still available)</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36406015">User has their Google account banned after updating their credit card to replace expiring credit card with new credit card</a> (both credit cards were from the same issuer, had the same billing address, etc.)</li> <li><a href="https://news.ycombinator.com/item?id=32211417">Reporting a spam YouTube comment does nothing</a></li> <li><a href="https://www.bbc.com/news/technology-56886957">BBC reports bad ads to Google and Google claims to have fixed the issue with an ML system, but follow-up searches from the BBC indicate that the issue isn't fixed at all</a></li> <li><a href="https://news.ycombinator.com/item?id=36405519">User signs up for AdSense and sells thousands of dollars of ads that Google then doesn't allow the user to cash out</a> <ul> <li>This is a common story that I've seen hundreds of times. Unsurprisingly, multiple people respond and say the same thing happened to them and that there's no recourse when this happens.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36335975">User has their Google (Gmail) account locked for no discernable reason; account recovery process and all appeals do nothing</a> <ul> <li>For unknown reasons, after two years, the account recovery process works and the account is recovered</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36336256">User has their Google Pay account locked for &quot;fraud&quot;; there's a form you're supposed to fill out to get them to investigate, which did nothing three times</a> <ul> <li>User had their phone through Google Fi, email through Gmail, DNS via Google, etc., all of which stopped working</li> <li>A couple years later, their accounts started working again for no discernable reason</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36336239">User gets locked out of Gmail despite having correct password and access to the recovery email (Gmail tells user their login was suspicious and refuses to let them log in)</a> <ul> <li>I've had this happen to me as well when I had my correct password as well as a 2FA device; luckily, things started working again later</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36337103">User can't get data out of Google after Google adds limit on how much data account can have</a></li> <li><a href="https://mastodon.social/@HalvarFlake/110531473438621147">User notes that they're only able to get support from Google because they used to work there and know people who can bypass the normal support process</a></li> <li><a href="https://news.ycombinator.com/item?id=36255243">Google takes down developer's Android app, saying that it's a clone of an iOS app; app was making $10k/mo</a> <ul> <li>Developer finds out that the app Google thinks they're cloning is their own iOS app</li> <li>Developer is able to get unbanned, but revenue never recovers and settles down to $1k/mo. 
Developer stops developing Android apps</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36195901">User finds that if they use &quot;site:&quot; &quot;wrong&quot;, Google puts them into CAPTCHA hell</a> <ul> <li><a href="https://news.ycombinator.com/item?id=36197882">Another user notes that this happens to them with other query modifiers</a></li> </ul></li> <li><a href="https://palant.info/2023/05/31/more-malicious-extensions-in-chrome-web-store/">Reporting malware Chrome extensions doesn't get them taken down</a>, although some do end up getting taken down after a blog post on this goes viral</li> <li><a href="https://news.ycombinator.com/item?id=36124415">User accidentally gains admin privileges to many different companies' Google Cloud accounts and can't raise any kind of support ticket to get anyone to look at the problem</a> <ul> <li><a href="https://news.ycombinator.com/item?id=36125825">Multiple people respond and tell stories about how bad Google's paid support is compared to AWS support</a></li> </ul></li> <li><a href="https://lauren.vortex.com/2023/05/17/google-account-recovery-failure-sad">15 year old Gmail account lost with no recovery possible</a> <ul> <li>Someone who helps many people with recovery says &quot;they've all basically hit the brick wall of Google suggesting that at their scale, nothing can be done about such 'edge' cases&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36025239">Google account lost despite having proper auth because Google deems login attempts too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36027316">Google account lost despite having proper auth and access to backup account because Google deems login attempts too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36026759">Google account lost despite having proper auth because Google deems login to be too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36027151">Google account lost despite having proper auth and TOTP because Google deems login to be too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36030983">Google account lost despite proper auth because Google deems login to be too suspicious</a> <ul> <li>Person notes that they can log in when they travel back to the city they used to live in, but they can't log in where they moved to</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36026380">Google account lost despite proper auth for no known reason</a> <ul> <li>Account login restored for no known reason a few months later</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35749825">User tries to log into Gmail account and gets ~20 consecutive security challenges</a>, after which Gmail returns &quot;You cannot log in at this time&quot;, so their account appears to be lost</li> <li><a href="https://www.reddit.com/r/AskReddit/comments/12rt24x/comment/jgya3dn/">Google changes terms of service and reduces user's quota from 2TB to 15GB</a>, user is unable to find any way to talk to a human about this and is forced to pay for a more expensive plan to not lose their data <ul> <li><a href="https://news.ycombinator.com/item?id=35521606">YouTube account with single video and no comments banned for seemingly no reason</a>, support requests do nothing</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35521586">Huge YouTube channel shut down</a> <ul> <li>Someone defends this as the correct action because &quot;Their account 
got session jacked and taken over by a crypto scamming farm. Google was in the right to shut down the account until it could get resolved.&quot; <ul> <li>Someone who is actually familiar with what's going on notes that this is nonsense, &quot;Their account was shut down days <em>after</em> the crypto scam issue was resolved. They discussed it on the WAN show from the week before last.&quot;</li> </ul></li> </ul></li> <li><a href="https://issuetracker.google.com/issues/268606830?pli=1">Many users run into serious problems after Google decides to impose 5M file limit on Google Drive without warning</a> <ul> <li>Google support replies with &quot;I reviewed your case here on our end including the interactions with the previous representatives. This case has already been endorsed to one of our account specialists. What they found out is that the error is working as intended&quot;</li> <li><a href="https://news.ycombinator.com/item?id=35331890">On HN, the top comment is a Google engineer responding to say</a> &quot;I don't personally think that there are reasonable use-cases for human users with 5 million files. There may be some specialist software that produces data sets that a human might want to back up to Google Drive, but that software is unlikely to run happily on drive streamed files so even those would be unlikely to be stored directly on Drive.&quot; and multiple people agree, despite the issue itself being full of people describing how they're running into issues <ul> <li><a href="https://news.ycombinator.com/item?id=35396184">Someone notes that Google Drive advertises storage tiers up to 30TB</a>, so 5M files at 30TB would average 6MB per file, not really a weird edge case of someone generating a bunch of tiny files or anything like that</li> <li><a href="https://news.ycombinator.com/item?id=35397520">Another user responds that their home directory contains almost 5M files</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35329135">The top HN reply to the #2 comment is a second Google engineer saying that Google Drive isn't for files (and that it's for things like Google Docs) and that people shouldn't be using it to store files</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35329135">Someone notes that Google's own page for drive advertises it as a &quot;File Sharing Platform&quot;</a>; this doesn't appear to have changed since, as of this writing, the big above-the-fold blurb on Google's own page about drive is that you can &quot;Store, share, and collaborate on files and folders from your mobile device, tablet, or computer&quot;. 
Unsurprisingly, <a href="https://news.ycombinator.com/item?id=35334560">users indicate that they think Google Drive is for files</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35329654">In low-ranked HN comments, multiple people express surprise that Google didn't bother to contact the users who would be impacted by this change before making it</a></li> <li>This Google engineering attitude of &quot;this is how we imagine users use our product, and if you're using it differently, even if that's how the product is marketed, you're using it wrong&quot; was very common when I worked at Google and I see that it hasn't changed.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35324218">Chrome on Android puts tabloid stories and other spam next to frequently used domains/links</a></li> <li><a href="https://news.ycombinator.com/item?id=34849083">Google pays Apple to not compete on search</a></li> <li><a href="https://news.ycombinator.com/item?id=34916922">Google search has been full of scam ads for years</a></li> <li><a href="https://news.ycombinator.com/item?id=34917701">r/blender repeatedly warns people over many months to not trust results for blender since top hit is malware</a>.</li> <li><a href="https://news.ycombinator.com/item?id=34918052">Rampant nutritional supplement scam advertising on Google</a></li> <li><a href="https://nitter.net/doctorow/status/1628948906657878016">Top search result for local restaurant is a scam restaurant</a></li> <li><a href="https://twitter.com/altluu/status/1461097423883972608">User reports that a massive amount of obvious phishing and spam content makes it through Gmail's spam filter</a>, such as an ad that either steals your payment info or gets you to buy super overpriced gold</li> <li><a href="https://news.ycombinator.com/item?id=35060972">High-ranking / top Google results for many pieces of software are malware that pays for a high-ranking ad slot</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35061544">Reporting this malware doesn't seem effective</a> and <a href="https://news.ycombinator.com/item?id=35061707">the same malware ads can persist for very long periods of time</a> unless someone contacts a Google engineer or makes a viral thread about the ad</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35061704">Top result for Zoom installer is an ad that tries to get you to install badware</a></li> <li><a href="https://news.ycombinator.com/item?id=35061831">User sees a huge number of scam ads on YouTube</a></li> <li><a href="https://news.ycombinator.com/item?id=35067385">User sees a huge number of scam ads on YouTube</a></li> <li><a href="https://news.ycombinator.com/item?id=35036142">User's list of wedding vendors they're using to organize a wedding tagged as phishing</a> and user is warned for violating Google Drive's phishing policy <ul> <li>User tried to get more information but found no way to do so</li> </ul></li> <li><a href="https://fleetdm.com/securing/tales-from-fleet-security-google-groups-scams">Corp security notes that it's very easy to send phishing emails to employees of a corporation by passing them through Google Groups</a></li> <li><a href="https://news.ycombinator.com/item?id=34441697">Google account lost because 2FA recovery process doesn't work</a> <ul> <li>User lost their Google Authenticator 2FA when their phone broke. 
They have their backup recovery codes, but this only lets them log into their account (and uses up a code forever when logging in); after logging in, this does not enable them to change their 2FA, so each login is a countdown to losing their account</li> <li>In the HN comments, some people walk them through the steps to change their 2FA when using backup codes, which works for other users but not this user — the user believes that some kind of anti-fraud system suspects the user is fraudulent, which limits what kinds of 2FA can be used to change 2FA, requiring the original and now lost 2FA to change 2FA, making the recovery codes useless; in standard internet comment style, some people keep telling the user that this works and the user should simply do the steps that work, even though the user has explained multiple times that this does not work for them</li> <li>Someone suggests buying Google One support, but <a href="https://news.ycombinator.com/item?id=34444274">someone else notes that Google One support appears to be very poor even though it's paid support</a>, and people have noted on many other threads that even cloud support can be useless when spending millions, tens of millions, or hundreds of millions a year, so the idea that you'll get support from Google because you pay for it isn't always correct</li> <li>Multiple people have reported the exact same issue and many people report that their mitigation for this is to store the 2FA secrets in their password manager; they know that this means that a computer and/or password manager compromise defeats their 2FA, but they feel that's better than randomly losing their account because the 2FA backup codes can simply not work if Google decides that they're too suspicious <ul> <li><a href="https://news.ycombinator.com/item?id=34443300">Someone suggests setting up multiple Yubikeys to prevent this issue</a>. 
That sounds logical, but I've done this and I can report that it does not prevent this issue — I added multiple 2FA tokens in order to reduce the chance that losing 2FA tokens would cause me to get locked out; at one point, Google became suspicious of the 2FA token I used to log in almost every time and required me to present another 2FA token, making my idea of having multiple 2FA tokens to reduce the risk of a lockout actually backfire since, if Google becomes suspicious of the wrong 2FA tokens, losing any one out of N 2FA tokens could cause my account to become lost</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33692942">User loses Gmail account after Google system decides the phone numbers they've been using for verification &quot;cannot be used for verification&quot;</a> <ul> <li><a href="https://news.ycombinator.com/item?id=33693183">Another user looks into it and finds that Google's official support suggestion is to create another account</a>, so the anti-fraud false positive means that this person lost their Gmail account</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34443237">User locked out of account after password change</a>; user is sure they're using the correct password because they're using a password manager <ul> <li>As with the above cases, the password reset flow doesn't work; after five weeks of trying daily, doing the exact same steps as each other time worked, so the account was only lost for five weeks and not forever</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34443249">User complains that their Google accounts have been moderately broken for 10 years due to forced Google+ migration in 2013 that left their account in a bad state</a></li> <li><a href="https://news.ycombinator.com/item?id=34116361">User locked out of Google after changing password</a> <ul> <li>Google asks the user to enter the new and old password, which the user does, but this doesn't enable logging in</li> <li>Google sometimes asks the user to scan a QR code from a logged in account, but the user can't do this because they can't log in</li> <li>User changed passwords to their main and recovery accounts at the same time, so they're locked out of both accounts</li> <li>For no discernable reason, repeatedly trying to get into the recovery account eventually worked, which allowed them to eventually get back into their main account</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34119045">User gets locked out of Gmail account when Gmail starts asking for 10+ year old password as well as current password to log in</a> <ul> <li>User finds a suggestion on an old support forum to not try to log in for 40+ days and then try, at which point the user is only asked for their current password and can log in <ul> <li>This is clearly not a universal solution as there are examples of people who try re-logging in every year to lost accounts, which usually doesn't work, but this apparently sometimes works?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34118359">Someone posts the standard reply about how you shouldn't expect to get support unless you pay for Google One</a>, apparently ignoring how every time someone posts this, people respond to note that Google One support rarely fixes problems like these</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31681221">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password and requires access to recovery 
email address, which has lapsed</a> <ul> <li><a href="https://news.ycombinator.com/item?id=31681512">A comment blaming the user from someone who apparently didn't read the post</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34116750">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password and requires an old phone number which is no longer active</a> <ul> <li>This turned out to be another case where waiting a long time and then trying to log in worked</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34118527">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password and requires an old phone number which is no longer active</a> <ul> <li>In this case, waiting a long time and then trying to log in didn't work and the account seems permanently lost</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34116894">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password</a>; user has the recovery email as well, but that doesn't help <ul> <li>After three years of trying to log in every few months, logging in worked for no discernable reason, so the account was only lost for three years</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34117206">Google gives away user's Google Voice number, which they were using daily and had purchased credits on that were also lost</a> <ul> <li><a href="https://news.ycombinator.com/item?id=34117467">Someone who apparently didn't read the post suggests to the user they shouldn't have let the number be inactive for 6 months or they should've &quot;made the number permanent&quot;</a></li> <li>Support refuses to refund user for credits and user can't get a new Google Voice number because the old one is still somehow linked to them and is considered a source of spam</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34116640">User loses Gmail account when recovery account token doesn't work</a></li> <li><a href="https://news.ycombinator.com/item?id=34134130">User loses Gmail account when credentials stop working for no discernible reason</a></li> <li><a href="https://old.reddit.com/r/tifu/comments/zndbku/tifu_by_accidentally_buying_two_google_pixels_and/">User has an issue with Google and talks to support; support tells user to issue a chargeback, which results in user's account getting banned and user losing 15 years of account history</a></li> <li><a href="https://news.ycombinator.com/item?id=33963269">User is in the middle of getting locked out of Google accounts and makes a viral post to try to get a human at Google to look at the issue</a></li> <li><a href="https://nitter.net/id_aa_carmack/status/1598391619673358342">John Carmack complains about having &quot;a ridiculous time&quot; with Google Cloud, only getting his issue resolved because he complained on Twitter and is one of the most famous programmers on the planet</a>; he decided to move to another provider after the second time this happened</li> <li><a href="https://news.ycombinator.com/item?id=33361311">Developer documents years of incorrect Google Play Store policy violations and how to work around them</a> <ul> <li><a href="https://news.ycombinator.com/item?id=33362178">Someone claiming to have worked on the Google Play Store team says</a>: &quot;a lot of that was outsourced to overseas which resulted in much slower response time. Here stateside we had a lot of metrics in place to fast response. 
Typically your app would get reviewed the same day. Not sure what it's like now but the managers were incompetent back then even so.&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33363079">Developer notes that they sometimes get updates rejected from the Google Play Store and have even had their existing app get delisted, but that the algorithm for this is so poor that you can make meaningless changes, which has worked for getting them relisted every time so far</a></li> <li><a href="https://news.ycombinator.com/item?id=33361774">Developer banned from Google Play, but they win the appeal</a> <ul> <li>However, the Name / Namespace (com.company.project) continues to be blocked, so they'd have to refactor the app and change the product and company name to continue using Google Play</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33361789">Developer describes their process of interacting with the Google Chrome Webstore</a>, which involves so much kafkaesque nonsense that they have semi-automated handling of the nonsense they know they'll encounter</li> <li><a href="https://news.ycombinator.com/item?id=33362453">Developer has comical, sub-ChatGPT level interactions with &quot;Chrome Web Store Developer Support&quot;</a> (see link for multiple examples)</li> <li><a href="https://twitter.com/PointCrow/status/1587084876741689345">User complains about repeated nonsense video demonetization and age limiting</a>, such as this <a href="https://www.youtube.com/watch?v=I0QFtxRgkYw">I ate water with chopsticks</a> video getting a strike against it for &quot;We reviewed your content carefully, and have confirmed that it violates our violent or graphic content policy&quot;, with a follow-up of &quot;your video was rated [sic] limited by ML then mistakenly confirmed by a manual reviewer as limited .... we've talked to the team to ensure it doesn't happen again&quot;, but of course this keeps happening, which is why the user is complaining (the complaint comes after the video was restricted again and the appeal was denied twice, despite the previous comment about how YouTube would ensure this doesn't happen again).</li> <li><a href="https://twitter.com/TheGoodDeath/status/1580949485877764096#m">User has YouTube video incorrectly taken down for violating community guidelines</a>, but it gets restored after they and <a href="https://twitter.com/ContraPoints/status/1581072538016174082#m">another big YouTuber both write viral Twitter threads about the incorrect takedown</a></li> <li><a href="https://twitter.com/hillelogram/status/1586740840986296322">User notes that Gmail's spam filtering appears to be getting worse</a> <ul> <li>I remember this one because, when this user complained about it, I noticed that I was getting multiple spam emails per day (with plenty of false positives when I checked my spam folder as well)</li> <li><a href="https://twitter.com/ryanoneill/status/1586808096264839168">This complaint from a user was also memorable to me since I was getting the exact same spam as this user</a></li> </ul></li> <li><a href="https://www.reddit.com/r/IdiotsInCars/comments/18zelw9/comment/kgmrpyf/">User notes that Google (?) 
consistently sends you the wrong way into a highway offramp</a></li> <li><a href="https://twitter.com/summoningsalt/status/1575920487171252225?s=46&amp;t=o85YJpDSpmrZZIh_S0W1dQ">User's video on the history of megaman speedruns becomes age restricted, which also mostly demonetizes it?</a> <ul> <li>User appeals, and 45 minutes later, they get a response saying &quot;after careful review, we've confirmed that the age restriction on your video won't be lifted&quot; (video is 78 minutes long) <ul> <li>User then quotes YouTube's own guidelines to show that their video doesn't appear to violate the guidelines</li> </ul></li> <li>User tweets about this, and then YouTube replies saying they lifted the age restriction, but the video stopped getting recommended, so the video was still not making money (this user makes a living off YouTube videos)</li> <li>8 days later, the video is officially age restricted again, and they say that the previous reversal was an error</li> <li>User then makes a video about this and tweets about the video, which then goes viral.</li> <li>YouTube then responds after the tweet about getting the runaround goes viral, with &quot;really sorry again that this was such a confusing / back and forth experience 😞. we’ve shared your video with the right people &amp; if helpful, keep sharing more w/ our community outreach team on that same email too!!&quot;</li> </ul></li> <li>When Jamie Brandon was organizing a database conference, Gmail spamfiltered the majority of emails he sent out about it <ul> <li>~700 people signed up to be notified when tickets were available, but even though they explicitly signed up to get notified, Gmail still spamfiltered Jamie's email</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28216733">Author publishes a book about their victimization and sex crimes; Google then puts a photo of someone else in an automatically generated profile for the author</a> <ul> <li>&quot;After spending weeks sending feedback and trying to get help from Google support, they finally deleted the woman’s photo, but then promptly replaced it with another Andrea Vassell who is a pastor in New York. She, the pastor in New York, wrote to me that she has been 'attacked' because people believe she is me.&quot;</li> <li>That the person was a pastor of a church also caused problems for some people mentioned in the book; author again tries to get the photo removed, which eventually works, but it is then replaced by the photo of a man who'd been fired for threatening the author, and then months later, the pastor's photo showed up again as the author</li> <li>Author appears to be non-technical and found HN and is writing a desperate plea for someone to do something about it</li> <li>A Google employee whose profile reads &quot;Google's public liaison of Search, helping people better understand search &amp; Google better hear public feedback&quot;, responds with &quot;I'll share more about how you can better get this feedback to us ... [explanation of knowledge panels] ... Normally people just don't like the image we show, so we have a mechanism for them to upload a preferred image. That's very easy to use. But in your case, I understand your reasons for not wanting to have an image used at all. 
I believe if you had filed feedback explaining that, the image would have been removed.&quot;</li> <li>Author is dumbfounded given her lengthy explanation of how much feedback she has already provided and responds with &quot;Are you suggesting that I did not send feedback through the appropriate channels? I have dozens of email exchanges with Google, some of which have multiple people copied on them, and I have screenshots of me sending feedback through your feedback link located within the knowledge panel. (And I explained my situation to them with more detail than I have explained here.). In April and May, I received email responses from Google employees who work for the knowledge panel support team. After they changed the photo twice to images of the wrong women instead of deleting them, I continued complaining and they suggested I contact legal removals. When I contacted legal, I received automated responses to contact the knowledge support team. So I was bounced around. They then began ignoring me and I started receiving automated responses from everyone. Even though I was being ignored, on any given day, I would wake up and find a different photo presented alongside my book. I also reached out to you, Danny Sullivan, <em>directly</em>&quot;</li> </ul></li> <li><a href="https://www.gregegan.net/ESSAYS/GOOGLE/Google.html">Famous sci-fi author spends years trying to get Google to stop putting photos of other people in their &quot;knowledge panel&quot;</a> <ul> <li>This seems to currently be fixed, and it only took between five years and a decade to fix it.</li> </ul></li> <li><a href="https://farside.link/nitter/hacks4pancakes/status/1337208086910668800">User notes that knowledge panel for them is misleading or wrong</a>, and that attempts to claim the knowledge panel to fix this have failed</li> <li><a href="https://hristo-georgiev.com/google-turned-me-into-a-serial-killer">Google knowledge panel for person incorrectly states that they are &quot;also known as The Sadist ... a Bulgarian rapist and serial killer who murdered five people... &quot;</a> <ul> <li>Fixed after a story about this makes it to #1 on HN</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=27623387">User notes that Google's knowledge panels about business often contain incorrect information even when the website for the business has correct information</a></li> <li><a href="https://twitter.com/danluu/status/1514019521123663873">Company reaches out to candidate about a job, eventually giving them an offer. The offer acceptance reply in email is marked as spam by everyone at the company</a> <ul> <li>On looking in the spam folder, one user at the company (me) finds that 19 out of 20 &quot;spam&quot; emails are actually not spam. Other users check and find a huge amount of important email is being classified as spam.</li> <li>Google support responds with what appears to be an automated message which reads &quot;Hi Dan. Our team is working on this issue. 
Meanwhile, we suggest creating a filter by selecting 'Never send it to spam' to stop mail from being marked as spam&quot;, apparently suggesting that everyone with a corp gmail account disable spam filtering entirely by creating a filter that disables the spam filter <ul> <li>One person responds and says they actually did this because they were getting so much important email classified as spam</li> </ul></li> </ul></li> <li><a href="https://mastodon.social/@danluu/109919293479394809">&quot;Obvious to humans&quot; spam gets through Gmail's spam filter all the time while also classifying &quot;ham&quot; as &quot;spam&quot;</a></li> <li>I emailed a local window film installer and their response to me, which quotes my entire email, went straight to spam <br /></li> </ul> <h3 id="facebook-meta">Facebook (Meta)</h3> <ul> <li><a href="https://nitter.net/FordFischer/status/1302391363456176128">Journalist's account deleted and only restored after Twitter thread on deletion goes viral</a></li> <li><a href="https://nitter.net/RMac18/status/1382366931307565057">Facebook moderator notes there's no real feedback or escalation path between what moderators see and the people who set FB moderation policy</a></li> <li><a href="https://news.ycombinator.com/item?id=31707349">User banned from WhatsApp with no reason given</a> <ul> <li>appeal resulted in a generic template response</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31707349">Instagram user can no longer interact with account</a> <ul> <li>would like to remove account, but can't because login fails</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31708987">Multiple users report they created a FB account so they can see and manage FB ads; accounts banned and can no longer manage ads</a></li> <li><a href="https://news.ycombinator.com/item?id=31577325">User banned after FB account hacked</a> <ul> <li>account restored after viral HN story</li> </ul></li> <li>On a local FB group, user posts &quot;Looking for some tech advice (admins delete if not allowed)... my Instagram account was hacked and I have lost all access to that account. The guy is still posting as me daily and communicating to others as me in messages (its a bitcoin scam). Does anyone know how I can communicate with Instagram directly? There does not appear to be any way to contact them and all the instructions I've followed lead me nowhere bc I have completely lost access to that account! 
😫 Thank you!&quot; <ul> <li>Someone suggests Instagram's instructions for this, <a href="https://help.instagram.com/368191326593075">https://help.instagram.com/368191326593075</a>, but user replies and says that these didn't work because &quot;I did all that but unfortunately the hacker was in my email and and verified all the changes before I noticed&quot;</li> <li>I replied and suggested searching LinkedIn for a connection to an employee, since the only things that work are internal escalation or going viral</li> </ul></li> <li><a href="https://social.lol/@robb/111704215593992932">Facebook incorrectly reports a user to DigitalOcean for phishing for a blog post they wrote</a> <ul> <li>DigitalOcean sends them an automated message saying that their droplet (machine/VM) will be suspended if they don't delete the offending post within 24 hours</li> <li>user appeals and the appeal goes through; unclear if it would've gone through without the viral HN thread about this</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38882891">User banned from FB marketplace for &quot;violating community guidelines&quot; after posting an ad for a vacuum</a> <ul> <li>user appeals multiple times and each appeal is denied, ending with &quot;Unfortunately, your account cannot be reinstated due to violating community guidelines. The review is final&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38881698">Reporting post advocating for violence against a person does nothing</a></li> <li><a href="https://news.ycombinator.com/item?id=38881730">Reporting post where one user tells another user to kill themselves does nothing</a></li> <li><a href="https://news.ycombinator.com/item?id=38882156">Murdered person is flooded with racist comments; friends report these, which does nothing</a></li> <li><a href="https://erinkissane.com/meta-in-myanmar-full-series">40,000-word series of articles by Erin Kissane that I'm not going to attempt to summarize</a></li> <li><a href="https://news.ycombinator.com/item?id=38613594">Facebook doesn't take down obvious scam ads after reporting them</a></li> <li><a href="https://news.ycombinator.com/item?id=38613594">User stops reporting obvious scam ads to Facebook because they never remove them</a>, always saying that the ad didn't breach any standards</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18etniq/someone_hacked_my_late_aunts_facebook_and_fake/">Takeover of dead person's Facebook account to run scams</a></li> <li>See &quot;Kamantha&quot; story in body of post</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18etniq/comment/kcq1vj0/">Facebook refuses to do anything about account that was taken over and constantly posts scams</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18etniq/comment/kcqiuqy/">Facebook refuses to do anything about fake page for business</a></li> <li><a href="https://news.ycombinator.com/item?id=38269276">Reporting scammer on Facebook does nothing</a></li> <li><a href="https://www.youtube.com/watch?v=SMFCLpUuhqY">Paying for &quot;creator&quot; level of support on Facebook / Insta appears to be worthless</a> <ul> <li>The review is that support is sort of nice, in that you get connected to a human being who isn't reading off a script, but also useless. At one point Jonny Keeley had a video that didn't upload and support's recommendation was to try editing the video again and uploading it again. 
Keeley asked support why that would fix it and the answer was basically, there's no particular reason to think that it might fix it, but it might also just work to re-upload the video. Another time, Keeley got &quot;hacked&quot; and went to support. Support once again responded quickly, but all they did was send him a bunch of links that he thinks are publicly available. Keeley was hoping that support would fix the issue, but instead they appear to have given him information he could've googled.</li> </ul></li> <li><a href="https://www.cnn.com/2023/11/08/tech/meta-facebook-instagram-teen-safety/index.html">Zuckerberg rejected proposals to improve teen mental health from other FB execs</a> <ul> <li>article notes &quot;that a lack of investment in well-being initiatives meant Meta lacked 'a roadmap of work that demonstrates we care about well-being.'&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=37938866">Malicious Google ad from Google-verified advertiser</a>; ad only removed after major publication writes a story about it <ul> <li>A user notes that something that amplifies the effectiveness of this attack is that Google allows advertisers to show fake domains, which is necessary for them to do tracking as they currently do it and not show some weird tracking domain</li> </ul></li> <li><a href="https://lerner.co.il/2023/10/19/im-banned-for-life-from-advertising-on-meta-because-i-teach-python/">User gets lifetime ban from running ads because they ran ads for courses teaching people to use pandas (the Python library)</a> <ul> <li>User hits appeal button on form and is immediately banned for life. <a href="https://news.ycombinator.com/item?id=37949763">Someone notes that the appeal button is a trap and you should never hit the appeal button???</a>. Apparently you should <a href="https://news.ycombinator.com/item?id=37254898">fill out some kind of form that you won't be able to fill out if you hit the appeal button and are immediately banned</a>?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=37944695">User notes pervasive scam ads</a></li> <li><a href="https://nitter.net/JakeMooreUK/status/1680962682726363136">You can deactivate anyone's WhatsApp account by sending an email asking for it to be deactivated</a> <ul> <li>This is sort of the opposite of all those scam FB accounts where reporting that the account is scamming does nothing</li> </ul></li> <li><a href="https://www.threads.net/@chancerydaily/post/CuxYO8IR8Ea?igshid=NTc4MTIwNjQ2YQ%3D%3D">User has innocuous Threads message removed for &quot;violating Community guidelines&quot;</a>, and then asks why there's so much spam that doesn't get removed but their message gets removed</li> <li><a href="https://www.threads.net/@parkermolloy/post/CuxTZD0vy52?igshid=NTc4MTIwNjQ2YQ%3D%3D">User has Threads message removed with message saying that it violates community guidelines; message is a reply to themselves that reads &quot;(Also, please don't respond to this with some 'well, on the taxpayer funding front, I think they have a point...' stuff. If you can read an article that highlights efforts to push people like me out of society and your takeaway is 'Huh, I think those people have a point!' then I'd much rather you not comment at all. 
I &quot;</a> <ul> <li>Like many others, user notes that they've repeatedly reported messages that do actually violate community guidelines and these messages don't get removed</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36359508">Rampant fraud on Instagram, Facebook, and WhatsApp</a></li> <li><a href="https://jacobin.com/2023/06/meta-is-trying-and-failing-to-crush-unions-in-kenya">Meta moderation in Kenya</a></li> <li><a href="https://blog.adafruit.com/2023/05/13/facebook-claiming-people-showing-art-electronics-wheelchair-modifications-is-hate-speech-our-video-show-and-tell-removed-from-facebook/">Facebook removes post showing art, electronics, and wheelchair mods as &quot;hate speech&quot;</a> <ul> <li>No support action does anything, but the post is restored after the story about this goes viral</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35936225">User notes that stories that vaguely resemble holding a gun to one's head, such as holding a hair dryer to one's head, get flagged</a></li> <li><a href="https://www.threads.net/@soozkempner/post/C2PKuDUIhzL">User reports Threads desktop isn't usable for them for 6 weeks and then it suddenly starts working again</a>; logging in on most browsers gives them an infinite loop</li> <li>Dead link due to Mastodon migration, but comment about FB spam which used to be accessible at <a href="https://mastodon.social/@jefftk@mastodon.mit.edu/109826480309020697">https://mastodon.social/@jefftk@mastodon.mit.edu/109826480309020697</a></li> <li><a href="https://news.ycombinator.com/item?id=34803157">User banned from Facebook's 2FA system (including WhatsApp, Insta, etc., causing business Insta page to get deleted) due to flaw in password recovery system</a> <ul> <li>Despite having 2FA enabled, someone was able to take over this person's FB account. On appealing this, user is told &quot;We've determined that you are ineligible to use Facebook&quot;</li> <li>User also used FB login for DMV and is no longer able to log into DMV</li> <li>New accounts the user creates are linked to the old account and banned. Someone comments, &quot;<a href="https://news.ycombinator.com/item?id=34804934">lol so they can identify/verify that but somehow fail to fingerprint login from Vietnam and account hijacking.</a>&quot;</li> <li>As usual, multiple people have <a href="https://news.ycombinator.com/item?id=34808158">the standard response</a> that it's <a href="https://news.ycombinator.com/item?id=34803698">the user's fault for using a big platform</a> and that no one should use these platforms, with comments like &quot;It's common sense and obvious, yet whenever it gets mentioned, the messenger gets dunked on for victim blaming or whatever ... 
Somehow, this is a controversial opinion on HN&quot; (there are many more such comments; I just linked a couple) <ul> <li>The author, who already noted in the post that his industry is dependent on Instagram, asks &quot;Please educate me on how to get the potential clients to switch platforms that they use to view pictures?&quot; and didn't get a response; presumably, as is standard, none of these commenters read the post, although perhaps some of them just think that no one should work in these industries and anyone who does so should go flip burgers or something</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34803491">User's account banned after it was somehow compromised from a foreign IP</a> <ul> <li><a href="https://news.ycombinator.com/item?id=34803620">User gets the standard comment about how FB couldn't possibly review cases like this due to its scale</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34807319">User effectively banned from Facebook due to broken password recovery process</a>, which requires getting total strangers to vouch that you're a real person, presumably due to some bad ML. <ul> <li>Afterwards, some scammer created a fake profile of the person, so there's now a fake version of the person around on FB and not a real one</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34804049">User effectively banned from FB due to bad &quot;suspicious activity&quot; detection</a> despite having 2FA on and having access to their password and 2FA</li> <li><a href="https://news.ycombinator.com/item?id=34803729">User repeatedly has account suspended for no discernable reason</a></li> <li><a href="https://news.ycombinator.com/item?id=34805537">User effectively banned from FB until a friend of theirs starts a job there, at which point their friend opens an internal ticket to get them unbanned</a></li> <li><a href="https://www.nj.com/news/2022/09/womans-facebook-was-hacked-and-disabled-her-instagram-too-why-is-meta-doing-to-fix-it.html">User banned from FB after account hacked</a></li> <li><a href="https://perfectionhangover.com/facebook-disabled-deactivated-my-account/">User banned from FB after account hacked</a> <ul> <li>See comments for many other examples</li> </ul></li> <li><a href="https://emilycordes.com/facebook/">User banned from Facebook after account hacked</a> <ul> <li>Luckily for the user, this made the front page of HN and was highly ranked, causing some FB engineers to reach out and then fix the issue</li> <li>Of course the HN post has the standard comments; one commenter suggests that people with the standard comments actually read the article before commenting, for once: &quot;Anyone saying 'Good riddance! Go enjoy your life without Facebook!' is missing the point. Please read this bit from the article:&quot;Thing is I’m a Mum of two who has just moved to a new area. Facebook groups have offered me support and community, and Mums I’ve met in local playgrounds have added me as a friend so we can use messenger to plan playdates. 
Without these apps sadly my little social life becomes a lot lonelier, and harder.&quot; <ul> <li><a href="https://news.ycombinator.com/item?id=31579665">Undeterred, commenters respond to this comment with things like</a> &quot;this might actually have been a blessing in disguise--just the encouragement she needed to let go and move on from this harmful platform.&quot;</li> </ul></li> </ul></li> <li><a href="https://www.google.com/maps/reviews/@37.485073,-122.1482824,17z/data=!3m1!4b1!4m6!14m5!1m4!2m3!1sChdDSUhNMG9nS0VJQ0FnSUNSd3N2YnVnRRAB!2m1!1s0x0:0x64979e438bf4e3a5?hl=en-US&amp;entry=ttu">People who don't know employees at FB who can help them complain on Google Maps about Facebook's anti-fraud systems</a></li> <li><a href="https://news.ycombinator.com/item?id=31579437">User banned from Facebook after posting about 12V system on Nissan Leaf in Nissan Leaf FB group</a> <ul> <li>The post was (presumably automatically) determined to have violated &quot;community standards&quot;, requiring identity verification to not be banned</li> <li>&quot;OK, I upload my driving licence. And it won't accept the upload. I try JPEG, PNG, different sizes, different browsers, different computers. Nothing seems to stick and after a certain number of attempts it says I have to wait a week to try again. After as couple of rounds of this the expiry deadline passes and my account is gone.&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31577975">Person notes that their wife and most of their wife's friends have lost their FB accounts at least once</a></li> <li><a href="https://news.ycombinator.com/item?id=31578591">Person notes that girlfriend's mother's account was taken over and reporting the account as being taken over does nothing</a> <ul> <li>The person, a programmer, finds it odd that taking over an account and changing the password, email, profile photo, full name, etc., all in quick succession doesn't trigger any kind of anti-fraud check</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31581510">User reports that you can get FB accounts banned by getting people in groups dedicated to mass reporting accounts to all report an account you want to get banned</a></li> <li>Someone wrote a spammy reply to a &quot;tweet&quot; of mine on Threads that was trending (they replied with a link to their substack and nothing else). I reported it and, of course, nothing happened. 
I guess I need to join one of the mass reporting groups to ask a bunch of people to report the spam.</li> <li><a href="https://news.ycombinator.com/item?id=31577718">User is locked out of account and told they need to upload photo ID, which does nothing</a> <ul> <li>Six months later, user gets to know a Facebook employee, who gets them unbanned</li> </ul></li> <li><a href="https://www.newsweek.com/onlyfans-star-slept-meta-employees-instagram-unbanned-1708744">User has Facebook account banned and can't get it unbanned</a> <ul> <li>User tried to contact FB employees on LinkedIn, which failed</li> <li>User then used Instagram to meet FB employees and sleep with them, resulting in the account getting unbanned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31581475">User is effectively banned from Instagram because they logged in from a new device and can't confirm a no-longer active email</a></li> <li><a href="https://news.ycombinator.com/item?id=31578789">User gets FB account stolen, apparently bypassing 2FA check the user thought would protect them</a> <ul> <li>White male, father of 3, has FB account replaced by young Asian female account, apparently not at all suspicious to FB anti-fraud systems</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31583954">User finds that someone is impersonating them on Instagram</a>; reporting this does nothing</li> <li><a href="https://news.ycombinator.com/item?id=31577818">User has ad account hacked; all attempts to contact support get no response or a useless auto-response or a useless webpage</a></li> <li><a href="https://news.ycombinator.com/item?id=31578197">User reports that there are multiple fake accounts impersonating them and family members</a> and that reporting these accounts does nothing</li> <li><a href="https://news.ycombinator.com/item?id=31631585">Relatively early post-IPO Facebook engineer has account banned from Facebook</a> and of course no standard appeal process works <ul> <li>User reports that their engineering friends inside the company are also unable to escalate the issue, so their account as well as ads money and Oculus games are lost</li> </ul></li> <li><a href="https://mastodon.social/@skolima/109762138275850929/">Sophisticated/technical user gets Instagram account stolen despite using strong random password, password manager, and 2FA</a> <ul> <li>Crypto people had been trying to buy the account for 6 months and then the account was compromised</li> <li><a href="https://mastodon.social/@alexjsp/109760284680279691">Following Instagram's official instructions for what to do literally results in an infinite loop of instructions</a></li> <li>Instagram claims that they'll send you an email from <code>security@mail.instagram.com</code> if you change the email on your account, but this didn't happen; user looked at their Fastmail logs and believes that their email was not compromised</li> <li>User was able to regain their Insta account after the story hit the front page of HN</li> <li>Multiple people note that there are services that claim to be able to get you a desired Insta handle for $10k to $50k; it's commonly believed that this is done via compromised Facebook employees. 
Since there is (in general) no way to appeal or report these things, whatever it is that these services do is generally final unless you're famous, well connected in tech, or can make a story about you go viral</li> </ul></li> <li><a href="https://social.coop/@jordancox/109762988866327497">Desirable Instagram handle is stolen</a> <ul> <li>The first two times this happened, user was able to talk to a contact inside Facebook to fix it, but they lost their contact at Facebook so the handle was eventually stolen and appears to be gone forever</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34549132">User tries to recover their mother's hacked Instagram account and finds that the recovery process docs are an infinite loop</a> <ul> <li>They also find that the &quot;click here if this login wasn't you&quot; e-mail anti-fraud link when someone tries to log in as you is a 404</li> <li>They also find that if an account without 2FA on gets compromised and the attacker turns on 2FA, this disables all old recovery mechanisms.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34553081">User logs in and is asked for email 2FA</a> <ul> <li>Email never arrives, isn't in spam folder, etc.</li> <li>User asks for another code, which returns the error &quot;Select a valid choice. 0 is not one of the available choices.&quot;</li> <li>Subsequent requests for codes fail. User tries to contact support and gets a pop-up which says &quot;If you’re unable to get the security code, you need to use the Instagram app to secure your account&quot;, but the user doesn't have the Instagram app installed, so their account is lost forever</li> </ul></li> <li><a href="https://fashionweekdaily.com/instagram-handle-sussexroyal-controversy/">Instagram takes username from user to give it to a more famous user</a>, a common story</li> <li><a href="https://news.ycombinator.com/item?id=34549196">User with password, 2FA, registered PGP key (!?) gets locked out of account due to some kind of anti-fraud system triggering</a>; FB claims that only a passport scan will unlock the account, which the user apparently hasn't tried</li> <li><a href="https://news.ycombinator.com/item?id=34443330">User finds that it's not possible to move Duo 2FA and loses account forever</a> <ul> <li>According to the user, FB has auth steps to handle this case, which involve sending in ID docs, which the user tries annually. 
These steps do nothing</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=39145302">User with Pixel phone can't use Bluetooth for months because Google releases an update that breaks Bluetooth (presumably only for some and not all devices) and doesn't bother to fix it</a></li> <li><a href="https://twitter.com/danluu/status/1584615878800576512">I tried clicking on some Facebook ads (I normally don't look at or click on them) leading up to Black Friday and most ads were actually scams</a></li> <li><a href="https://news.ycombinator.com/item?id=33280491">User reports fake FB profile (profile uses images from a famous dead person)</a> and gets banned after reporting the profile a lot; user believes they were banned for reporting this profile too many times</li> <li><a href="https://www.facebook.com/jefftk/posts/pfbid0gytpKikevTKLat5sE4En6BsRsGxpoaF3oR5joPFZXwgKuq2GrSLEbnV9xK24LScFl?comment_id=512928177058808&amp;reply_comment_id=1306228043460821">User makes FB post about a deepfake phishing attack</a>, which then attracts about 1 spam comment per minute that they have to manually delete because FB's systems don't handle this correctly</li> <li><a href="https://twitter.com/danluu/status/905456556979965953">FB Ad Manager product claims reach of 101M people in the U.S. aged 18-34, but U.S. census has the total population being 76M, a difference of 25M assuming all people in the U.S. in that age group can be reached via FB ads</a> <ul> <li>Former PM of the ads targeting team says that this is expected and totally fine because FB can't be expected to slice and dice numbers as small as tens or hundreds of millions accurately. &quot;Think at FB scale&quot;.</li> </ul></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/rwm50a/i_was_given_a_week_facebook_ban_over_sandshrew/">User gets banned from FB for a week for posting sexual content when they posted an image of a pokemon</a></li> <li>For maybe five years or so, I would regularly get spam in my feed where a scammer would sell fake sneakers and then tag a bunch of people, tagging someone I'm FB friends with, causing me to get this spam into my feed <ul> <li>This exact scam doesn't show up in my feed all the time anymore, but tag spam like this still sometimes shows up</li> </ul></li> <li><a href="https://twitter.com/Jxck_Sweeney/status/1766191047590420685">Instagram takes down account posting public jet flight information when billionaire asks them to</a></li> </ul> <h3 id="amazon">Amazon</h3> <ul> <li><a href="https://nitter.net/fchollet/status/1550930876183166976">Author notes that 100% of the copies of their book sold on Amazon are counterfeits (100% because Amazon never buys real books because counterfeiters always have inventory)</a> <ul> <li>Author spent months trying to get Amazon to take action on this; no effect</li> <li>Author believes that most high-sale volume technical books on Amazon are impacted and says that other authors have noticed the same thing</li> </ul></li> <li><a href="https://kurtisknodel.com/blog?post=amazon">Top USB drive listings on Amazon are scams</a></li> <li><a href="https://news.ycombinator.com/item?id=32289842">Amazon retail website asks user to change password; Amazon retail account and AWS stop working</a> <ul> <li>never restored</li> </ul></li> <li><a href="https://www.reddit.com/r/magicTCG/comments/omj8jh/amazon_customers_beware/">Magic card scam on Amazon</a> <ul> <li><a 
href="https://www.reddit.com/r/mtgfinance/comments/omjagl/after_watching_the_video_from_uebisquid_i_got/">many customers report that &quot;rare&quot; cards were removed from packs bought from Amazon</a></li> </ul></li> <li><a href="https://nitter.net/fchollet/status/1550930876183166976">Counterfeit books sold on Amazon</a> <ul> <li>seller of non-counterfeit books reported to Amazon various times over the years without effect</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32211102">User notes that Amazon is more careful about counterfeits in Australia than in the U.S. due to regulatory action, and that Valve only issued refunds in some cases due to Australian regulatory action</a></li> <li><a href="https://news.ycombinator.com/item?id=32210999">User notes that Amazon sells counterfeit Kindle books</a></li> <li><a href="https://news.ycombinator.com/item?id=32214971">Author notes that Amazon sells counterfeit copies of their book</a></li> <li><a href="https://wonderbowgames.com/en-uk/blogs/news/counterfeit-copies-of-kelp">Boardgame publisher reports counterfeit copies of their game on Amazon, which they have not been able to get Amazon to remove</a> <ul> <li>I saw this on a FB group I'm on since the publisher is running a PR blitz to get people to report the fake copies on Amazon in order to get Amazon to stop selling counterfeits</li> </ul></li> <li><a href="https://nitter.net/joabaldwin/status/1741145625809559933">Amazon resells returned, damaged, merchandise</a> <ul> <li>This is so common that authors warn each other that this happens and so that other authors know that to leave a note in the book telling the user what happened when authors return damaged author's copies of books</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38819176">Amazon ships &quot;new&quot; book with notes scribbled on pages and exercises done; on asking for replacement, user gets a 2nd book &quot;new&quot; in similar condition</a></li> <li><a href="https://www.youtube.com/watch?v=B90_SNNbcoU&amp;themeRefresh=1">Top-selling Amazon fuses dangerously doesn't blow at well above rated current</a></li> <li><a href="https://news.ycombinator.com/item?id=38820664">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819094">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819234">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819528">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819432">Amazon sells used item as new; book has toilet paper inside</a></li> <li><a href="https://news.ycombinator.com/item?id=38819839">Amazon sells used item as new; book has toilet paper inside</a></li> <li><a href="https://news.ycombinator.com/item?id=38821489">Amazon ships HDDs in oversized box with no padding</a></li> <li><a href="https://news.ycombinator.com/item?id=38821770">Amazon sells used or damaged items as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/brand_new_300_microwave_from_amazon_came_full_of/">Amazon sells used microwave full of food detritus as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kflvmby/">Amazon sells used pressure cooker with shopping bag and food smell as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfm4psd/">Amazon sells used vacuum cleaner as new, complete with home 
address and telephone number of the person who returned the vacuum</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfnvuf2/">Amazon ships incorrect product to user buying a new item, apparently because someone returned a different item</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfra5de/">Amazon sells incomplete used item as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfluk9m/">Amazon sells used items as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinteresting/comments/18uq0rg/this_book_sold_to_me_as_new_from_amazon_had_an/">Amazon sells used item as new, complete with invoice for sale 13 years ago, with name and address of previous owner</a></li> <li><a href="https://news.ycombinator.com/item?id=38817916">Amazon selling low-quality, counterfeit engine oil filters</a></li> <li><a href="https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/warning-letters/amazoncom-inc-662503-12202023">Amazon sells supplements with unlabeled or mislabeled ingredients</a> <ul> <li><a href="https://news.ycombinator.com/item?id=38795981">Someone notes that Amazon used to require certification for supplements, but the person who was driving this left Amazon and it appears that no one has picked it up</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38794482">Amazon sells counterfeit shampoo that burns user's scalp</a> <ul> <li>User wrote a review, which was deleted by Amazon</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38795941">Amazon sells damaged, used, items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38795469">Amazon sells counterfeit supplement</a> <ul> <li>User wrote a review noting this, which Amazon deleted</li> </ul></li> <li><a href="https://www.reddit.com/r/lego/comments/18qomuv/amazon_set_filled_with_trash/">Amazon sells box full of trash as new Lego set</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/didnt_order_a_refurbished/">Amazon sells used item with a refurbished sticker on it as new</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/comment/kekcvmm/">User has National Geographic subscription that can't be cancelled through the web interface, so they talk to Amazon support to cancel it; Amazon support cancels their Amazon Prime subscription instead</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/comment/kejdhoe/">Amazon sells used, damaged, item as new</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/comment/kejp9p3/">Amazon sells used item with Kohl's sticker on it as new</a></li> <li><a href="https://boardgamegeek.com/thread/3209950/scam-amazon-testament-returned-item-onsold-new">Amazon sells nearly empty boardgame box as new, presumed to be returned item with game removed</a></li> <li><a href="https://www.facebook.com/groups/132851767828/permalink/10161153164112829/">Amazon sells counterfeit board game</a></li> <li><a href="https://news.ycombinator.com/item?id=38691752">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> <li><a href="https://news.ycombinator.com/item?id=38691794">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> 
<li><a href="https://news.ycombinator.com/item?id=38691812">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> <li><a href="https://news.ycombinator.com/item?id=38696003">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> <li><a href="https://www.reddit.com/r/hardware/comments/sd10h1/counterfeit_sd_cards/">Amazon sells counterfeit SD cards; user compares to reference SD card bought at brick and mortar store</a> <ul> <li>A commenter notes that counterfeit SD cards are so common on Amazon that r/photography has a blanket recommendation against buying SD cards on Amazon</li> </ul></li> <li><a href="https://www.fredmiranda.com/forum/topic/1772750/1&amp;year=2022#16137759">User leave a review noting that product is a scam/fake an review is rejected</a></li> <li><a href="https://www.fredmiranda.com/forum/topic/1619442/0">Counterfeit lens filter on Amazon</a>; multiple users note that they never buy camera gear (which includes things like SD cards) from Amazon because they've received too many counterfeits</li> <li><a href="https://www.reddit.com/r/pcmasterrace/comments/191v2mg/comment/kgy40ar/">Amazon sells used, dirty, CPU as new CPU</a>; CPU is marked &quot;NOT FOR RESALE&quot; (NFR) <ul> <li>it's not known why this CPU is marked NFR; a commenter speculates that it was a review copy of a CPU, in which case it would be relatively likely to be <a href="why-benchmark/">a highly-binned copy that's better than what you'd normally get</a>. On the other hand, it could also be an early engineering sample with all sorts of bugs or other issues; when I worked for a CPU company, we would buy Intel CPUs to test them and engineering samples would not only have a variety of bugs that only manifested in certain circumstances, they would sometimes have instructions that did completely different things that could be reasonable behavior, except that Intel had changed the specified behavior before release, so the CPU would just do the wrong thing, resulting in crashes on real software (this happened with the first CPU we were able to get that had the <code>MWAIT</code> instruction, an engineering sample that was apparently from before Intel had finalized the current behavior of <code>MWAIT</code>).</li> </ul></li> <li><a href="https://www.cbc.ca/news/gopublic/amazon-returns-watch-missing-1.7046767">Amazon doesn't refund user after they receive empty box instead of $2100 item, but there's a viral story about this, so maybe Amazon will fix this</a></li> <li><a href="https://www.reddit.com/r/SonyAlpha/comments/beoecq/update_amazon_sent_me_the_wrong_lens_and_is/">Amazon refuses to refund user after sending them an old, used, camera lens instead of high-end new lens</a> <ul> <li>On the photography forum where this is posted, users note that you should never by camera lenses or other high-end gear from Amazon if you don't want to risk being scammed</li> </ul></li> <li><a href="https://www.reddit.com/r/photography/comments/rx50ny/psa_watch_out_when_buying_used_lens_from_amazon/">Amazon doesn't refund user after sending them a low-end used camera lends instead of the ordered high-end lens</a> <ul> <li>Users on this photography forum (a different one than the above) note that this happens frequently enough that you should never order camera lenses from Amazon</li> </ul></li> <li><a 
href="https://petapixel.com/2021/05/26/couple-buys-7000-camera-from-amazon-gets-empty-boxes-instead/">Amazon refuses to refund user who got an empty box instead of a $7000 camera</a> until story goes viral and Amazon gets a lot of bad press <ul> <li>Based on the shipping weight, Amazon clearly shipped something light or an empty box and not a camera</li> </ul></li> <li><p><a href="https://www.cbc.ca/news/canada/british-columbia/amazon-shoe-packages-1.6926200">User gets constant stream of unwanted Amazon packages</a></p> <ul> <li>In response to a news story, Amazon says &quot;The case in question has been addressed, and corrective action is being taken to stop the packages&quot;, but the user reports that nothing has changed</li> </ul></li> <li><p><a href="https://www.reddit.com/r/onguardforthee/comments/18alquj/comment/kbypxid/">Amazon sells user used AirPods, which later causes a problem when they want to use the 1-year warranty because Apple shows an in-service date 2 months before the user bought the item</a></p> <ul> <li>To fix this, Apple requests that the user get some evidence from Amazon that the particular serial number was associated with their purchase and Amazon refuses to do this; people recommend that, to fix this, the user do the &quot;standard&quot; Amazon scam of buying a new item and returning a used item to swap a broken used item for a new item</li> </ul></li> <li><p><a href="https://www.reddit.com/r/onguardforthee/comments/18alquj/comment/kbyzh9t/">User receives old, used, HD from Amazon instead of new HD</a></p></li> <li><p><a href="https://www.reddit.com/r/Justrolledintotheshop/comments/17sgdc0/psa_stop_buying_car_parts_on_amazon/">Mechanic warns people not to buy car parts on Amazon because counterfeits are so frequent</a></p> <ul> <li>They note that you can get counterfeits in various places, but the rate is highest from Amazon and it's often a safety issue; they're current dealing with a customer who had counterfeit brake pads</li> <li>Many other mechanics reply and report similar issues, e.g., someone bought a water pump from Amazon that exploded after 5 months that they believe is fake</li> </ul></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/ysk_that_amazon_has_a_serious_problem_with/">User stops buying household products from Amazon because counterfeit rate is too high</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2r9srl/">User gets counterfeit card game from Amazon</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2rtfex/">User gets counterfeit board game from Amazon</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2sexe6/">Amazon sells counterfeit gun parts and accessories</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2tfkpa/">Amazon sells so many counterfeits that board game maker runs a marketing campaign to ask people to stop buying their game on Amazon</a></p> <ul> <li>They spent months trying to get Amazon to go after counterfeits without making progress until the marketing campaign; two days after they started it, Amazon contacted them to try to deal with the issue</li> </ul></li> <li><p><a href="https://themarkup.org/show-your-work/2020/06/18/how-we-investigated-banned-items-on-amazon-com">Searching for items in many categories trivially finds huge number of fraudulent or counterfeit items</a></p></li> <li><p><a 
href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2r4l2r/">User gets counterfeit hair product that burns scalp</a></p></li> <li><p><a href="https://www.reddit.com/r/books/comments/1976o66/found_out_my_friend_returns_books_right_after/">User receives used book from Amazon and their friend tells them that it's normal to buy books and return them in the return window, which their friend does all the time</a></p></li> <li><p><a href="https://forums.macrumors.com/threads/amazon-shuts-down-customers-smart-home.2392704/">Amazon driver mishears automated response from Eufy doorbell</a>, causing Amazon to shut down user's smarthome (user was able to get smarthome and account back after one week)</p> <ul> <li>Video footage allegedly shows that the doorbell said &quot;excuse me, can I help you&quot;, which lead to an Amazon executive personally accusing this user of racism; when account was unlocked, the user wasn't informed (except that things started working again)</li> <li>In the comments to the article, someone says that it's impossible that Amazon would do this, with comments like &quot;None of this makes any sense and is probably 100% false.&quot;, as if huge companies can't do things that don't make any sense, but Amazon's official response to a journalist reaching our for comment confirms that the something like the events happened; if it was 100% false, it would be very strange for Amazon to respond thusly instead of responding with a denial or not responding</li> </ul></li> <li><p><a href="https://www.youtube.com/watch?v=Kcohq313q00">Youtuber who made a video about the above has their Amazon Associates account deleted after video goes viral</a></p></li> <li><p><a href="https://news.ycombinator.com/item?id=36766149">Amazon account gets locked out</a>; support refuses to acknowledge there's an issue until user calls back many times and then tells user to abandon the account and make another one</p></li> <li><p><a href="https://news.ycombinator.com/item?id=36447510">User has Amazon account closed because they sometimes send gifts to friends back home in Belarus</a></p></li> <li><p><a href="https://news.ycombinator.com/item?id=36447380">User gets counterfeit item from Amazon</a>; they contact support with detailed photos showing that the item is counterfeit and support replies with &quot;the information we have indicates that the product you received was authentic&quot;</p></li> <li><p><a href="https://www.reddit.com/r/pcmasterrace/comments/1996od8/update_amazon_sent_the_wrong_gpu_twice_but_the/">User gets the wrong GPU from Amazon, twice</a>; luckily for them, the second time, Amazon sent a higher end GPU than was purchased, so the user is getting a free upgrade</p></li> <li><p><a href="https://www.inc.com/sonya-mann/amazon-counterfeits-no-starch.html">Technical book publisher fails to get counterfeits removed from Amazon</a></p> <ul> <li>Amazon announced a new system designed to deal with this, but people continue to report rampant technical book counterfeiting on Amazon, so the system does not appear to have worked</li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=35919753">ChatGPT clone of author's book only removed after Washington Post story on problem</a></p></li> <li><p><a href="https://news.ycombinator.com/item?id=35922209">Searching for children's books on Amazon returns AI generated nonsense</a></p></li> <li><p><a href="https://wandering.shop/@youseeatortoise/111782434593735690">Amazon takes down legitimate cookbook</a>; author notes &quot;They 
won't tell us why. They won't tell us how to fix whatever tripped the algorithm. They won't seem to let us appeal. Reaching a human at Amazon is a Kafkaesque experience that we haven't yet managed to do.&quot;</p> <ul> <li>When I checked later, not restored despite viral Mastodon thread and highly upvoted/ranked front-page HN article</li> <li><a href="https://news.ycombinator.com/item?id=39066608">Multiple</a> <a href="https://news.ycombinator.com/item?id=39068983">people</a> give the standard response of asking why booksellers bother to use Amazon, seemingly unaware (???) that Amazon has a lot of marketshare and authors can get a lot more reach and revenue on Amazon than on other platforms (when they're not arbitrarily banned) (the author of the book replies and says this as well, but one doesn't need to be an author to know this)</li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=35521641">Amazon basically steals $250 from customer</a>, appeal does nothing, as usual</p></li> <li><p><a href="https://www.reddit.com/r/Wellthatsucks/comments/19csery/good_ole_amazon/">Amazon delivers package directly to food waste / compost bin and declines to provide any compensation</a></p> <ul> <li>User notes that they had a nice call with Amazon support and that they hope this doesn't happen again. From my experience with trying to get Amazon to stop shipping packages via Intelcom and Purolator, I suspect this user will have this problem happen again — I've heard that you can get them to not deliver using certain mechanisms, but you have to repeatedly contact support until someone actually puts this onto your file, as opposed to just saying that they'll do it and then not doing it, which is what's happened the two times I've contacted support about this</li> </ul></li> <li><p><a href="https://www.reddit.com/r/pcmasterrace/comments/11q8mjw/i_got_a_4090_but_its_showing_up_as_a_3070ti_fresh/">User receives fake GPU from Amazon</a>, after an attempt to buy from the official Amazon.com msi store</p></li> <li><p><a href="https://www.reddit.com/r/Wellthatsucks/comments/10l5ndf/not_sure_if_this_is_the_right_subreddit_but_a/">Amazon Fresh order comes with bottle of urine</a></p></li> <li><p><a href="https://www.howtogeek.com/142496/why-the-heck-is-amazon-selling-these-fake-16-terabyte-portable-hard-drives/">Amazon sells many obviously fake 16 TB physically tiny SSD drives for $100</a></p> <ul> <li>The author sent a list of fakes to Amazon and a few disappeared. The author isn't sure if the listings that disappeared were actually removed by Amazon or if it's just churn in the listings</li> <li><a href="https://news.ycombinator.com/item?id=34410697">An HN commenter searches and also finds many fakes</a>, which have good reviews that are obviously for a different product; someone notes that they've tried reporting these kinds of obvious fakes where someone takes a legitimate product with good reviews and then swaps in a scam product but that this does nothing</li> <li><a href="https://news.ycombinator.com/item?id=34409945">Multiple people note that they've tried leaving 1* reviews for fake products and had these reviews rejected by Amazon for not meeting the review policy guidelines</a></li> <li>Some time after this story made the front page of HN, this class of fakes got cleaned up. 
However, other fakes that are mentioned in the HN comments (see item directly below this) did not get cleaned up; maybe someone can write an article about how these other items are fake to get these other things cleaned up as well</li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=34410117">User notes that bestselling item on Amazon is a fake item</a> and that they tried to leave a review to this effect, but the review was rejected</p> <ul> <li>I looked up the item and it's still a bestselling item. There are some reviews which indicate that it's a fake item, but this fake item seems to have been on sale for years</li> </ul></li> <li><p><a href="https://github.com/DesktopECHO/T95-H616-Malware">Amazon sells Android TV boxes that are actually malware</a></p> <ul> <li>It appears that these devices have been on sale on Amazon at least since 2017; I clicked the search query in the link of the above post and it still returns many matching devices in 2024</li> </ul></li> <li><p><a href="https://krebsonsecurity.com/2024/01/canadian-man-stuck-in-triangle-of-e-commerce-fraud/">Amazon scammer causes user to get arrested and charged for fraud</a>, which causes user to lose their job</p> <ul> <li>The user also notes &quot;In Canada, a criminal record is not a record of conviction, it’s a record of charges and that’s why I can’t work now. Potential employers never find out what the nature of it is, they just find out that I have a criminal arrest record.&quot;</li> <li><a href="https://www.youtube.com/watch?v=2IT2oAzTcvU">For more information on how the scam works, see this talk by Nina Kollars</a></li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=33839191">An Amazon seller story</a></p> <ul> <li>It's unclear exactly what's going on here since <a href="https://news.ycombinator.com/item?id=33841910">some parts of the seller's story appear to be false</a>? Some parts are quite plausible and really terrible if true</li> </ul></li> <li><p><a href="https://www.cbc.ca/news/canada/toronto/prohibited-weapons-found-on-amazon-1.7079582">Illegal weapon a bestselling item on Amazon</a>, although this does get removed after it's reported</p></li> <li><p><a href="https://web.archive.org/web/20240112193755/https://www.amazon.com/fulfill-request-respectful-information-users-Brown/dp/B0CM82FJL2">Fake Amazon listings with titles and descriptions like &quot;I'm sorry but I cannot fulfill this request it goes against OpenAI use policy. My purpose is to provide helpful and respectful information to users&quot;</a></p> <ul> <li>The most obvious cases seem to have been cleaned up after a story about this hit #1 on HN</li> <li><a href="https://www.amazon.com/s?k=FOPEAS">Someone noted that the seller's page is still up</a> (which is still true today) and if you scroll around for listings, other ones with slightly different text, like <a href="https://web.archive.org/web/20240112231322/https://www.amazon.com/complete-information-provided-provide-context/dp/B0C5X1GXPH">&quot;I'm sorry I cannot complete this task there isn't enough information provided. 
Please provide more context or information so I can assist you better &quot;</a> are still up <ul> <li>These listings are total nonsense, such as the above, which has a photo of a cat and also says &quot;Exceptional Read/Write Speeds: With exceptional read/write speeds of up to 560MB and 530MB &quot;</li> </ul></li> <li>I checked out other items from this seller, and they have a silicone neck support &quot;bowl&quot; that also says &quot;Note: Products with electrical plugs are designed for use in the US. Outlets and voltage differ internationally and this product may require an adapter or converter for use in your destination. Please check compatibility before purchasing.&quot;, so it seems that someone at Amazon took down the listings that HN commenters called out (the HN thread on this is full of HN commenters pointing out ridiculous listings and those individual listings being taken down), but there's no systematic removal of nonsense listings, of which there are many</li> </ul></li> <li><p>I tried to buy 3M 616 litho tape from Amazon (in Canada) and every listing had a knock-off product that copy+pasted the description of 3M 616 into the description</p> <ul> <li>It's possible the knock-off stuff is as good, but it seems sketchy (and an illegal trademark violation) to use 3M's product description for your own knock-off product; at least some reviews indicate that buyers expected to get 3M 616 and got a knock-off instead</li> </ul></li> <li><p>When searching for replacement Kidde smoke detectors on amazon.ca, all of the ones I found are not Canadian versions, meaning they're not approved by CSA, cUL, ULC or cETL. It's possible this doesn't matter, but in the event of a fire and an insurance claim, I wouldn't want to have a non-approved smoke detector</p></li> <li><p><a href="https://www.reddit.com/r/TyreReviews/comments/1atu9el/comment/kr3jse2/">Amazon store selling 5 year old tires as new (tires age over time and 5 year old tires should not be sold as new)</a></p></li> </ul> <h3 id="microsoft">Microsoft</h3> <p>This includes GitHub, LinkedIn, Activision, etc.</p> <ul> <li><a href="https://www.threads.net/@doniecnn/post/C2IPHSFuNHJ/?igshid=NTc4MTIwNjQ2YQ%3D%3D">Microsoft AI generated news articles put person's photo into a story about a different person's sexual misconduct trial</a> <ul> <li><a href="https://www.cnn.com/2023/11/02/tech/microsoft-ai-news/index.html">Other incorrect AI generated stories include Joe Biden falling asleep during a moment of silence for Maui wildfire victims, a conspiracy theory about Democrats being behind the recent covid surge</a>, and a story about San Francisco Supervisor Dean Preston resigning after criticism by Elon Musk; these seem to be a side effect of laying off human editors and replacing them with AI</li> <li>Other results include an obituary for a former NBA player who died at age 42, titled &quot;Brandon Hunter useless at 42&quot;, and an AI generated poll attached to a Guardian article on a deceased 21-year-old woman, &quot;What do you think is the reason behind the woman’s death&quot; with the options &quot;murder, accident, or suicide&quot;</li> </ul></li> <li><a href="https://twitter.com/defunkt/status/1754610843361362360">User banned from GitHub for no discernable reason</a> <ul> <li>User happens to be co-founder of GitHub, so this goes viral when they tweet about it, causing them to get unbanned; GitHub's COO responds with &quot;You're 100% unsuspended now. 
I'm working with our Trust &amp; Safety team to understand what went wrong with our automations and I'm incredibly sorry for the trouble.&quot;</li> </ul></li> <li><a href="https://nitter.net/garybernhardt/status/1327349023473242112">Gary Bernhardt, a paying user of GitHub, files a Privacy / PII GitHub support request</a> <ul> <li>ignored for 51 days, until Gary creates a viral Twitter thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31762904">LinkedIn user banned after finding illegal business on LinkedIn and reporting it</a> <ul> <li>seems like the illegal business used their accounts to mass report the user</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31763503">LinkedIn user banned for looking at too many profiles</a> <ul> <li>appeal rejected by customer service</li> <li>this also happened to me when I was recruiting and looking at profiles and I also got nonsense responses from customer service, although my account wasn't permanently banned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38882186">Azure kills entire company's prod subscription because Azure assigned them a shared IP that another customer used in an attack</a></li> <li><a href="https://djanes.xyz/spam-on-github-is-getting-crazy-these-days/">GitHub spam is out of control</a></li> <li><a href="https://news.ycombinator.com/item?id=38793607">Outlook / Hotmail repeatedly incorrectly blocks the same mail servers</a>; this can apparently be fixed by: <ul> <li>Visit <a href="https://olcsupport.office.com/">https://olcsupport.office.com/</a> and submit the complaint; wait for the auto-reply, followed by the &quot;Nothing was detected&quot; email; then reply with &quot;Escalate&quot; in the body, which causes the server to get unblocked again in a day</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38795329">User reports that, every December, users on the service get email rejected by Microsoft, which needs to be manually escalated every year</a></li> <li><a href="https://news.ycombinator.com/item?id=38799729">User running mail server on same IP for 10+ years, with no one else using IP, repeatedly has Microsoft block mail from the IP address, requiring manual escalation to fix each time</a></li> <li><a href="https://news.ycombinator.com/item?id=38799753">Whitelisting a server doesn't necessarily allow it to receive email if Microsoft decides to block it</a>; a Microsoft employee thinks this should work, but it apparently doesn't work</li> <li><a href="https://news.ycombinator.com/item?id=38792852">Microsoft arbitrarily blocks email from user's server; after escalation, they fix it, but only for hotmail and live.com, not Office 365</a></li> <li><a href="https://mastodon.social/@danluu/111349023730856070">OpenAI decides user needs to do 80 CAPTCHAs in a row to log in</a> <ul> <li>In response to this, someone sent me: &quot;Another friend of mine also had terrible issues even signing up for openai -- they told him he could only use his phone number to sign up for a maximum of 3 accounts, and he tried telling them that in fact he had only ever used it to sign up for 1 account and got back the same answer again and again (as if they use their own stuff for support) ... 
he said he kept trying to emphasize the word THREE with caps for the bot/human on the other end&quot; [but this didn't work]</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38147774">User reports software on GitHub with a malware installer three times and GitHub does nothing</a></li> <li><a href="https://twitter.com/danluu/status/1486951948448251907">I used LinkedIn for recruiting, which involved (manually) looking at people's profiles, and was threatened with a ban for looking at too many profiles</a> <ul> <li>The message says you should contact support &quot;if you think this was in error&quot;, but this returns a response that's either fully automated or might as well be and appears to do nothing</li> </ul></li> <li><a href="https://nitter.net/garybernhardt/status/1650947253773938688">Gary Bernhardt spends 5 days trying to get Azure to take down phishing sites</a>, which does nothing <ul> <li>Gary has 40k Twitter followers, so he tweeted about it, which got the issue fixed after a couple of days. Gary says &quot;No wonder the world is so full of scams if this is the level of effort it takes to get Microsoft to take down a single phishing site hosted on their infrastructure&quot;.</li> </ul></li> <li><a href="https://mastodon.social/@danluu/110335983520055904">Spammer spams GitHub repos with garbage issues and PRs for months</a> <ul> <li>After I made a viral Mastodon thread about this, which also made it to the front page of HN, <a href="https://mastodon.bawue.social/@ixs/110337628935185476">one of the two accounts was suspended</a>, but when I checked significantly later, the other was still around and spamming <ul> <li>I did not report this account because I reported a blatant troll account (which I know was banned from Twitter and lobsters for trolling) and got no action, and I've seen many other people indicate that they find GitHub reporting to be useless, which seems to have been the case here; <a href="https://mastodon.social/@networkexception@chaos.social/110337762885594243">one person noted that, before my viral thread, they had already blocked the account from a repo they're a maintainer of and didn't bother to report because of GitHub's bad reporting flow</a></li> </ul></li> </ul></li> <li><a href="https://daverupert.com/2023/02/solved-the-case-of-the-bing-ban-theory/">Microsoft incorrectly marks many blogs as spam, banning them from Bing as well as DuckDuckGo</a> <ul> <li>Fixed sometime after a post about this went viral</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35657982">GitHub Copilot emits GPL code</a></li> <li><a href="https://www.tomshardware.com/news/windows-keeps-feeding-tabloid-news">Windows puts conspiracy theory articles and other SEO spam into search menu</a></li> <li><a href="https://news.ycombinator.com/item?id=35156916">Microsoft bans people using prompt injections on BingGPT</a></li> <li><a href="https://iter.ca/post/gh-sig-pwn/">User finds way to impersonate signed commits from any user because GitHub uses regexes instead of a real parser and has a bug in their regex</a> <ul> <li>Bug report is initially closed as &quot;being able to impersonate your own account is not an issue&quot;, by someone who apparently didn't understand the issue</li> <li>After the user pings the issue a couple more times, the issue is eventually re-opened and fixed after a couple of months, so this is at least better than the other GitHub cases we've seen, where someone has to make a viral Twitter thread to get the issue fixed</li> <li><a 
href="https://news.ycombinator.com/item?id=39101099">In the HN comments for the story</a>, someone notes that GitHub is quick to close security issues that they don't seem to have looked closely at</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33917962">User is banned from GitHub after authorizing a shady provider with a login</a> <ul> <li>Of course this has the standard comments blaming the user, but <a href="https://news.ycombinator.com/item?id=33921457">people note that the &quot;log in with GitHub&quot; prompt and the &quot;grant this service to act on your behalf&quot; prompt look almost identical</a>; even so, people keep responding with comments like &quot;dont bother wasting anymore resources to protect the stupids&quot;</li> </ul></li> <li><a href="https://blog.mikeswanson.com/post/702753924034297856/activisions-faulty-anti-cheat-software">Activision's RICOCHET anti-cheat software is famous for having a high false positive rate, banning people from games they paid for</a> (this also bans people from playing &quot;offline&quot; in single-player mode) <ul> <li>User had their game crash 8 times in a row due to common problems (many people reported crashes with the version that crashed for this user), which apparently triggered some kind of false positive in anti-cheat software</li> <li>Support goes well beyond what most companies respond with, and responds with &quot;Any specifics regarding the ban will not be released in order to help maintain the integrity and security of the game, this is company policy that will not be changing.&quot;</li> <li>Since this software is famous for being inaccurate and having a high false positive rate, <a href="https://www.reddit.com/r/activision/comments/17pdw1k/i_thought_you_were_all_cheaters_complaining_but/">there are a huge number of accounts of false bans, such as this one</a>. In order to avoid doubling the length of this post, I won't list these</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33859751">Relatively rare case of user figuring out why they were incorrectly banned by Activision and getting their account restored</a> <ul> <li>Of course support was useless as always and trying to get help online just resulted in getting a lot of comments blaming the user for cheating</li> <li>User was banned because, after Activision and Blizzard were merged, their Blizzard username (which contains the substring &quot;erotica&quot;) became illegal, causing them to be banned by Activision's systems. 
But, unlike a suspension for an illegal username in Blizzard's system, Activision's system doesn't tell you that you have an illegal username and just bans you</li> <li>Luckily, the user was able to find a single Reddit post by someone who had a similar issue and that post had a link that lets you log into the account system even if you're banned, which then lets you change your username</li> <li>Three days after making that change, the user was unbanned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33860501">User who bought Activision game to play only in single-player campaign mode is banned for cheating after trying to launch/play game on Linux through Wine/Proton</a> <ul> <li>Support gave user the runaround and eventually stopped responding, so user appears to be permanently banned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33860849">Anti-&quot;cheat&quot; software bans users before they can even try playing the game</a> <ul> <li>Someone speculates that it could be due to buying refurbished hardware, since Activision bans based on hardware serial numbers and some people were banned because they bought SSDs from banned machines</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33862300">Anti-&quot;cheat&quot; software bans user from Bungie (Activision) game for no discernable reason</a>; user speculates it might be because they use AutoHotkey to script Windows (for out of game activities)</li> <li><a href="https://www.reddit.com/r/Minecraft/comments/vz8myi/7_day_ban_all_for_typing_nigel_on_a_sign_several/">Minecraft user banned for 7 days for making sign that says Nigel on their mom's realm</a> (server, basically?); other users report that creating or typing something with the substring &quot;nig&quot; is dangerous<br /> <ul> <li>See also, <a href="https://www.youtube.com/watch?v=elIwvFi83AM">offensive words in Minecraft</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33038600">Microsoft Edge incorrectly blocks a page as being suspicious</a> <ul> <li>Developer tries to appeal, but is told that they need to send a link to a URL for the support person to look at, which is impossible because it's an API server that has no pages. Support does not reply to this.</li> </ul></li> <li><a href="https://www.reddit.com/r/gaming/comments/3wx4qu/wow_player_banned_by_blizzard_after_killing_a/">User banned from WoW for beating someone playing with 60 accounts, who submits 60 false reports against user</a>; people report this kind of issue in Overwatch as well, where mass reporting someone is an easy way to get Blizzard to suspend or ban their account</li> <li><a href="https://www.reddit.com/r/Asmongold/comments/q6bgr4/player_banned_for_not_renaming_pet/">User suspended from WoW for not renaming their pet from its default name of &quot;Gorilla&quot;, which was deemed to be offensive</a></li> </ul> <h3 id="stripe">Stripe</h3> <ul> <li><a href="https://news.ycombinator.com/item?id=32854528">Turns off account for a business that's been running since 2016 with basically the same customers</a>. After a week of talking to tech support, the account is reactivated and then, shortly afterwards, 35% of client accounts get turned off. 
<a href="https://news.ycombinator.com/item?id=32858314">Account reactivated after story got 1k points on HN</a></li> <li><a href="https://news.ycombinator.com/item?id=34189717">Stripe holds $400k from account and support just gives developer the runaround for a month</a> <ul> <li>Support asks for 2 receipts and then, after they're sent, asks for the same two receipts again, etc.</li> <li>As usual, HN commenters blame the developer and make up reasons that the developer might be bad, e.g., people say that the developer might be committing fraud. From a quick skim, at least five people called the developer's story fake or said or implied that the developer was involved in some kind of shady activity</li> <li><a href="https://news.ycombinator.com/item?id=34233011">Shortly after the original story made HN, Stripe resolved the issues and unlocked the accounts</a>, so the standard responses that the developer must be doing something fraudulent were wrong again; <a href="https://news.ycombinator.com/item?id=34233018">a detailed accounting of what happens makes it clear that nothing about Stripe's response, other than the initial locking for potential fraud, was remotely reasonable</a> <ul> <li>The developer notes that Stripe support was trying to stonewall them until they pointed out that there was a high-ranking HN post about the issue: &quot;Dec 30 [over one month from the initial freezing of funds]: While I was writing my HN post I was also on chat with Stripe for over an hour. No new information. They were basically trying to shut down the chat with me until I sent them the HN story and showed that it was getting some traction. Then they started working on my issue again and trying to communicate with more people. No resolution.&quot;</li> </ul></li> <li>After the issue was resolved, the developer was able to get information from Stripe about why the account was locked; the reason was that the company had a spike in sales due to Black Friday. 
Until the issue hit the top of HN, the developer was not able to talk to any person at Stripe who was useful in any way</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33743750">Developer at SaaS for martial arts academies in Europe notes that some new anti-fraud detection seems to be incorrectly suspending accounts</a>; their academies have their own accounts and multiple got suspended <ul> <li><a href="https://news.ycombinator.com/item?id=33744053">These stories are frequent enough that someone responds &quot;Monthly HN Stripe customer support thread&quot;</a>, to which the moderator of HN responds that it's more than monthly and HN will probably have to do something about this at some point, since having the HN front page be Stripe support on a regular basis is a bit much <ul> <li>Doing a search now, there are still plenty of support horror stories, <a href="https://news.ycombinator.com/item?id=38967685">but they typically don't get many votes and don't have Stripe staff drop in to fix the issue</a>, so it seems that this support channel no longer works as well as it used to.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33745033">Multiple</a> <a href="https://news.ycombinator.com/item?id=33744754">people</a> <a href="https://news.ycombinator.com/item?id=33744614">point</a> out issues in how Stripe handles SEPA DD and <a href="https://news.ycombinator.com/item?id=33895486">other users of Stripe note that they're impacted by this as well</a></li> <li><a href="https://news.ycombinator.com/item?id=33744135">Of course, this gets the usual responses that we need to see both sides of the story, maybe you're wrong and Stripe is right, etc.</a>; the developer responds to one of these with an apology for their error</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32855524">After account was approved and despite many tests using the Stripe test environment, on launch, it turns out that the account wasn't approved after all and payments couldn't go through</a>. 
Some people say that you should send real test payments before launch, but <a href="https://news.ycombinator.com/item?id=32856098">someone notes that using real CC info for tests and not the Stripe test stuff is a violation of Stripe's terms</a></li> <li><a href="https://nitter.net/rameerez/status/1665680257788026880">Stripe user notes that Stripe fraud detection completely fails on some trivial attacks, writes about it after seeing it hit them as well as many other Stripe users</a> <ul> <li><a href="https://news.ycombinator.com/item?id=36197160">Developer describes the support they received as &quot;a joke&quot;</a> since they had to manually implement rules to block the clearly fraudulent charges</li> <li>A Stripe developer replies and says they'll look into it after two threads on this go viral</li> </ul></li> <li><a href="https://nitter.net/adamwathan/status/1550092016242946049">Shut down payments for job board; seems to have been re-activated after Twitter thread</a></li> <li><a href="https://news.ycombinator.com/item?id=32855509">Turned off company without warning; company moved to Parallel Economy</a></li> <li><a href="https://nitter.net/garybernhardt/status/1719862368887607659">Wording of Stripe's renewal email causes users of service to think that you have to email the service to cancel</a>; issue gets no action for a year, until Gary Bernhardt publicly tweets about it</li> <li><a href="https://news.ycombinator.com/item?id=38182339">User has Stripe account closed for no discernable reason</a></li> <li><a href="https://news.ycombinator.com/item?id=36904347">Stripe user has money taken out of their account</a> <ul> <li>A Stripe employee responds with &quot;we wouldn't do so without cause&quot;, implying that Stripe's error rate is 0%</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36403607">Stripe arbitrarily shuts down user's business</a> <ul> <li>Payments restored after story goes viral on HN <ul> <li>This happens so frequently that multiple people comment on how often this happens and how this seems to be the only way to get support from Stripe for some business-killing issues</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36198562">Developer notes that Stripe fraud detection totally failed when people do scams via CashApp</a> <ul> <li>Another developer replies and notes that it's weird that you can block prepaid cards but not CashApp when CashApp is used for so much scamming</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35809687">Developer has payments frozen, initially because account was misclassified and put into the wrong risk category</a> <ul> <li>Developer notes that the suspension is particularly problematic because of a &quot;minimum fee commitment&quot; with Stripe, where they get a discount but also have a fee floor regardless of transaction volume; having payments suspended effectively increases their rate</li> <li>After one week, their account was unfrozen, but then another department froze their account, &quot;this time by a different Stripe risk team with even weirder demands: among other things, they wanted a 'working website' (our website works?) and 'contact information to appear on the website' (it's on every page?) 
It was as if Stripe had never heard of or talked to us before, and just like the other risk team, they asked questions but didn't respond to our emails.&quot;</li> <li>This also got resolved, but new teams keep freezing their account, causing the developer to go through a similar process again each time</li> <li>Fed up with this, the developer made an HN post which got enough upvotes that the Stripe employee who handles HN escalations saw the post</li> <li>Of course, someone has the standard response that this must be the user/developer's fault, it must be because the business is shady or high risk, one that typically gets banned from payment processors, but if that's the case, that makes this example even worse — why would Stripe negotiate a minimum fee agreement with a business they expect to ban and how come the business keeps getting unbanned each time after someone bans them <ul> <li>Also, multiple people report having or seeing similar experiences, &quot;I find it totally believable after having to work through multiple internal risk teams to get my test accounts past automated flaggers&quot;, etc</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=21030633">Stripe suspends account and support doesn't respond when user wants to know why</a> <ul> <li>User notes that they can't even refund their users: &quot;when I attempted to process a refund for a customer who had been injured &amp; was unable to continue training, I get an error message stating I am unable to process refunds! Am I supposed to tell my customer that my payment process won't refund his money? FYI - The payment I am attempting to refund HAS NOT been paid out yet - the money is sitting in my stripe account - but they refuse to refund it or even dignify me with a response.&quot;</li> <li><a href="https://news.ycombinator.com/item?id=21031171">Many people comment on how bad Stripe support has been for them, even this happy customer</a>: &quot;We’re using stripe and are overall happy. But their customer support is pretty bad. Lots of canned replies and ping-pong back and forth until you get someone to actually read your question&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=21030879">Stripe account suspended due to a lien; after the lien is removed, Stripe doesn't unsuspend the account and the account is still frozen</a> <ul> <li>Luckily, the son of the user is a programmer who knows someone at Stripe, so the issue gets fixed</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28085706">Developer's Stripe account is suspended with a standard message seen in other Stripe issues, &quot;Our systems recently identified charges that appear to be unauthorized by the customer, meaning that the owner of the card or bank account did not consent to these payments. This unfortunately means that we will no longer be able to accept payments ... &quot;</a> <ul> <li>Developer pressed some button to verify their identity, which resulted in &quot;Thank you for completing our verification process. Unfortunately, after conducting a further review of your account, we’ve determined that we still won't be able to accept payments for xx moving forward&quot;. 
They then tried to contact support, which did nothing</li> <li>After their HN post hits the front page, someone looks into it and it appears that the issue is fixed and the developer gets an email which reads &quot;It looks like some activity on your account was misidentified as unauthorized, causing potential charge declines and the cancellation of your account. This was a mistake on our end, so we've gone ahead and re-enabled your account.&quot; The developer notes that having support not respond until you can get a front-page HN post is a poor support strategy for users and that they lost credit card renewals during the time the account was suspended</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28088283">Developer has product launch derailed when Stripe suspends their account for no discernable reason</a>; they try talking to support which doesn't work <ul> <li>What does work is posting a comment on a front-page HN thread about someone else's Stripe horror story, which becomes the top comment, which causes a Stripe employee to look at the issue and unsuspend the account</li> </ul></li> <li><a href="https://justuseapp.com/page/stripe-ban">Stripe bans developer for having too many disputes when they've had one dispute, which was a $10 charge that they won after submitting evidence about the dispute to J.P. Morgan, the user's bank</a> <ul> <li>The developer appeals and receives a message saying &quot;after further conducting a review of your account, we've determined that we still won't be able to accept payments ... going forward. Stripe can only support users with a low risk of customer disputes. After reviewing your submitted information and website, it does seem like your business presents a higher level of risk than we can currently support&quot;</li> <li>After the story hits #1 on HN, their account is unbanned, but then a day later, it's rebanned for a completely different reason!</li> </ul></li> <li><a href="https://tinyletter.com/blauvelt/letters/looking-forward-after-some-of-the-hardest-few-months-of-my-life">Developer banned from Stripe; they aren't sure why, but wonder if it's because they're a Muslim charity</a></li> <li><a href="https://news.ycombinator.com/item?id=28881026">User, who appears to be a non-technical small business owner, has their Stripe account suspended, which also disables the &quot;call for help&quot; button and any other method of contacting support</a> <ul> <li>After six weeks, they find HN and make a post on HN, which gets the attention of someone at Stripe, and they're able to get their information out of Stripe and move to another payment processor, though they mention that they lose 6 weeks of revenue and close with &quot;please..do better. 
You're messing with people's livelihoods&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33142462">Developer notes that the only way they've been able to get Stripe issues resolved is by searching LinkedIn for connections at Stripe, because support just gives you the runaround</a></li> <li><a href="https://news.ycombinator.com/item?id=33139957">User (not using Stripe as a service, someone being charged by Stripe) gets fraudulently charged every month and issues a chargeback every month</a> <ul> <li>To stop having to issue a chargeback each month, user is stuck in a support loop where Stripe tells them to contact the credit card company and the credit card company tells them to contact Stripe <ul> <li>Stripe support also responds nonsensically sometimes, e.g., responding and asking if they need help resetting their password</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33140571">Developer notes that Stripe's &quot;radar&quot; fraud detection product misses extremely simple fraudulent cases</a>, such as &quot;Is it a 100th charge coming from the same IP in Ukraine with a Canadian VISA&quot;, or &quot;Same fake TLD for the email address, for a customer number 2235&quot;, so they use broad rules to reject fraudulent charges, but this also rejects many good charges and causes a loss of revenue</li> </ul> <h3 id="uber">Uber</h3> <ul> <li>Former manager of payments team banned from Uber due to incorrect fraud detection <ul> <li>Engineer spends six months trying to get it fixed and it's eventually fixed by adding a whitelist that manually unbans the former manager of the payments team, but the underlying issue isn't fixed</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18xr6ah/account_deactivate_for_not_delivering_drugs/">UberEats driver has account deactivated for not delivering drugs</a> <ul> <li>Driver originally contacted Uber support, who told them to contact the police. The police confirmed that the package contained crack cocaine</li> <li>The next day, Uber support called the driver and asked what happened. After the driver explained, support told them they would report the package as having been delivered to the police</li> <li>Shortly afterwards, driver's account was deactivated for not delivering the drugs</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/refusing_to_refund_customers/">Despite very clear documentation that UberEats delivered the wrong order, Uber refuses to refund user</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdbxf10/">User has account put into a degraded state because user asked for too many refunds on bad or missing items</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdcd041/">User has account put into a degraded state because user asked for refund on missing item</a></li> <li>I often wonder if the above will happen to me.
My local grocery store delivers using DoorDash and, most of the time, at least one item is missing (I also often get items I didn't order that I don't want); either the grocery store or the driver (or both) seem to frequently accidentally co-mingle items for different customers, resulting in a very high rate of errors</li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdc5j9p/">Asking for refund on order with error puts account into degraded state</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdd9ops/">Uber refuses to refund item that didn't arrive on UberEats, until user threatens to send evidence of non-refund to UK regulatory agency, at which point Uber issues the refund</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18hxjo2/uber_trying_to_steal_from_me/">Uber refuses to refund user when Uber Eats driver clearly delivered the wrong item; item isn't even an Uber delivery (it's a DoorDash delivery)</a></li> <li>A friend of mine had an Uber stop in the wrong place (it was multiple blocks away and this person ordered an Uber to pick them up from a medical appointment because they're severely concussed, so much so that walking is extremely taxing for them); the driver refused to come to them, so they agreed to go to the driver. When they were halfway there, the driver canceled the order in a way that caused my friend to pay a cancellation fee</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/189juqi/paid_20_for_a_12_pk_and_received_a_6_pk_over_uber/">User receives a 6 pack instead of 12 pack from Uber Eats</a> and customer service declines to refund the difference <ul> <li>In the comments, other people report having the same issue</li> </ul></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/185hz9a/comment/kb1v3vv/">Uber refuses to issue refund when stolen phone (?) 
results in $2000 in charges</a>; user only gets money back via chargeback and presumably gets banned for life, as is standard when issuing chargebacks to big tech companies</li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184no3q/how_uber_eats_stole_my_money/">UberEats refuses to issue refund when order is clearly delivered to wrong location (half a mile away)</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184no3q/comment/kaxk57x/">Early in the pandemic, Uber said they would compensate drivers if they got covid, but they refuse to do this</a> <ul> <li>After 6 months of having support deny their request, Uber gives them half of the compensation that was promised</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184t01x/uber_eats_punishing_shop_and_pay_drivers_when_an/">UberEats punishes driver for &quot;shop and pay&quot; order when item is out of stock and cannot be purchased</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184r0vn/why_i_will_never_order_from_uber_eats_and_why/">Disabled user orders groceries from UberEats; when order is delivered to the wrong building, support won't have item actually delivered</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184b0w7/cant_cancel_order_whats_going_on/">User can't cancel UberEats order when restaurant doesn't make order, leaving order in limbo</a> <ul> <li>The restaurant is closed and support responds saying that they can't do anything about it and the user needs to go to cancel in their app, but going to cancel in their app forwards them to chat with the support person who says that they need to go cancel in their app; after more discussion, support tells them that their order, which they know was never delivered, is not eligible for a refund</li> </ul></li> <li><a href="https://twitter.com/perrymetzger/status/1147589115245944833">Uber maps routes driver down an impossible route and a user indicates that reporting this issue does nothing</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/17ip5na/comment/k6wcmk2/">UberEats driver notes that reporting that someone stole an order from the restaurant is pointless because the order just gets sent back out to another driver anyway</a>; contacting support is useless and costs you valuable time that you could be using to earn money <ul> <li>Another driver reports the same thing</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=37893601">Two people scammed Uber Eats out of $1M</a></li> <li><a href="https://news.ycombinator.com/item?id=37894450">Uber drivers at a local airport cancel ride if fare is under $100</a></li> <li><a href="https://news.ycombinator.com/item?id=37070672">Uber driver suddenly has account blocked</a> <ul> <li>They find out that it's because a passenger reported an item lost; passenger later realizes they had the item all along and driver is unblocked, but driver was 2 hours from home and had to do the 2 hour drive home without being able to pick up fares</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/14u3dml/i_found_worms_in_my_food_and_ubereats_wont_offer/">User reports that UberEats delivery had slugs in it</a> and Uber does their now-standard move of not issuing a refund; they issue a refund after the post about this goes viral</li> <li><a href="https://www.reddit.com/r/UberEATS/comments/14a62vk/uber_wont_refund_me_for_the_drinks_my_delivery/">User reports that UberEats driver spilled drink, with clear evidence of this, and Uber refuses to refund until
after a thread about it goes viral and the user complains on Twitter</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/14a62vk/comment/jobtkvz/">Uber refuses to refund UberEats pizza delivery that never showed up</a>; user indicates that they'd never contacted support before and had never asked for a refund before</li> <li><a href="https://www.theguardian.com/technology/2023/apr/16/stop-or-ill-fire-you-the-driver-who-defied-ubers-automated-hr">Driver threatened with ban from Uber</a> and is unable to get any kind of remotely reasonable response until his union, the App Drivers and Couriers Union, worked with the Worker Info Exchange to go after Uber in court; &quot;Just before the case came to court, Uber apologised and acknowledged it had made a mistake&quot; <ul> <li>Uber issues an official response of &quot;We are disappointed that the court did not recognise the robust processes we have in place, including meaningful human review, when making a decision to deactivate a driver’s account due to suspected fraud.&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35592403">User notes that Uber drivers often do scams and that Uber doesn't seem to prevent this</a></li> <li><a href="https://news.ycombinator.com/item?id=35592349">User notes that they frequently get scammed by Uber drivers, but Uber <em>usually</em> refunds them when they report the scam to Uber</a></li> <li><a href="https://news.ycombinator.com/item?id=35592397">User notes that Uber drivers try to scam them ~1/20 times</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/zzia89/blocked_from_refunds_after_4_years/">User blocked from UberEats refunds after too many orders got screwed up and the user got too many refunds</a>; user sent photos of the messed up orders, but Uber support doesn't care about what happened, just the size and number of refunds</li> <li><a href="https://news.ycombinator.com/item?id=34051876">User's Uber account blocked for no discernible reason</a> <ul> <li>At the time, there was no way to contact support, but the user tries again after a few years, at which point there is a support form, but that still doesn't work <a href="https://news.ycombinator.com/item?id=34052099">User's wife is incorrectly banned from Uber.
User worked at Uber for four years and knows that the only way to get this fixed is to escalate to employees they know because, as a former employee, they know how useless support is</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34052126">User tries to get support for UberEATS not honoring a buy 1 get 1 free deal; support didn't help</a></li> <li><a href="https://news.ycombinator.com/item?id=34052279">Former Uber engineer notes that people randomly get banned for false positives</a></li> <li><a href="https://news.ycombinator.com/item?id=34052096">User had some Uber driver create an account with their email despite them never verifying the email with this new account</a>; user tried to have their email removed, but support says they can't discuss anything about the account since they're not the account owner</li> <li><a href="https://news.ycombinator.com/item?id=34052872">User's wife banned from Uber for no discernible reason</a></li> <li><a href="https://news.ycombinator.com/item?id=34052502">User's Uber account gets into apparently unfixable unexpected state</a></li> </ul> <h3 id="cloudflare">Cloudflare</h3> <ul> <li><a href="https://twitter.com/danluu/status/1285408780688080896">Blind user reports that Cloudflare doesn't have audio captchas, making much of the web unusable</a> <ul> <li>Person that handles Cloudflare captchas invites the user to solve an open research problem so that they're able to browse websites hosted on Cloudflare</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31573854">Cloudflare suspends user's account with no warning or information</a> <ul> <li>Contacting Cloudflare results in a response of &quot;Your account violated our terms of service specifically fraud. The suspension is permanent and we will not be making changes on our end.&quot;</li> <li>Account restored after viral HN thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35742606">User finds much of the internet unusable with Firefox due to Cloudflare CAPTCHA infinite loop</a> (switching to Chrome allows them to browse the internet); user writes up a detailed account of the issue and their issue is auto-closed (and other people report the exact same experience) <ul> <li><a href="https://forum.gitlab.com/t/cant-open-the-signin-page-it-keeps-showing-checking-your-browser-before-accessing-gitlab-com/45857/21">Same issue, different user</a></li> <li><a href="https://gitlab.com/librewolf-community/browser/linux/-/issues/244">Same issue, different user</a> <ul> <li><a href="https://github.com/arkenfox/user.js/issues/1253">Same issue, different user</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35745141">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35745236">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35745456">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35744343">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35746506">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35750008">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35743309">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35748050">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35743306">Same issue, different user</a></li> <li><a
href="https://news.ycombinator.com/item?id=35775633">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35745839">Similar issue, but with Brave instead of Firefox</a></li> <li><a href="https://news.ycombinator.com/item?id=35743287">Standard response of &quot;why use the product if it does this bad thing?&quot;</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35743638">People point out that, as usual, this standard response is nonsense because, just for example, government websites that some people need to interact with sometimes use Cloudflare</a></li> </ul></li> <li>After the story hits the front page of HN, a cloudflare exec replies and says people will look into it and then one person reports that the issue is fixed for them; I found tens of people who said that they reported the issue to Cloudflare, so I would guess that, overall, thousands of people reported the issue to cloudflare, which did nothing until someone wrote a post that hit the HN front page.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31575318">Cloudflare takes site down due to what appears to be incorrect malware detection</a></li> <li><a href="https://news.ycombinator.com/item?id=31576353">Cloudflare blocks transfer of domains in cases of incorrect fraud detection</a></li> <li><a href="https://news.ycombinator.com/item?id=38881712">Incorrect phishing detection results in huge phishing warning on website</a></li> <li><a href="https://community.cloudflare.com/t/cloudflare-mistakenly-flagged-my-website-as-phishing-now-shows-a-warning-and-misinforms-my-users/384158">Incorrect phishing detection results in URL being blocked</a> <ul> <li><a href="https://news.ycombinator.com/item?id=38882074">This was apparently triggered by a URL that hadn't existed for 10 years?</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38785565">User can't access any site on cloudflare because cloudflare decided they're malicious</a></li> <li><a href="https://news.ycombinator.com/item?id=38789602">User can't access any site on cloudflare and some other hosts, they believe because another user on their ISP had malware on their computer</a></li> <li><a href="https://news.ycombinator.com/item?id=38986187">Cloudflare blocks some HN comments</a> <ul> <li>Users do a bit of testing and find that the blocking is fairly arbitrary</li> </ul></li> <li><a href="https://jrhawley.ca/2023/08/07/blocked-by-cloudflare">User is blocked by Cloudflare and can no longer visit many (all?) sites that are behind Cloudflare when using Firefox</a> <ul> <li><a href="https://news.ycombinator.com/item?id=37049016">In the comments, on the order of 10 users note they've run into the same problem</a>. 
The article is highly upvoted and a Cloudflare PM looks into it (resolution unknown)</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35745454">RSS feeds blocked because Cloudflare detects RSS client isn't a browser with a real human directly operating it</a></li> <li><a href="https://news.ycombinator.com/item?id=35746090">User from Hong Kong finds that they often have to use a VPN to access sites on Cloudflare because Cloudflare thinks their IP is bad</a></li> <li><a href="https://news.ycombinator.com/item?id=35746284">User finds a large fraction of the internet unusable due to Cloudflare infinite browser check loop</a></li> <li><a href="https://www.ctrl.blog/entry/cloudflare-ip-blockade.html">User finds a large fraction of the internet unusable because Cloudflare has decided their IP is suspicious</a> <ul> <li><a href="https://news.ycombinator.com/item?id=32914809">User changes ISPs in order to be able to use the internet again</a></li> </ul></li> <li><a href="https://farside.link/nitter/taviso/status/832744397800214528">Security researcher finds security flaw in Cloudflare</a> <ul> <li><a href="https://farside.link/nitter//taviso/status/1566077115992133634">Researcher claims that afterwards, &quot;Cloudflare literally lobbied the FTC to investigate me and question the legality of openly discussing security research&quot;</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28792267">Cloudflare is a haven for scammers and copyright thieves</a></li> </ul> <h3 id="shopify">Shopify</h3> <ul> <li><a href="https://news.ycombinator.com/item?id=32042517">Having a store located in Puerto Rico causes payouts to be suspended every 3-4 months to verify address</a></li> <li><a href="https://nitter.net/mattzollerseitz/status/1541994356521013248">Kafkaesque support nightmare after payouts suspended</a> <ul> <li>Bizarre requirements, such as proving the bookstore has a license to sell the books they're selling</li> </ul></li> </ul> <h3 id="twitter-x">Twitter (X)</h3> <p>I dropped most of the Twitter stories since <a href="https://news.ycombinator.com/item?id=39402876">there are so many after the acquisition that it seems silly to list them</a>, but I've kept a few random ones.</p> <ul> <li><a href="https://www.reddit.com/r/Twitter/comments/18yksgu/anyone_else_getting_porn_ads/">Users report NSFW, pornographic ads</a></li> <li><a href="https://www.reddit.com/r/Twitter/comments/17n139b/twitters_beastility_problem/">Users report seeing bestiality, CP, gore, etc., when they don't want to see it</a></li> <li><a href="https://news.ycombinator.com/item?id=38001523">Scammers posing as customer service agents on Twitter/X</a></li> </ul> <h3 id="apple">Apple</h3> <ul> <li><a href="https://www.threads.net/@sineadbovell/post/C1nIp1TxD1_">Apple ML identifies user as three different people, depending on hairstyle</a></li> <li><a href="https://mjtsai.com/blog/2016/10/10/apple-and-kapeli-respond/">Long story about Apple removing an app from the app store</a></li> <li><a href="https://news.ycombinator.com/item?id=35677813">Rampant, easy-to-find scam/spam apps on the app store</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35678623">A developer asks, how is it that so many legitimate apps get taken down from the app store for bad reasons when so much blatant spam gets through</a>?
<ul> <li><a href="https://news.ycombinator.com/item?id=35679233">Lots of stories of legitimate apps getting autorejected immediately on submission</a>, often <a href="https://news.ycombinator.com/item?id=35681602">requiring jumping through nonsensical hoops to get the app reinstated</a></li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35679612">Apple forces developer to remove app for being too similar to another one of their apps because they have localized versions of their apps for different geos</a>; developer asks how come other developers are able to keep 400 basically identical apps up?</li> <li><a href="https://news.ycombinator.com/item?id=35680593">Searches for apps in various basic categories return scams and random puzzle games (in non-game categories)</a></li> <li><a href="https://news.ycombinator.com/item?id=33633324">User makes an app that lets you read HN</a>; the app store repeatedly rejects the app for reasons that don't make sense given what the app does, but support fails to understand the explanation <ul> <li>Luckily, it's Apple and not Google and they eventually manage to get a human on the phone, who understands the verbal explanation</li> </ul></li> </ul> <h3 id="doordash">DoorDash</h3> <ul> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18x6bbl/comment/kg2l3ha/">Driver can't contact customer, so DoorDash support tells driver to dump food in parking lot</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18htrrm/dasher_will_only_pick_up_my_food_if_i_send_them/">DoorDash driver says they'll only actually deliver the item if the user pays them $15 extra</a></li> <li>The above is apparently not that uncommon a scam, as a friend of mine had this happen to them as well</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/186redy/doordash_denied_refund_for_wrong_order_delivered/">DoorDash refuses refund for item that didn't arrive</a> <ul> <li>Of course, people have the standard response of &quot;why don't you stop using these crappy services?&quot; (the link above this one is also full of these) and someone responds, &quot;Because I'm disabled. Don't have a driver's license or a car. There isn't a bus stop near my apartment, I actually take paratransit to get to work, but I have to plan that a day ahead. Uber pulls the same shit, so I have to cycle through Uber, Door dash, and GrubHub based on who has coupons and hasn't stolen my money lately. Not everyone can just go pick something up.&quot;</li> </ul></li> <li>At one point, after I had a few bad deliveries in a row and gave a few drivers low ratings (I normally give people a high rating unless they don't even attempt to deliver to my door), I had a driver who took a really long time to deliver who, from watching the map, was just driving around.
With my rating, I wrote a note that said that it appeared that, from the route, the driver was multi-apping, at which point DoorDash removed my ability to rate drivers, so I switched to Uber</li> </ul> <h3 id="walmart">Walmart</h3> <ul> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18waqmr/comment/kfwksnc/">Driver steals delivery order; Walmart support does nothing and user has to drive to Walmart store to get issue fixed</a>, but this is actually possible, unlike with most tech companies <ul> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18waqmr/comment/kfx8npu/">Walmart employee notes that delivery is subcontracted out, with no real feedback mechanism</a></li> </ul></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18lj8nl/apparently_pizza_hut_now_uses_doordash_for/">Delivery doesn't arrive and user is unable to get refund</a></li> <li><a href="https://www.reddit.com/r/campbellriver/comments/18a7l0a/walmart_just_charged_me_80_dollars_for_phone/">Walmart refuses to refund user when they're charged the wrong price</a></li> </ul> <h3 id="airbnb">Airbnb</h3> <p>I've seen a ton of these but, for some reason, it didn't occur to me to add them to my list, so I don't have a lot of examples even though I've probably seen three times as many of these as I've seen Uber horror stories.</p> <ul> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18v8wq7/comment/kfph732/">AirBnB had cameras in the bathroom and bedroom and support refused to refund user</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/185hz9a/my_credit_card_was_compromised_and_the_thief/">AirBnB refuses to issue refund of scam booking to stolen credit card</a>; user has to issue chargeback and (as is standard) presumably gets their account banned for life</li> <li><a href="https://news.ycombinator.com/item?id=37277016">User finds cameras in AirBnB that cover sleeping areas and other private areas; AirBnB says they'll refund the user if the user books a hotel, then refuses to issue the refund</a> <ul> <li>User is a tenacious lawyer and goes through arbitration to get a refund, which takes a large amount of effort and almost an entire year (dates in the first-level reddit link above appear to be wrong if the dates in subsequent links are correct)</li> </ul></li> </ul> <h2 id="appendix-jeff-horwitz-s-broken-code-https-www-amazon-com-broken-code-facebook-harmful-secrets-dp-b0bw4wdnct-encoding-utf8-tag-abroaview-20-linkcode-ur2-linkid-db543113619810998b5a2415c3d801e6-camp-1789-creative-9325">Appendix: <a href="https://www.amazon.com/Broken-Code-Facebook-Harmful-Secrets/dp/B0BW4WDNCT/?&amp;_encoding=UTF8&amp;tag=abroaview-20&amp;linkCode=ur2&amp;linkId=db543113619810998b5a2415c3d801e6&amp;camp=1789&amp;creative=9325">Jeff Horwitz's Broken Code</a></h2> <p>Below are a few relevant excerpts. This is intended to be analogous to <a href="https://thezvi.wordpress.com/2019/05/30/quotes-from-moral-mazes/">Zvi Mowshowitz's Quotes from Moral Mazes</a>, which gives you an idea of what's in the book but is definitely not a replacement for reading the book. If these quotes are interesting, I recommend reading <a href="https://www.amazon.com/Broken-Code-Facebook-Harmful-Secrets/dp/B0BW4WDNCT/?&amp;_encoding=UTF8&amp;tag=abroaview-20&amp;linkCode=ur2&amp;linkId=db543113619810998b5a2415c3d801e6&amp;camp=1789&amp;creative=9325">the book</a>!</p> <blockquote> <p>The former employees who agreed to speak to me said troubling things from the get-go.
Facebook’s automated enforcement systems were flatly incapable of performing as billed. Efforts to engineer growth had inadvertently rewarded political zealotry. And the company knew far more about the negative effects of social media usage than it let on. <hr> as the election progressed, the company started receiving reports of mass fake accounts, bald-faced lies on campaign-controlled pages, and coordinated threats of violence against Duterte critics. After years in politics, Harbath wasn’t naive about dirty tricks. But when Duterte won, it was impossible to deny that Facebook’s platform had rewarded his combative and sometimes underhanded brand of politics. The president-elect banned independent media from his inauguration—but livestreamed the event on Facebook. His promised extrajudicial killings began soon after.</p> <p>A month after Duterte’s May 2016 victory came the United Kingdom’s referendum to leave the European Union. The Brexit campaign had been heavy on anti-immigrant sentiment and outright lies. As in the Philippines, the insurgent tactics seemed to thrive on Facebook—supporters of the “Leave” camp had obliterated “Remain” supporters on the platform. ... Harbath found all that to be gross, but there was no denying that Trump was successfully using Facebook and Twitter to short-circuit traditional campaign coverage, garnering attention in ways no campaign ever had. “I mean, he just has to go and do a short video on Facebook or Instagram and then the media covers it,” Harbath had marveled during a talk in Europe that spring. She wasn’t wrong: political reporters reported not just the content of Trump’s posts but their like counts.</p> <p>Did Facebook need to consider making some effort to fact-check lies spread on its platform? Harbath broached the subject with Adam Mosseri, then Facebook’s head of News Feed.</p> <p>“How on earth would we determine what’s true?” Mosseri responded. Depending on how you looked at it, it was an epistemic or a technological conundrum. Either way, the company chose to punt when it came to lies on its platform. <hr> Zuckerberg believed math was on Facebook’s side. Yes, there had been misinformation on the platform—but it certainly wasn’t the majority of content. Numerically, falsehoods accounted for just a fraction of all news viewed on Facebook, and news itself was just a fraction of the platform’s overall content. That such a fraction of a fraction could have thrown the election was downright illogical, Zuckerberg insisted.. ... But Zuckerberg was the boss. Ignoring Kornblut’s advice, he made his case the following day during a live interview at Techonomy, a conference held at the Ritz-Carlton in Half Moon Bay. Calling fake news a “very small” component of the platform, he declared the possibility that it had swung the election “a crazy idea.” ... A favorite saying at Facebook is that “Data Wins Arguments.” But when it came to Zuckerberg’s argument that fake news wasn’t a major problem on Facebook, the company didn’t have any data. As convinced as the CEO was that Facebook was blameless, he had no evidence of how “fake news” came to be, how it spread across the platform, and whether the Trump campaign had made use of it in their Facebook ad campaigns. ... One week after the election, BuzzFeed News reporter Craig Silverman published an analysis showing that, in the final months of the election, fake news had been the most viral election-related content on Facebook. 
A story falsely claiming that the pope had endorsed Trump had gotten more than 900,000 likes, reshares, and comments—more engagement than even the most widely shared stories from CNN, the New York Times, or the Washington Post. The most popular falsehoods, the story showed, had been in support of Trump.</p> <p>It was a bombshell. Interest in the term “fake news” spiked on Google the day the story was published—and it stayed high for years, first as Trump’s critics cited it as an explanation for the president-elect’s victory, and then as Trump co-opted the term to denigrate the media at large. ... even as the company’s Communications staff had quibbled with Silverman’s methodology, executives had demanded that News Feed’s data scientists replicate it. Was it really true that lies were the platform’s top election-related content?</p> <p>A day later, the staffers came back with an answer: almost.</p> <p>A quick and dirty review suggested that the data BuzzFeed was using had been slightly off, but the claim that partisan hoaxes were trouncing real news in Facebook’s News Feed was unquestionably correct. Bullshit peddlers had a big advantage over legitimate publications—their material was invariably compelling and exclusive. While scores of mainstream news outlets had written rival stories about Clinton’s leaked emails, for instance, none of them could compete with the headline “WikiLeaks CONFIRMS Hillary Sold Weapons to ISIS.” <hr> The engineers weren’t incompetent—just applying often-cited company wisdom that “Done Is Better Than Perfect.” Rather than slowing down, Maurer said, Facebook preferred to build new systems capable of minimizing the damage of sloppy work, creating firewalls to prevent failures from cascading, discarding neglected data before it piled up in server-crashing queues, and redesigning infrastructure so that it could be readily restored after inevitable blowups.</p> <p>The same culture applied to product design, where bonuses and promotions were doled out to employees based on how many features they “shipped”—programming jargon for incorporating new code into an app. Conducted semiannually, these “Performance Summary Cycle” reviews incented employees to complete products within six months, even if it meant the finished product was only minimally viable and poorly documented. Engineers and data scientists described living with perpetual uncertainty about where user data was being collected and stored—a poorly labeled data table could be a redundant file or a critical component of an important product. Brian Boland, a longtime vice president in Facebook’s Advertising and Partnerships divisions, recalled that a major data-sharing deal with Amazon once collapsed because Facebook couldn’t meet the retailing giant’s demand that it not mix Amazon’s data with its own.</p> <p>“Building things is way more fun than making things secure and safe,” he said of the company’s attitude. “Until there’s a regulatory or press fire, you don’t deal with it.” <hr> Nowhere in the system was there much place for quality control. Instead of trying to restrict problem content, Facebook generally preferred to personalize users’ feeds with whatever it thought they would want to see. Though taking a light touch on moderation had practical advantages—selling ads against content you don’t review is a great business—Facebook came to treat it as a moral virtue, too. 
The company wasn’t failing to supervise what users did—it was neutral.</p> <p>Though the company had come to accept that it would need to do some policing, executives continued to suggest that the platform would largely regulate itself. In 2016, with the company facing pressure to moderate terrorism recruitment more aggressively, Sheryl Sandberg had told the World Economic Forum that the platform did what it could, but that the lasting solution to hate on Facebook was to drown it in positive messages.</p> <p>“The best antidote to bad speech is good speech,” she declared, telling the audience how German activists had rebuked a Neo-Nazi political party’s Facebook page with “like attacks,” swarming it with messages of tolerance.</p> <p>Definitionally, the “counterspeech” Sandberg was describing didn’t work on Facebook. However inspiring the concept, interacting with vile content would have triggered the platform to distribute the objectionable material to a wider audience. <hr> ​​... in an internal memo by Andrew “Boz” Bosworth, who had gone from being one of Mark Zuckerberg’s TAs at Harvard to one of his most trusted deputies and confidants at Facebook. Titled “The Ugly,” Bosworth wrote the memo in June 2016, two days after the murder of a Chicago man was inadvertently livestreamed on Facebook. Facing calls for the company to rethink its products, Bosworth was rallying the troops.</p> <p>“We talk about the good and the bad of our work often. I want to talk about the ugly,” the memo began. Connecting people created obvious good, he said—but doing so at Facebook’s scale would produce harm, whether it was users bullying a peer to the point of suicide or using the platform to organize a terror attack.</p> <p>That Facebook would inevitably lead to such tragedies was unfortunate, but it wasn’t the Ugly. The Ugly, Boz wrote, was that the company believed in its mission of connecting people so deeply that it would sacrifice anything to carry it out.</p> <p>“That’s why all the work we do in growth is justified. All the questionable contact importing practices. All the subtle language that helps people stay searchable by friends. All of the work we do to bring more communication in. The work we will likely have to do in China some day. All of it,” Bosworth wrote. <hr> Every team responsible for ranking or recommending content rushed to overhaul their systems as fast as they could, setting off an explosion in the complexity of Facebook’s product. Employees found that the biggest gains often came not from deliberate initiatives but from simple futzing around. Rather than redesigning algorithms, which was slow, engineers were scoring big with quick and dirty machine learning experiments that amounted to throwing hundreds of variants of existing algorithms at the wall and seeing which versions stuck—which performed best with users. They wouldn’t necessarily know why a variable mattered or how one algorithm outperformed another at, say, predicting the likelihood of commenting. But they could keep fiddling until the machine learning model produced an algorithm that statistically outperformed the existing one, and that was good enough. <hr> ... in Facebook’s efforts to deploy a classifier to detect pornography, Arturo Bejar recalled, the system routinely tried to cull images of beds. Rather than learning to identify people screwing, the model had instead taught itself to recognize the furniture on which they most often did ... 
Similarly fundamental errors kept occurring, even as the company came to rely on far more advanced AI techniques to make far weightier and complex decisions than “porn/not porn.” The company was going all in on AI, both to determine what people should see, and also to solve any problems that might arise. <hr> Willner happened to read an NGO report documenting the use of Facebook to groom and arrange meetings with dozens of young girls who were then kidnapped and sold into sex slavery in Indonesia. Zuckerberg was working on his public speaking skills at the time and had asked employees to give him tough questions. So, at an all-hands meeting, Willner asked him why the company had allocated money for its first-ever TV commercial—a recently released ninety-second spot likening Facebook to chairs and other helpful structures—but no budget for a staffer to address its platform’s known role in the abduction, rape, and occasional murder of Indonesian children.</p> <p>Zuckerberg looked physically ill. He told Willner that he would need to look into the matter ... Willner said, the company was hopelessly behind in the markets where she believed Facebook had the highest likelihood of being misused. When she left Facebook in 2013, she had concluded that the company would never catch up. <hr> Within a few months, Facebook laid off the entire Trending Topics team, sending a security guard to escort them out of the building. A newsroom announcement said that the company had always hoped to make Trending Topics fully automated, and henceforth it would be. If a story topped Facebook’s metrics for viral news, it would top Trending Topics.</p> <p>The effects of the switch were not subtle. Freed from the shackles of human judgment, Facebook’s code began recommending users check out the commemoration of “National Go Topless Day,” a false story alleging that Megyn Kelly had been sacked by Fox News, and an only-too-accurate story titled “Man Films Himself Having Sex with a McChicken Sandwich.”</p> <p>Setting aside the feelings of McDonald’s social media team, there were reasons to doubt that the engagement on that final story reflected the public’s genuine interest in sandwich-screwing: much of the engagement was apparently coming from people wishing they’d never seen such accursed content. Still, Zuckerberg preferred it this way. Perceptions of Facebook’s neutrality were paramount; dubious and distasteful was better than biased.</p> <p>“Zuckerberg said anything that had a human in the loop we had to get rid of as much as possible,” the member of the early polarization team recalled.</p> <p>Among the early victims of this approach was the company’s only tool to combat hoaxes. For more than a decade, Facebook had avoided removing even the most obvious bullshit, which was less a principled stance and more the only possible option for the startup. “We were a bunch of college students in a room,” said Dave Willner, Charlotte Willner’s husband and the guy who wrote Facebook’s first content standards. “We were radically unequipped and unqualified to decide the correct history of the world.”</p> <p>But as the company started churning out billions of dollars in annual profit, there were, at least, resources to consider the problem of fake information. In early 2015, the company had announced that it had found a way to combat hoaxes without doing fact-checking—that is, without judging truthfulness itself. 
It would simply suppress content that users disproportionately reported as false.</p> <p>Nobody was so naive as to think that this couldn’t get contentious, or that the feature wouldn’t be abused. In a conversation with Adam Mosseri, one engineer asked how the company would deal, for example, with hoax “debunkings” of manmade global warming, which were popular on the American right. Mosseri acknowledged that climate change would be tricky but said that was not cause to stop: “You’re choosing the hardest case—most of them won’t be that hard.”</p> <p>Facebook publicly revealed its anti-hoax work to little fanfare in an announcement that accurately noted that users reliably reported false news. What it omitted was that users also reported as false any news story they didn’t like, regardless of its accuracy.</p> <p>To stem a flood of false positives, Facebook engineers devised a workaround: a “whitelist” of trusted publishers. Such safe lists are common in digital advertising, allowing jewelers to buy preauthorized ads on a host of reputable bridal websites, for example, while excluding domains like www.wedddings.com. Facebook’s whitelisting was pretty much the same: they compiled a generously large list of recognized news sites whose stories would be treated as above reproach.</p> <p>The solution was inelegant, and it could disadvantage obscure publishers specializing in factual but controversial reporting. Nonetheless, it effectively diminished the success of false viral news on Facebook. That is, until the company faced accusations of bias surrounding Trending Topics. Then Facebook preemptively turned it off.</p> <p>The disabling of Facebook’s defense against hoaxes was part of the reason fake news surged in the fall of 2016. <hr> Gomez-Uribe’s team hadn’t been tasked with working on Russian interference, but one of his subordinates noted something unusual: some of the most hyperactive accounts seemed to go entirely dark on certain days of the year. Their downtime, it turned out, corresponded with a list of public holidays in the Russian Federation.</p> <p>“They respect holidays in Russia?” he recalled thinking. “Are we all this fucking stupid?”</p> <p>But users didn’t have to be foreign trolls to promote problem posts. An analysis by Gomez-Uribe’s team showed that a class of Facebook power users tended to favor edgier content, and they were more prone to extreme partisanship. They were also, hour to hour, more prolific—they liked, commented, and reshared vastly more content than the average user. These accounts were outliers, but because Facebook recommended content based on aggregate engagement signals, they had an outsized effect on recommendations. If Facebook was a democracy, it was one in which everyone could vote whenever they liked and as frequently as they wished. ... hyperactive users tended to be more partisan and more inclined to share misinformation, hate speech, and clickbait, <hr> At Facebook, he realized, nobody was responsible for looking under the hood. “They’d trust the metrics without diving into the individual cases,” McNally said. “It was part of the ‘Move Fast’ thing. You’d have hundreds of launches every year that were only driven by bottom-line metrics.”</p> <p>Something else worried McNally. Facebook’s goal metrics tended to be calculated in averages.</p> <p>“It is a common phenomenon in statistics that the average is volatile, so certain pathologies could fall straight out of the geometry of the goal metrics,” McNally said. 
In his own reserved, mathematically minded way, he was calling Facebook’s most hallowed metrics crap. Making decisions based on metrics alone, without carefully studying the effects on actual humans, was reckless. But doing it based on average metrics was flat-out stupid. An average could rise because you did something that was broadly good for users, or it could go up because normal people were using the platform a tiny bit less and a small number of trolls were using Facebook way more.</p> <p>Everyone at Facebook understood this concept—it’s the difference between median and mean, a topic that is generally taught in middle school. But, in the interest of expediency, Facebook’s core metrics were all based on aggregate usage. It was as if a biologist was measuring the strength of an ecosystem based on raw biomass, failing to distinguish between healthy growth and a toxic algae bloom. <hr> One distinguishing feature was the shamelessness of fake news publishers’ efforts to draw attention. Along with bad information, their pages invariably featured clickbait (sensationalist headlines) and engagement bait (direct appeals for users to interact with content, thereby spreading it further).</p> <p>Facebook already frowned on those hype techniques as a little spammy, but truth be told it didn’t really do much about them. How much damage could a viral “Share this if you support the troops” post cause? <hr> Facebook’s mandate to respect users’ preferences posed another challenge. According to the metrics the platform used, misinformation was what people wanted. Every metric that Facebook used showed that people liked and shared stories with sensationalistic and misleading headlines.</p> <p>McNally suspected the metrics were obscuring the reality of the situation. His team set out to demonstrate that this wasn’t actually true. What they found was that, even though users routinely engaged with bait content, they agreed in surveys that such material was of low value to them. When informed that they had shared false content, they experienced regret. And they generally considered fact-checks to contain useful information. <hr> every time a well-intentioned proposal of that sort blew up in the company’s face, the people working on misinformation lost a bit of ground. In the absence of a coherent, consistent set of demands from the outside world, Facebook would always fall back on the logic of maximizing its own usage metrics.</p> <p>“If something is not going to play well when it hits mainstream media, they might hesitate when doing it,” McNally said. “Other times we were told to take smaller steps and see if anybody notices. The errors were always on the side of doing less.” ... “For people who wanted to fix Facebook, polarization was the poster child of ‘Let’s do some good in the world,’ ” McNally said. “The verdict came back that Facebook’s goal was not to do that work.” <hr> When the ranking team had begun its work, there had been no question that Facebook was feeding its users overtly false information at a rate that vastly outstripped any other form of media. This was no longer the case (even though the company would be raked over the coals for spreading “fake news” for years to come).</p> <p>Ironically, Facebook was in a poor position to boast about that success. With Zuckerberg having insisted throughout that fake news accounted for only a trivial portion of content, Facebook couldn’t celebrate that it might be on the path of making the claim true. 
<hr> multiple members of both teams recalled having had the same response when they first learned of MSI’s new engagement weightings: it was going to make people fight. Facebook’s good intent may have been genuine, but the idea that turbocharging comments, reshares, and emojis would have unpleasant effects was pretty obvious to people who had, for instance, worked on Macedonian troll farms, sensationalism, and hateful content.</p> <p>Hyperbolic headlines and outrage bait were already well-recognized digital publishing tactics, on and off Facebook. They traveled well, getting reshared in long chains. Giving a boost to content that galvanized reshares was going to add an exponential component to the already-healthy rate at which such problem content spread. At a time when the company was trying to address purveyors of misinformation, hyperpartisanship, and hate speech, it had just made their tactics more effective.</p> <p>Multiple leaders inside Facebook’s Integrity team raised concerns about MSI with Hegeman, who acknowledged the problem and committed to trying to fine-tune MSI later. But adopting MSI was a done deal, he said—Zuckerberg’s orders.</p> <p>Even non-Integrity staffers recognized the risk. When a Growth team product manager asked if the change meant News Feed would favor more controversial content, the manager of the team responsible for the work acknowledged it very well could. <hr> The effect was more than simply provoking arguments among friends and relatives. As a Civic Integrity researcher would later report back to colleagues, Facebook’s adoption of MSI appeared to have gone so far as to alter European politics. “Engagement on positive and policy posts has been severely reduced, leaving parties increasingly reliant on inflammatory posts and direct attacks on their competitors,” a Facebook social scientist wrote after interviewing political strategists about how they used the platform. In Poland, the parties described online political discourse as “a social-civil war.” One party’s social media management team estimated that they had shifted the proportion of their posts from 50/50 positive/negative to 80 percent negative and 20 percent positive, explicitly as a function of the change to the algorithm. Major parties blamed social media for deepening political polarization, describing the situation as “unsustainable.”</p> <p>The same was true of parties in Spain. “They have learnt that harsh attacks on their opponents net the highest engagement,” the researcher wrote. “From their perspective, they are trapped in an inescapable cycle of negative campaigning by the incentive structures of the platform.”</p> <p>If Facebook was making politics more combative, not everyone was upset about it. Extremist parties proudly told the researcher that they were running “provocation strategies” in which they would “create conflictual engagement on divisive issues, such as immigration and nationalism.”</p> <p>To compete, moderate parties weren’t just talking more confrontationally. They were adopting more extreme policy positions, too. It was a matter of survival. “While they acknowledge they are contributing to polarization, they feel like they have little choice and are asking for help,” the researcher wrote. <hr> Facebook’s most successful publishers of political content were foreign content farms posting absolute trash, stuff that made About.com’s old SEO chum look like it belonged in the New Yorker.</p> <p>Allen wasn’t the first staffer to notice the quality problem. 
The pages were an outgrowth of the fake news publishers that Facebook had battled in the wake of the 2016 election. While fact-checks and other crackdown efforts had made it far harder for outright hoaxes to go viral, the publishers had regrouped. Some of the same entities that BuzzFeed had written about in 2016—teenagers from a small Macedonian mountain town called Veles—were back in the game. How had Facebook’s news distribution system been manipulated by kids in a country with a per capita GDP of $5,800? <hr> When reviewing troll farm pages, he noticed something—their posts usually went viral. This was odd. Competition for space in users’ News Feeds meant that most pages couldn’t reliably get their posts in front of even those people who deliberately chose to follow them. But with the help of reshares and the News Feed algorithms, the Macedonian troll farms were routinely reaching huge audiences. If having a post go viral was hitting the attention jackpot, then the Macedonians were winning every time they put a buck into Facebook’s slot machine.</p> <p>The reason the Macedonians’ content was so good was that it wasn’t theirs. Virtually every post was either aggregated or stolen from somewhere else on the internet. Usually such material came from Reddit or Twitter, but the Macedonians were just ripping off content from other Facebook pages, too, and reposting it to their far larger audiences. This worked because, on Facebook, originality wasn’t an asset; it was a liability. Even for talented content creators, most posts turned out to be duds. But things that had already gone viral nearly always would do so again. <hr> Allen began a note about the problem from the summer of 2018 with a reminder. “The mission of Facebook is to empower people to build community. This is a good mission,” he wrote, before arguing that the behavior he was describing exploited attempts to do that. As an example, Allen compared a real community—a group known as the National Congress of American Indians. The group had clear leaders, produced original programming, and held offline events for Native Americans. But, despite NCAI’s earnest efforts, it had far fewer fans than a page titled “Native American Proub” [sic] that was run out of Vietnam. The page’s unknown administrators were using recycled content to promote a website that sold T-shirts.</p> <p>“They are exploiting the Native American Community,” Allen wrote, arguing that, even if users liked the content, they would never choose to follow a Native American pride page that was secretly run out of Vietnam. As proof, he included an appendix of reactions from users who had wised up. “If you’d like to read 300 reviews from real users who are very upset about pages that exploit the Native American community, here is a collection of 1 star reviews on Native American ‘Community’ and ‘Media’ pages,” he concluded.</p> <p>This wasn’t a niche problem. It was increasingly the default state of pages in every community. Six of the top ten Black-themed pages—including the number one page, “My Baby Daddy Ain’t Shit”—were troll farms. The top fourteen English-language Christian- and Muslim-themed pages were illegitimate. A cluster of troll farms peddling evangelical content had a combined audience twenty times larger than the biggest authentic page.</p> <p>“This is not normal. This is not healthy. We have empowered inauthentic actors to accumulate huge followings for largely unknown purposes,” Allen wrote in a later note. 
“Mostly, they seem to want to skim a quick buck off of their audience. But there are signs they have been in contact with the IRA.”</p> <p>So how bad was the problem? A sampling of Facebook publishers with significant audiences found that a full 40 percent relied on content that was either stolen, aggregated, or “spun”—meaning altered in a trivial fashion. The same thing was true of Facebook video content. One of Allen’s colleagues found that 60 percent of video views went to aggregators.</p> <p>The tactics were so well-known that, on YouTube, people were putting together instructional how-to videos explaining how to become a top Facebook publisher in a matter of weeks. “This is where I’m snagging videos from YouTube and I’ll re-upload them to Facebook,” said one guy in a video Allen documented, noting that it wasn’t strictly necessary to do the work yourself. “You can pay 20 dollars on Fiverr for a compilation—‘Hey, just find me funny videos on dogs, and chain them together into a compilation video.’ ”</p> <p>Holy shit, Allen thought. Facebook was losing in the later innings of a game it didn’t even understand it was playing. He branded the set of winning tactics “manufactured virality.”</p> <p>“What’s the easiest (lowest effort) way to make a big Facebook Page?” Allen wrote in an internal slide presentation. “Step 1: Find an existing, engaged community on [Facebook]. Step 2: Scrape/Aggregate content popular in that community. Step 3: Repost the most popular content on your Page.” <hr> Allen’s research kicked off a discussion. That a top page for American Vietnam veterans was being run from overseas—from Vietnam, no less—was just flat-out embarrassing. And unlike killing off Page Like ads, which had been a nonstarter for the way it alienated certain internal constituencies, if Allen and his colleagues could work up ways to systematically suppress trash content farms—material that was hardly exalted by any Facebook team—getting leadership to approve them might be a real possibility.</p> <p>This was where Allen ran up against that key Facebook tenet, “Assume Good Intent.” The principle had been applied to colleagues, but it was meant to be just as applicable to Facebook’s billions of users. In addition to being a nice thought, it was generally correct. The overwhelming majority of people who use Facebook do so in the name of connection, entertainment, and distraction, and not to deceive or defraud. But, as Allen knew from experience, the motto was hardly a comprehensive guide to living, especially when money was involved. <hr> With the help of another data scientist, Allen documented the inherent traits of crap publishers. They aggregated content. They went viral too consistently. They frequently posted engagement bait. And they relied on reshares from random users, rather than cultivating a dedicated long-term audience.</p> <p>None of these traits warranted severe punishment by itself. But together they added up to something damning. A 2019 screening for these features found 33,000 entities—a scant 0.175 percent of all pages—that were receiving a full 25 percent of all Facebook page views. Virtually none of them were “managed,” meaning controlled by entities that Facebook’s Partnerships team considered credible media professionals, and they accounted for just 0.14 percent of Facebook revenue. <hr> After it was bought, CrowdTangle was no longer a company but a product, available to media companies at no cost. However much publishers were angry with Facebook, they loved Silverman’s product. 
The only mandate Facebook gave him was for his team to keep building things that made publishers happy. Savvy reporters looking for viral story fodder loved it, too. CrowdTangle could surface, for instance, an up-and-coming post about a dog that saved its owner’s life, material that was guaranteed to do huge numbers on social media because it was already heading in that direction.</p> <p>CrowdTangle invited its formerly paying media customers to a party in New York to celebrate the deal. One of the media executives there asked Silverman whether Facebook would be using CrowdTangle internally as an investigative tool, a question that struck Silverman as absurd. Yes, it had offered social media platforms an early window into their own usage. But Facebook’s staff now outnumbered his own by several thousand to one. “I was like, ‘That’s ridiculous—I’m sure whatever they have is infinitely more powerful than what we have!’ ”</p> <p>It took Silverman more than a year to reconsider that answer. <hr> It was only as CrowdTangle started building tools to do this that the team realized just how little Facebook knew about its own platform. When Media Matters, a liberal media watchdog, published a report showing that MSI had been a boon for Breitbart, Facebook executives were genuinely surprised, sending around the article asking if it was true. As any CrowdTangle user would have known, it was.</p> <p>Silverman thought the blindness unfortunate, because it prevented the company from recognizing the extent of its quality problem. It was the same point that Jeff Allen and a number of other Facebook employees had been hammering on. As it turned out, the person to drive it home wouldn’t come from inside the company. It would be Jonah Peretti, the CEO of BuzzFeed.</p> <p>BuzzFeed had pioneered the viral publishing model. While “listicles” earned the publication a reputation for silly fluff in its early days, Peretti’s staff operated at a level of social media sophistication far above most media outlets, stockpiling content ahead of snowstorms and using CrowdTangle to find quick-hit stories that drew giant audiences.</p> <p>In the fall of 2018, Peretti emailed Cox with a grievance: Facebook’s Meaningful Social Interactions ranking change was pressuring his staff to produce scuzzier content. BuzzFeed could roll with the punches, Peretti wrote, but nobody on his staff would be happy about it. Distinguishing himself from publishers who just whined about lost traffic, Peretti cited one of his platform’s recent successes: a compilation of tweets titled “21 Things That Almost All White People Are Guilty of Saying.” The list—which included “whoopsie daisy,” “get these chips away from me,” and “guilty as charged”—had performed fantastically on Facebook. What bothered Peretti was the apparent reason why. Thousands of users were brawling in the comments section over whether the item itself was racist.</p> <p>“When we create meaningful content, it doesn’t get rewarded,” Peretti told Cox. Instead, Facebook was promoting “fad/junky science,” “extremely disturbing news,” “gross images,” and content that exploited racial divisions, according to a summary of Peretti’s email that circulated among Integrity staffers. Nobody at BuzzFeed liked producing that junk, Peretti wrote, but that was what Facebook was demanding. 
(In an illustration of BuzzFeed’s willingness to play the game, a few months later it ran another compilation titled “33 Things That Almost All White People Are Guilty of Doing.”) <hr> As users’ News Feeds became dominated by reshares, group posts, and videos, the “organic reach” of celebrity pages began tanking. “My artists built up a fan base and now they can’t reach them unless they buy ads,” groused Travis Laurendine, a New Orleans–based music promoter and technologist, in a 2019 interview. A page with 10,000 followers would be lucky to reach more than a tiny percent of them.</p> <p>Explaining why a celebrity’s Facebook reach was dropping even as they gained followers was hell for Partnerships, the team tasked with providing VIP service to notable users and selling them on the value of maintaining an active presence on Facebook. The job boiled down to convincing famous people, or their social media handlers, that if they followed a set of company-approved best practices, they would reach their audience. The problem was that those practices, such as regularly posting original content and avoiding engagement bait, didn’t actually work. Actresses who were the center of attention on the Oscars’ red carpet would have their posts beaten out by a compilation video of dirt bike crashes stolen from YouTube. ... Over time, celebrities and influencers began drifting off the platform, generally to sister company Instagram. “I don’t think people ever connected the dots,” Boland said. <hr> “Sixty-four percent of all extremist group joins are due to our recommendation tools,” the researcher wrote in a note summarizing her findings. “Our recommendation systems grow the problem.”</p> <p>This sort of thing was decidedly not supposed to be Civic’s concern. The team existed to promote civic participation, not police it. Still, a longstanding company motto was that “Nothing Is Someone Else’s Problem.” Chakrabarti and the researcher team took the findings to the company’s Protect and Care team, which worked on things like suicide prevention and bullying and was, at that point, the closest thing Facebook had to a team focused on societal problems.</p> <p>Protect and Care told Civic there was nothing it could do. The accounts creating the content were real people, and Facebook intentionally had no rules mandating truth, balance, or good faith. This wasn’t someone else’s problem—it was nobody’s problem. <hr> Even if the problem seemed large and urgent, exploring possible defenses against bad-faith viral discourse was going to be new territory for Civic, and the team wanted to start off slow. Cox clearly supported the team’s involvement, but studying the platform’s defenses against manipulation would still represent moonlighting from Civic’s main job, which was building useful features for public discussion online.</p> <p>A few months after the 2016 election, Chakrabarti made a request of Zuckerberg. To build tools to study political misinformation on Facebook, he wanted two additional engineers on top of the eight he already had working on boosting political participation.</p> <p>“How many engineers do you have on your team right now?” Zuckerberg asked. Chakrabarti told him. “If you want to do it, you’re going to have to come up with the resources yourself,” the CEO said, according to members of Civic. Facebook had more than 20,000 engineers—and Zuckerberg wasn’t willing to give the Civic team two of them to study what had happened during the election. 
<hr> While acknowledging the possibility that social media might not be a force for universal good was a step forward for Facebook, discussing the flaws of the existing platform remained difficult even internally, recalled product manager Elise Liu.</p> <p>“People don’t like being told they’re wrong, and they especially don’t like being told that they’re morally wrong,” she said. “Every meeting I went to, the most important thing to get in was ‘It’s not your fault. It happened. How can you be part of the solution? Because you’re amazing.’ ” <hr> “We do not and possibly never will have a model that captures even a majority of integrity harms, particularly in sensitive areas,” one engineer would write, noting that the company’s classifiers could identify only 2 percent of prohibited hate speech with enough precision to remove it.</p> <p>Inaction on the overwhelming majority of content violations was unfortunate, Rosen said, but not a reason to change course. Facebook’s bar for removing content was akin to the standard of guilt beyond a reasonable doubt applied in criminal cases. Even limiting a post’s distribution should require a preponderance of evidence. The combination of inaccurate systems and a high burden of proof would inherently mean that Facebook generally didn’t enforce its own rules against hate, Rosen acknowledged, but that was by design.</p> <p>“Mark personally values free expression first and foremost and would say this is a feature, not a bug,” he wrote.</p> <p>Publicly, the company declared that it had zero tolerance for hate speech. In practice, however, the company’s failure to meaningfully combat it was viewed as unfortunate—but highly tolerable. <hr> Myanmar, ruled by a military junta that exercised near-complete control until 2011, was the sort of place where Facebook was rapidly filling in for the civil society that the government had never allowed to develop. The app offered telecommunications services, real-time news, and opportunities for activism to a society unaccustomed to them.</p> <p>In 2012, ethnic violence between the country’s dominant Buddhist majority and its Rohingya Muslim minority left around two hundred people dead and prompted tens of thousands of people to flee their homes. To many, the dangers posed by Facebook in the situation seemed obvious, including to Aela Callan, a journalist and documentary filmmaker who brought them to the attention of Elliot Schrage in Facebook’s Public Policy division in 2013. All the like-minded Myanmar Cassandras received a polite audience in Menlo Park, and little more. Their argument that Myanmar was a tinderbox was validated in 2014, when a hardline Buddhist monk posted a false claim on Facebook that a Rohingya man had raped a Buddhist woman, a provocation that produced clashes, killing two people. But with the exception of Bejar’s Compassion Research team and Cox—who was personally interested in Myanmar, privately funding independent news media there as a philanthropic endeavor—nobody at Facebook paid a great deal of attention.</p> <p>Later accounts of the ignored warnings led many of the company’s critics to attribute Facebook’s inaction to pure callousness, though interviews with those involved in the cleanup suggest that the root problem was incomprehension. Human rights advocates were telling Facebook not just that its platform would be used to kill people but that it already had.
At a time when the company assumed that users would suss out and shut down misinformation without help, however, the information proved difficult to absorb. The version of Facebook that the company’s upper ranks knew—a patchwork of their friends, coworkers, family, and interests—couldn’t possibly be used as a tool of genocide.</p> <p>Facebook eventually hired its first Burmese-language content reviewer to cover whatever issues arose in the country of more than 50 million in 2015, and released a packet of flower-themed, peace-promoting digital stickers for Burmese users to slap on hateful posts. (The company would later note that the stickers had emerged from discussions with nonprofits and were “widely celebrated by civil society groups at the time.”) At the same time, it cut deals with telecommunications providers to provide Burmese users with Facebook access free of charge.</p> <p>The first wave of ethnic cleansing began later that same year, with leaders of the country’s military announcing on Facebook that they would be “solving the problem” of the country’s Muslim minority. A second wave of violence followed and, in the end, 25,000 people were killed by the military and Buddhist vigilante groups, 700,000 were forced to flee their homes, and thousands more were raped and injured. The UN branded the violence a genocide.</p> <p>Facebook still wasn’t responding. On its own authority, Gomez-Uribe’s News Feed Integrity team began collecting examples of the platform giving massive distribution to statements inciting violence. Even without Burmese-language skills, it wasn’t difficult. The torrent of anti-Rohingya hate and falsehoods from the Burmese military, government shills, and firebrand monks was not just overwhelming but overwhelmingly successful.</p> <p>This was exploratory work, not on the Integrity Ranking team’s half-year roadmap. When Gomez-Uribe, along with McNally and others, pushed to reassign staff to better grasp the scope of Facebook’s problem in Myanmar, they were shot down.</p> <p>“We were told no,” Gomez-Uribe recalled. “It was clear that leadership didn’t want to understand it more deeply.”</p> <p>That changed, as it so often did, when Facebook’s role in the problem became public. A couple of weeks after the worst violence broke out, an international human rights organization condemned Facebook for inaction. Within seventy-two hours, Gomez-Uribe’s team was urgently asked to figure out what was going on.</p> <p>When it was all over, Facebook’s negligence was clear. A UN report declared that “the response of Facebook has been slow and ineffective,” and an external human rights consultant that Facebook hired eventually concluded that the platform “has become a means for those seeking to spread hate and cause harm.”</p> <p>In a series of apologies, the company acknowledged that it had been asleep at the wheel and pledged to hire more staffers capable of speaking Burmese. Left unsaid was why the company screwed up. The truth was that it had no idea what was happening on its platform in most countries. <hr> Barnes was put in charge of “meme busting”—that is, combating the spread of viral hoaxes about Facebook, on Facebook. No, the company was not going to claim permanent rights to all your photos unless you reshared a post warning of the threat. And no, Zuckerberg was not giving away money to the people who reshared a post saying so. 
Suppressing these digital chain letters had an obvious payoff; they tarred Facebook’s reputation and served no purpose.</p> <p>Unfortunately, restricting the distribution of this junk via News Feed wasn’t enough to sink it. The posts also spread via Messenger, in large part because the messaging platform was prodding recipients of the messages to forward them on to a list of their friends.</p> <p>The Advocacy team that Barnes had worked on sat within Facebook’s Growth division, and Barnes knew the guy who oversaw Messenger forwarding. Armed with data showing that the current forwarding feature was flooding the platform with anti-Facebook crap, he arranged a meeting.</p> <p>Barnes’s colleague heard him out, then raised an objection.</p> <p>“It’s really helping us with our goals,” the man said of the forwarding feature, which allowed users to reshare a message to a list of their friends with just a single tap. Messenger’s Growth staff had been tasked with boosting the number of “sends” that occurred each day. They had designed the forwarding feature to encourage precisely the impulsive sharing that Barnes’s team was trying to stop.</p> <p>Barnes hadn’t so much lost a fight over Messenger forwarding as failed to even start one. At a time when the company was trying to control damage to its reputation, it was also being intentionally agnostic about whether its own users were slandering it. What was important was that they shared their slander via a Facebook product.</p> <p>“The goal was in itself a sacred thing that couldn’t be questioned,” Barnes said. “They’d specifically created this flow to maximize the number of times that people would send messages. It was a Ferrari, a machine designed for one thing: infinite scroll.” <hr> Entities like Liftable Media, a digital media company run by longtime Republican operative Floyd Brown, had built an empire on pages that began by spewing upbeat clickbait, then pivoted to supporting Trump ahead of the 2016 election. To compound its growth, Liftable began buying up other spammy political Facebook pages with names like “Trump Truck,” “Patriot Update,” and “Conservative Byte,” running its content through them.</p> <p>In the old world of media, the strategy of managing loads of interchangeable websites and Facebook pages wouldn’t make sense. For both economies of scale and to build a brand, print and video publishers targeted each audience through a single channel. (The publisher of Cat Fancy might expand into Bird Fancy, but was unlikely to cannibalize its audience by creating a near-duplicate magazine called Cat Enthusiast.)</p> <p>That was old media, though. On Facebook, flooding the zone with competing pages made sense because of some algorithmic quirks. First, the algorithm favored variety. To prevent a single popular and prolific content producer from dominating users’ feeds, Facebook blocked any publisher from appearing too frequently. Running dozens of near-duplicate pages sidestepped that, giving the same content more bites at the apple.</p> <p>Coordinating a network of pages provided a second, greater benefit. It fooled a News Feed feature that promoted virality. News Feed had been designed to favor content that appeared to be emerging organically in many places. If multiple entities you followed were all talking about something, the odds were that you would be interested so Facebook would give that content a big boost.</p> <p>The feature played right into the hands of motivated publishers. 
By recommending that users who followed one page like its near doppelgängers, a publisher could create overlapping audiences, using a dozen or more pages to synthetically mimic a hot story popping up everywhere at once. ... Zhang, working on the issue in 2020, found that the tactic was being used to benefit publishers (Business Insider, Daily Wire, a site named iHeartDogs), as well as political figures and just about anyone interested in gaming Facebook content distribution (Dairy Queen franchises in Thailand). Outsmarting Facebook didn’t require subterfuge. You could win a boost for your content by running it on ten different pages that were all administered by the same account.</p> <p>It would be difficult to overstate the size of the blind spot that Zhang exposed when she found it ... ... Liftable was an archetype of that malleability. The company had begun as a vaguely Christian publisher of the low-calorie inspirational content that once thrived on Facebook. But News Feed was a fickle master, and by 2015 Facebook had changed its recommendations in ways that stopped rewarding things like “You Won’t Believe Your Eyes When You See This Phenomenally Festive Christmas Light Show.”</p> <p>The algorithm changes sent an entire class of rival publishers like Upworthy and ViralNova into a terminal tailspin, but Liftable was a survivor. In addition to shifting toward stories with headlines like “Parents Furious: WATCH What Teacher Did to Autistic Son on Stage in Front of EVERYONE,” Liftable acquired WesternJournal.com and every large political Facebook page it could get its hands on.</p> <p>This approach was hardly a secret. Despite Facebook rules prohibiting the sale of pages, Liftable issued press releases about its acquisition of “new assets”—Facebook pages with millions of followers. Once brought into the fold, the network of pages would blast out the same content.</p> <p>Nobody inside or outside Facebook paid much attention to the craven amplification tactics and dubious content that publishers such as Liftable were adopting. Headlines like “The Sodomites Are Aiming for Your Kids” seemed more ridiculous than problematic. But Floyd and the publishers of such content knew what they were doing, and they capitalized on Facebook’s inattention and indifference. <hr> The early work trying to figure out how to police publishers’ tactics had come from staffers attached to News Feed, but that team was broken up during the consolidation of integrity work under Guy Rosen ... “The News Feed integrity staffers were told not to work on this, that it wasn’t worth their time,” recalled product manager Elise Liu ... Facebook’s policies certainly made it seem like removing networks of fake accounts shouldn’t have been a big deal: the platform required users to go by their real names in the interests of accountability and safety. In practice, however, the rule that users were allowed a single account bearing their legal name generally went unenforced. <hr> In the spring of 2018, the Civic team began agitating to address dozens of other networks of recalcitrant pages, including one tied to a site called “Right Wing News.” The network was run by Brian Kolfage, a U.S. veteran who had lost both legs and a hand to a missile in Iraq.</p> <p>Harbath’s first reaction to Civic’s efforts to take down a prominent disabled veteran’s political media business was a flat no. 
She couldn’t dispute the details of his misbehavior—Kolfage was using fake or borrowed accounts to spam Facebook with links to vitriolic, sometimes false content. But she also wasn’t ready to shut him down for doing things that the platform had tacitly allowed.</p> <p>“Facebook had let this guy build up a business using shady-ass tactics and scammy behavior, so there was some reluctance to basically say, like, ‘Sorry, the things that you’ve done every day for the last several years are no longer acceptable,’ ” she said. ... Other than simply giving up on enforcing Facebook’s rules, there wasn’t much left to try. Facebook’s Public Policy team remained uncomfortable with taking down a major domestic publisher for inauthentic amplification, and it made the Civic team prove that Kolfage’s content, in addition to his tactics, was objectionable. This hurdle became a permanent but undisclosed change in policy: cheating to manipulate Facebook’s algorithm wasn’t enough to get you kicked off the platform—you had to be promoting something bad, too. <hr> Tests showed that the takedowns cut the amount of American political spam content by 20 percent overnight. Chakrabarti later admitted to his subordinates that he had been surprised that they had succeeded in taking a major action on domestic attempts to manipulate the platform. He had privately been expecting Facebook’s leadership to shut the effort down. <hr> A staffer had shown Cox that a Brazilian legislator who supported the populist Jair Bolsonaro had posted a fabricated video of a voting machine that had supposedly been rigged in favor of his opponent. The doctored footage had already been debunked by fact-checkers, which normally would have provided grounds to bring the distribution of the post to an abrupt halt. But Facebook’s Public Policy team had long ago determined, after a healthy amount of discussion regarding the rule’s application to President Donald Trump, that government officials’ posts were immune from fact-checks. Facebook was therefore allowing false material that undermined Brazilians’ trust in democracy to spread unimpeded.</p> <p>... Despite Civic’s concerns, voting in Brazil went smoothly. The same couldn’t be said for Civic’s colleagues over at WhatsApp. In the final days of the Brazilian election, viral misinformation transmitted by unfettered forwarding had blown up. <hr> Supporters of the victorious Bolsonaro, who shared their candidate’s hostility toward homosexuality, were celebrating on Facebook by posting memes of masked men holding guns and bats. The accompanying Portuguese text combined the phrase “We’re going hunting” with a gay slur, and some of the posts encouraged users to join WhatsApp groups supposedly for that violent purpose. Engagement was through the roof, prompting Facebook’s systems to spread them even further.</p> <p>While the company’s hate classifiers had been good enough to detect the problem, they weren’t reliable enough to automatically remove the torrent of hate. Rather than celebrating the race’s conclusion, Civic War Room staff put out an after-hours call for help from Portuguese-speaking colleagues. One polymath data scientist, a non-Brazilian who spoke great Portuguese and happened to be gay, answered the call.</p> <p>For Civic staffers, an incident like this wasn’t a good time, but it wasn’t extraordinary, either. 
They had come to accept that unfortunate things like this popped up on the platform sometimes, especially around election time.</p> <p>It took a glance at the Portuguese-speaking data scientist to remind Barnes how strange it was that viral horrors had become so routine on Facebook. The volunteer was hard at work just like everyone else, but he was quietly sobbing as he worked. “That moment is embedded in my mind,” Barnes said. “He’s crying, and it’s going to take the Operations team ten hours to clear this.” <hr> India was a huge target for Facebook, which had already been locked out of China, despite much effort by Zuckerberg. The CEO had jogged unmasked through Tiananmen Square as a sign that he wasn’t bothered by Beijing’s notorious air pollution. He had asked President Xi Jinping, unsuccessfully, to choose a Chinese name for his first child. The company had even worked on a secret tool that would have allowed Beijing to directly censor the posts of Chinese users. All of it was to little avail: Facebook wasn’t getting into China. By 2019, Zuckerberg had changed his tune, saying that the company didn’t want to be there—Facebook’s commitment to free expression was incompatible with state repression and censorship. Whatever solace Facebook derived from adopting this moral stance, succeeding in India became all the more vital: If Facebook wasn’t the dominant platform in either of the world’s two most populous countries, how could it be the world’s most important social network? <hr> Civic’s work got off to an easy start because the misbehavior was obvious. Taking only perfunctory measures to cover their tracks, all major parties were running networks of inauthentic pages, a clear violation of Facebook rules.</p> <p>The BJP’s IT cell seemed the most successful. The bulk of the coordinated posting could be traced to websites and pages created by Silver Touch, the company that had built Modi’s reelection campaign app. With cumulative follower counts in excess of 10 million, the network hit both of Facebook’s agreed-upon standards for removal: they were using banned tricks to boost engagement and violating Facebook content policies by running fabricated, inflammatory quotes that allegedly exposed Modi opponents’ affection for rapists and that denigrated Muslims.</p> <p>With documentation of all parties’ bad behavior in hand by early spring, the Civic staffers overseeing the project arranged an hour-long meeting in Menlo Park with Das and Harbath to make the case for a mass takedown. Das showed up forty minutes late and pointedly let the team know that, despite the ample cafés, cafeterias, and snack rooms at the office, she had just gone out for coffee. As the Civic Team’s Liu and Ghosh tried to rush through several months of research showing how the major parties were relying on banned tactics, Das listened impassively, then told them she’d have to approve any action they wanted to take.</p> <p>The team pushed ahead with preparing to remove the offending pages. Mindful as ever of optics, the team was careful to package a large group of abusive pages together, some from the BJP’s network and others from the INC’s far less successful effort. With the help of Nathaniel Gleicher’s security team, a modest collection of Facebook pages traced to the Pakistani military was thrown in for good measure.</p> <p>Even with the attempt at balance, the effort soon got bogged down.
Higher-ups’ enthusiasm for the takedowns was so lacking that Chakrabarti and Harbath had to lobby Kaplan directly before they got approval to move forward.</p> <p>“I think they thought it was going to be simpler,” Harbath said of the Civic team’s efforts.</p> <p>Still, Civic kept pushing. On April 1, less than two weeks before voting was set to begin, Facebook announced that it had taken down more than one thousand pages and groups in separate actions against inauthentic behavior. In a statement, the company named the guilty parties: the Pakistani military, the IT cell of the Indian National Congress, and “individuals associated with an Indian IT firm, Silver Touch.”</p> <p>For anyone who knew what was truly going on, the announcement was suspicious. Of the three parties cited, the pro-BJP propaganda network was by far the largest—and yet the party wasn’t being called out like the others.</p> <p>Harbath and another person familiar with the mass takedown insisted this had nothing to do with favoritism. It was, they said, simply a mess. Where the INC had abysmally failed at subterfuge, making the attribution unavoidable under Facebook’s rules, the pro-BJP effort had been run through a contractor. That fig leaf gave the party some measure of deniability, even if it might fall short of plausible.</p> <p>If the announcement’s omission of the BJP wasn’t a sop to India’s ruling party, what Facebook did next certainly seemed to be. Even as it was publicly mocking the INC for getting caught, the BJP was privately demanding that Facebook reinstate the pages the party claimed it had no connection to. Within days of the takedown, Das and Kaplan’s team in Washington were lobbying hard to reinstate several BJP-connected entities that Civic had fought so hard to take down. They won, and some of the BJP pages got restored.</p> <p>With Civic and Public Policy at odds, the whole messy incident got kicked up to Zuckerberg to hash out. Kaplan argued that applying American campaign standards to India and many other international markets was unwarranted. Besides, no matter what Facebook did, the BJP was overwhelmingly favored to return to power when the election ended in May, and Facebook was seriously pissing it off.</p> <p>Zuckerberg concurred with Kaplan’s qualms. The company should absolutely continue to crack down hard on covert foreign efforts to influence politics, he said, but in domestic politics the line between persuasion and manipulation was far less clear. Perhaps Facebook needed to develop new rules—ones with Public Policy’s approval.</p> <p>The result was a near moratorium on attacking domestically organized inauthentic behavior and political spam. Imminent plans to remove illicitly coordinated Indonesian networks of pages, groups, and accounts ahead of upcoming elections were shut down. Civic’s wings were getting clipped. <hr> By 2019, Jin’s standing inside the company was slipping. He had made a conscious decision to stop working so much, offloading parts of his job onto others, something that did not conform to Facebook’s culture. More than that, Jin had a habit of framing what the company did in moral terms. Was this good for users? Was Facebook truly making its products better?</p> <p>Other executives were careful when bringing decisions to Zuckerberg to not frame decisions in terms of right or wrong. Everyone was trying to work collaboratively, to make a better product, and whatever Zuckerberg decided was good. Jin’s proposals didn’t carry that tone. 
He was unfailingly respectful, but he was also clear on what he considered the range of acceptable positions. Alex Schultz, the company’s chief marketing officer, once remarked to a colleague that the problem with Jin was that he made Zuckerberg feel like shit.</p> <p>In July 2019, Jin wrote a memo titled “Virality Reduction as an Integrity Strategy” and posted it in a 4,200-person Workplace group for employees working on integrity problems. “There’s a growing set of research showing that some viral channels are used for bad more than they are used for good,” the memo began. “What should our principles be around how we approach this?” Jin went on to list, with voluminous links to internal research, how Facebook’s products routinely garnered higher growth rates at the expense of content quality and user safety. Features that produced marginal usage increases were disproportionately responsible for spam on WhatsApp, the explosive growth of hate groups, and the spread of false news stories via reshares, he wrote.</p> <p>None of the examples were new. Each of them had been previously cited by Product and Research teams as discrete problems that would require either a design fix or extra enforcement. But Jin was framing them differently. In his telling, they were the inexorable result of Facebook’s efforts to speed up and grow the platform.</p> <p>The response from colleagues was enthusiastic. “Virality is the goal of tenacious bad actors distributing malicious content,” wrote one researcher. “Totally on board for this,” wrote another, who noted that virality helped inflame anti-Muslim sentiment in Sri Lanka after a terrorist attack. “This is 100% direction to go,” Brandon Silverman of CrowdTangle wrote.</p> <p>After more than fifty overwhelmingly positive comments, Jin ran into an objection from Jon Hegeman, the executive at News Feed who by then had been promoted to head of the team. Yes, Jin was probably right that viral content was disproportionately worse than nonviral content, Hegeman wrote, but that didn’t mean that the stuff was bad on average. ... Hegeman was skeptical. If Jin was right, he responded, Facebook should probably be taking drastic steps like shutting down all reshares, and the company wasn’t in much of a mood to try. “If we remove a small percentage of reshares from people’s inventory,” Hegeman wrote, “they decide to come back to Facebook less.” <hr> If Civic had thought Facebook’s leadership would be rattled by the discovery that the company’s growth efforts had been making Facebook’s integrity problems worse, they were wrong. Not only was Zuckerberg hostile to future anti-growth work; he was beginning to wonder whether some of the company’s past integrity efforts were misguided.</p> <p>Empowered to veto not just new integrity proposals but work that had long ago been approved, the Public Policy team began declaring that some failed to meet the company’s standards for “legitimacy.” Sparing Sharing, the demotion of content pushed by hyperactive users—already dialed down by 80 percent at its adoption—was set to be dialed back completely. (It was ultimately spared but further watered down.)</p> <p>“We cannot assume links shared by people who shared a lot are bad,” a writeup of plans to undo the change said. (In practice, the effect of rolling back Sparing Sharing, even in its weakened form, was unambiguous. 
Views of “ideologically extreme content for users of all ideologies” would immediately rise by a double-digit percentage, with the bulk of the gains going to the far right.)</p> <p>“Informed Sharing”—an initiative that had demoted content shared by people who hadn’t clicked on the posts in question, and which had proved successful in diminishing the spread of fake news—was also slated for decommissioning.</p> <p>“Being less likely to share content after reading it is not a good indicator of integrity,” stated a document justifying the planned discontinuation.</p> <p>A company spokeswoman denied numerous Integrity staffers’ contention that the Public Policy team had the ability to veto or roll back integrity changes, saying that Kaplan’s team was just one voice among many internally. But, regardless of who was calling the shots, the company’s trajectory was clear. Facebook wasn’t just slow-walking integrity work anymore. It was actively planning to undo large chunks of it. <hr> Facebook could be certain of meeting its goals for the 2020 election if it was willing to slow down viral features. This could include imposing limits on reshares, message forwarding, and aggressive algorithmic amplification—the kind of steps that the Integrity teams throughout Facebook had been pushing to adopt for more than a year. The moves would be simple and cheap. Best of all, the methods had been tested and guaranteed success in combating longstanding problems.</p> <p>The correct choice was obvious, Jin suggested, but Facebook seemed strangely unwilling to take it. It would mean slowing down the platform’s growth, the one tenet that was inviolable.</p> <p>“Today the bar to ship a pro-Integrity win (that may be negative to engagement) often is higher than the bar to ship pro-engagement win (that may be negative to Integrity),” Jin lamented. If the situation didn’t change, he warned, it risked a 2020 election disaster from “rampant harmful virality.” <hr> Even including downranking, “we estimate that we may action as little as 3–5% of hate and 0.6% of [violence and incitement] on Facebook, despite being the best in the world at it,” one presentation noted. Jin knew these stats, according to people who worked with him, but was too polite to emphasize them. <hr> Company researchers used multiple methods to demonstrate QAnon’s gravitational pull, but the simplest and most visceral proof came from setting up a test account and seeing where Facebook’s algorithms took it.</p> <p>After setting up a dummy account for “Carol”—a hypothetical forty-one-year-old conservative woman in Wilmington, North Carolina, whose interests included the Trump family, Fox News, Christianity, and parenting—the researcher watched as Facebook guided Carol from those mainstream interests toward darker places.</p> <p>Within a day, Facebook’s recommendations had “devolved toward polarizing content.” Within a week, Facebook was pushing a “barrage of extreme, conspiratorial, and graphic content.” ... The researcher’s write-up included a plea for action: if Facebook was going to push content this hard, the company needed to get a lot more discriminating about what it pushed.</p> <p>Later write-ups would acknowledge that such warnings went unheeded. <hr> As executives filed out, Zuckerberg pulled Integrity’s Guy Rosen aside. 
“Why did you show me this in front of so many people?” Zuckerberg asked Rosen, who as Chakrabarti’s boss bore responsibility for his subordinate’s presentation landing on that day’s agenda.</p> <p>Zuckerberg had good reason to be unhappy that so many executives had watched him being told in plain terms that the forthcoming election was shaping up to be a disaster. In the course of investigating Cambridge Analytica, regulators around the world had already subpoenaed thousands of pages of documents from the company and had pushed for Zuckerberg’s personal communications going back for the better part of the decade. Facebook had paid $5 billion to the U.S. Federal Trade Commission to settle one of the most prominent inquiries, but the threat of subpoenas and depositions wasn’t going away. ... If there had been any doubt that Civic was the Integrity division’s problem child, lobbing such a damning document straight onto Zuckerberg’s desk settled it. As Chakrabarti later informed his deputies, Rosen told him that Civic would henceforth be required to run such material through other executives first—strictly for organizational reasons, of course.</p> <p>Chakrabarti didn’t take the reining in well. A few months later, he wrote a scathing appraisal of Rosen’s leadership as part of the company’s semiannual performance review. Facebook’s top integrity official was, he wrote, “prioritizing PR risk over social harm.” <hr> Facebook still hadn’t given Civic the green light to resume the fight against domestically coordinated political manipulation efforts. Its fact-checking program was too slow to effectively shut down the spread of misinformation during a crisis. And the company still hadn’t addressed the “perverse incentives” resulting from News Feed’s tendency to favor divisive posts. “Remains unclear if we have a societal responsibility to reduce exposure to this type of content,” an updated presentation from Civic tartly stated.</p> <p>“Samidh was trying to push Mark into making those decisions, but he didn’t take the bait,” Harbath recalled. <hr> Cutler remarked that she would have pushed for Chakrabarti’s ouster if she didn’t expect a substantial portion of his team would mutiny. (The company denies Cutler said this.) <hr> a British study had found that Instagram had the worst effect of any social media app on the health and well-being of teens and young adults. <hr> The second was the death of Molly Russell, a fourteen-year-old from North London. Though “apparently flourishing,” as a later coroner’s inquest found, Russell had died by suicide in late 2017. Her death was treated as an inexplicable local tragedy until the BBC ran a report on social media activity in 2019. Russell had followed a large group of accounts that romanticized depression, self-harm, and suicide, and she had engaged with more than 2,100 macabre posts, mostly on Instagram. Her final login had come at 12:45 the morning she died.</p> <p>“I have no doubt that Instagram helped kill my daughter,” her father told the BBC.</p> <p>Later research—both inside and outside Instagram—would demonstrate that a class of commercially motivated accounts had seized on depression-related content for the same reason that others focused on car crashes or fighting: the stuff pulled high engagement. But serving pro-suicide content to a vulnerable kid was clearly indefensible, and the platform pledged to remove and restrict the recommendation of such material, along with hiding hashtags like #Selfharm.
Beyond exposing an operational failure, the extensive coverage of Russell’s death associated Instagram with rising concerns about teen mental health. <hr> Though much attention, both inside and outside the company, had been paid to bullying, the most serious risks weren’t the result of people mistreating each other. Instead, the researchers wrote, harm arose when a user’s existing insecurities combined with Instagram’s mechanics. “Those who are dissatisfied with their lives are more negatively affected by the app,” one presentation noted, with the effects most pronounced among girls unhappy with their bodies and social standing.</p> <p>There was a logic here, one that teens themselves described to researchers. Instagram’s stream of content was a “highlight reel,” at once real life and unachievable. This was manageable for users who arrived in a good frame of mind, but it could be poisonous for those who showed up vulnerable. Seeing comments about how great an acquaintance looked in a photo would make a user who was unhappy about her weight feel bad—but it didn’t make her stop scrolling.</p> <p>“They often feel ‘addicted’ and know that what they’re seeing is bad for their mental health but feel unable to stop themselves,” the “Teen Mental Health Deep Dive” presentation noted. Field research in the U.S. and U.K. found that more than 40 percent of Instagram users who felt “unattractive” traced that feeling to Instagram. Among American teens who said they had thought about dying by suicide in the past month, 6 percent said the feeling originated on the platform. In the U.K., the number was double that.</p> <p>“Teens who struggle with mental health say Instagram makes it worse,” the presentation stated. “Young people know this, but they don’t adopt different patterns.”</p> <p>These findings weren’t dispositive, but they were unpleasant, in no small part because they made sense. Teens said—and researchers appeared to accept—that certain features of Instagram could aggravate mental health issues in ways beyond its social media peers. Snapchat had a focus on silly filters and communication with friends, while TikTok was devoted to performance. Instagram, though? It revolved around bodies and lifestyle. The company disowned these findings after they were made public, calling the researchers’ apparent conclusion that Instagram could harm users with preexisting insecurities unreliable. The company would dispute allegations that it had buried negative research findings as “plain false.” <hr> Facebook had deployed a comment-filtering system to prevent the heckling of public figures such as Zuckerberg during livestreams, burying not just curse words and complaints but also substantive discussion of any kind. The system had been tuned for sycophancy, and poorly at that. The irony of heavily censoring comments on a speech about free speech wasn’t hard to miss. <hr> CrowdTangle’s rundown of that Tuesday’s top content had, it turned out, included a butthole. This wasn’t a borderline picture of someone’s ass. It was an unmistakable, up-close image of an anus. It hadn’t just gone big on Facebook—it had gone biggest. Holding the number one slot, it was the lead item that executives had seen when they opened Silverman’s email. “I hadn’t put Mark or Sheryl on it, but I basically put everyone else on there,” Silverman said.</p> <p>The picture was a thumbnail outtake from a porn video that had escaped Facebook’s automated filters. 
Such errors were to be expected, but was Facebook’s familiarity with its platform so poor that it wouldn’t notice when its systems started spreading that content to millions of people?</p> <p>Yes, it unquestionably was. <hr> In May, a data scientist working on integrity posted a Workplace note titled “Facebook Creating a Big Echo Chamber for ‘the Government and Public Health Officials Are Lying to Us’ Narrative—Do We Care?”</p> <p>Just a few months into the pandemic, groups devoted to opposing COVID lockdown measures had become some of the most widely viewed on the platform, pushing false claims about the pandemic under the guise of political activism. Beyond serving as an echo chamber for alternating claims that the virus was a Chinese plot and that the virus wasn’t real, the groups served as a staging area for platform-wide assaults on mainstream medical information. ... An analysis showed these groups had appeared abruptly, and while they had ties to well-established anti-vaccination communities, they weren’t arising organically. Many shared near-identical names and descriptions, and an analysis of their growth showed that “a relatively small number of people” were sending automated invitations to “hundreds or thousands of users per day.”</p> <p>Most of this didn’t violate Facebook’s rules, the data scientist noted in his post. Claiming that COVID was a plot by Bill Gates to enrich himself from vaccines didn’t meet Facebook’s definition of “imminent harm.” But, he said, the company should think about whether it was merely reflecting a widespread skepticism of COVID or creating one.</p> <p>“This is severely impacting public health attitudes,” a senior data scientist responded. “I have some upcoming survey data that suggests some baaaad results.” <hr> President Trump was gearing up for reelection and he took to his platform of choice, Twitter, to launch what would become a monthslong attempt to undermine the legitimacy of the November 2020 election. “There is no way (ZERO!) that Mail-In Ballots will be anything less than substantially fraudulent,” Trump wrote. As was standard for Trump’s tweets, the message was cross-posted on Facebook.</p> <p>Under the tweet, Twitter included a small alert that encouraged users to “Get the facts about mail-in ballots.” Anyone clicking on it was informed that Trump’s allegations of a “rigged” election were false and there was no evidence that mail-in ballots posed a risk of fraud.</p> <p>Twitter had drawn its line. Facebook now had to choose where it stood. Monika Bickert, Facebook’s head of Content Policy, declared that Trump’s post was right on the edge of the sort of misinformation about “methods for voting” that the company had already pledged to take down.</p> <p>Zuckerberg didn’t have a strong position, so he went with his gut and left it up. But then he went on Fox News to attack Twitter for doing the opposite. “I just believe strongly that Facebook shouldn’t be the arbiter of truth of everything that people say online,” he told host Dana Perino. “Private companies probably shouldn’t be, especially these platform companies, shouldn’t be in the position of doing that.”</p> <p>The interview caused some tumult inside Facebook. Why would Zuckerberg encourage Trump’s testing of the platform’s boundaries by declaring its tolerance of the post a matter of principle? The perception that Zuckerberg was kowtowing to Trump was about to get a lot worse. 
On the day of his Fox News interview, protests over the recent killing of George Floyd by Minneapolis police officers had gone national, and the following day the president tweeted that “when the looting starts, the shooting starts”—a notoriously menacing phrase used by a white Miami police chief during the civil rights era.</p> <p>Declaring that Trump had violated its rules against glorifying violence, Twitter took the rare step of limiting the public’s ability to see the tweet—users had to click through a warning to view it, and they were prevented from liking or retweeting it.</p> <p>Over on Facebook, where the message had been cross-posted as usual, the company’s classifier for violence and incitement estimated it had just under a 90 percent probability of breaking the platform’s rules—just shy of the threshold that would get a regular user’s post automatically deleted.</p> <p>Trump wasn’t a regular user, of course. As a public figure, arguably the world’s most public figure, his account and posts were protected by dozens of different layers of safeguards. <hr> Facebook drew up a list of accounts that were immune to some or all immediate enforcement actions. If those accounts appeared to break Facebook’s rules, the issue would go up the chain of Facebook’s hierarchy and a decision would be made on whether to take action against the account or not. Every social media platform ended up creating similar lists—it didn’t make sense to adjudicate complaints about heads of state, famous athletes, or persecuted human rights advocates in the same way the companies did with run-of-the-mill users. The problem was that, like a lot of things at Facebook, the company’s process got particularly messy.</p> <p>For Facebook, the risks that arose from shielding too few users were seen as far greater than the risks of shielding too many. Erroneously removing a bigshot’s content could unleash public hell—in Facebook parlance, a “media escalation” or, that most dreaded of events, a “PR fire.” Hours or days of coverage would follow when Facebook erroneously removed posts from breast cancer victims or activists of all stripes. When it took down a photo of a risqué French magazine cover posted to Instagram by the American singer Rihanna in 2014, it nearly caused an international incident. As internal reviews of the system later noted, the incentive was to shield as heavily as possible any account with enough clout to cause undue attention.</p> <p>No one team oversaw XCheck, and the term didn’t even have a specific definition. There were endless varieties and gradations applied to advertisers, posts, pages, and politicians, with hundreds of engineers around the company coding different flavors of protections and tagging accounts as needed. Eventually, at least 6 million accounts and pages were enrolled into XCheck, with an internal guide stating that an entity should be “newsworthy,” “influential or popular,” or “PR risky” to qualify. On Instagram, XCheck even covered popular animal influencers, including Doug the Pug.</p> <p>Any Facebook employee who knew the ropes could go into the system and flag accounts for special handling. XCheck was used by more than forty teams inside the company. Sometimes there were records of how they had deployed it and sometimes there were not. 
Later reviews would find that XCheck’s protections had been granted to “abusive accounts” and “persistent violators” of Facebook’s rules.</p> <p>The job of giving a second review to violating content from high-profile users would require a sizable team of full-time employees. Facebook simply never staffed one. Flagged posts were put into a queue that no one ever considered, sweeping already once-validated complaints under the digital rug. “Because there was no governance or rigor, those queues might as well not have existed,” recalled someone who worked with the system. “The interest was in protecting the business, and that meant making sure we don’t take down a whale’s post.”</p> <p>The stakes could be high. XCheck protected high-profile accounts, including in Myanmar, where public figures were using Facebook to incite genocide. It shielded the account of British far-right figure Tommy Robinson, an investigation by Britain’s Channel Four revealed in 2018.</p> <p>One of the most explosive cases was that of Brazilian soccer star Neymar, whose 150 million Instagram followers placed him among the platform’s top twenty influencers. After a woman accused Neymar of rape in 2019, he accused the woman of extorting him and posted Facebook and Instagram videos defending himself—and showing viewers his WhatsApp correspondence with his accuser, which included her name and nude photos of her. Facebook’s procedure for handling the posting of “non-consensual intimate imagery” was simple: delete it. But Neymar was protected by XCheck. For more than a day, the system blocked Facebook’s moderators from removing the video. An internal review of the incident found that 56 million Facebook and Instagram users saw what Facebook described in a separate document as “revenge porn,” exposing the woman to what an employee referred to in the review as “ongoing abuse” from other users.</p> <p>Facebook’s operational guidelines stipulate that not only should unauthorized nude photos be deleted, but people who post them should have their accounts deleted. Faced with the prospect of scrubbing one of the world’s most famous athletes from its platform, Facebook blinked.</p> <p>“After escalating the case to leadership,” the review said, “we decided to leave Neymar’s accounts active, a departure from our usual ‘one strike’ profile disable policy.”</p> <p>Facebook knew that providing preferential treatment to famous and powerful users was problematic at best and unacceptable at worst. “Unlike the rest of our community, these people can violate our standards without any consequences,” a 2019 review noted, calling the system “not publicly defensible.”</p> <p>Nowhere did XCheck interventions occur more than in American politics, especially on the right. <hr> When a high-enough-profile account was conclusively found to have broken Facebook’s rules, the company would delay taking action for twenty-four hours, during which it tried to convince the offending party to remove the offending post voluntarily. The program served as an invitation for privileged accounts to play at the edge of Facebook’s tolerance. If they crossed the line, they could simply take it back, having already gotten most of the traffic they would receive anyway. (Along with Diamond and Silk, every member of Congress ended up being granted the self-remediation window.)</p> <p>Sometimes Kaplan himself got directly involved. 
According to documents first obtained by BuzzFeed, the global head of Public Policy was not above either pushing employees to lift penalties against high-profile conservatives for spreading false information or leaning on Facebook’s fact-checkers to alter their verdicts.</p> <p>An understanding began to dawn among the politically powerful: if you mattered enough, Facebook would often cut you slack. Prominent entities rightly treated any significant punishment as a sign that Facebook didn’t consider them worthy of white-glove treatment. To prove the company wrong, they would scream as loudly as they could in response.</p> <p>“Some of these people were real gems,” recalled Harbath. In Facebook’s Washington, DC, office, staffers would explicitly justify blocking penalties against “Activist Mommy,” a Midwestern Christian account with a penchant for anti-gay rhetoric, because she would immediately go to the conservative press.</p> <p>Facebook’s fear of messing up with a major public figure was so great that some achieved a status beyond XCheck and were whitelisted altogether, rendering even their most vile content immune from penalties, downranking, and, in some cases, even internal review. <hr> Other Civic colleagues and Integrity staffers piled into the comments section to concur. “If our goal, was say something like: have less hate, violence etc. on our platform to begin with instead of remove more hate, violence etc. our solutions and investments would probably look quite different,” one wrote.</p> <p>Rosen was getting tired of dealing with Civic. Zuckerberg, who famously did not like to revisit decisions once they were made, had already dictated his preferred approach: automatically remove content if Facebook’s classifiers were highly confident that it broke the platform’s rules and take “soft” actions such as demotions when the systems predicted a violation was more likely than not. These were the marching orders and the only productive path forward was to diligently execute them. <hr> The week before, the Wall Street Journal had published a story my colleague Newley Purnell and I cowrote about how Facebook had exempted a firebrand Hindu politician from its hate speech enforcement. There had been no question that Raja Singh, a member of the Telangana state parliament, was inciting violence. He gave speeches calling for Rohingya immigrants who fled genocide in Myanmar to be shot, branded all Indian Muslims traitors, and threatened to raze mosques. He did these things while building an audience of more than 400,000 followers on Facebook. Earlier that year, police in Hyderabad had placed him under house arrest to prevent him from leading supporters to the scene of recent religious violence.</p> <p>That Facebook did nothing in the face of such rhetoric could have been due to negligence—there were a lot of firebrand politicians offering a lot of incitement in a lot of different languages around the world. But in this case, Facebook was well aware of Singh’s behavior. Indian civil rights groups had brought him to the attention of staff in both Delhi and Menlo Park as part of their efforts to pressure the company to act against hate speech in the country.</p> <p>There was no question whether Singh qualified as a “dangerous individual,” someone who would normally be barred from having a presence on Facebook’s platforms. 
Despite the internal conclusion that Singh and several other Hindu nationalist figures were creating a risk of actual bloodshed, their designation as hate figures had been blocked by Ankhi Das, Facebook’s head of Indian Public Policy—the same executive who had lobbied years earlier to reinstate BJP-associated pages after Civic had fought to take them down.</p> <p>Das, whose job included lobbying India’s government on Facebook’s behalf, didn’t bother trying to justify protecting Singh and other Hindu nationalists on technical or procedural grounds. She flatly said that designating them as hate figures would anger the government, and the ruling BJP, so the company would not be doing it. ... Following our story, Facebook India’s then–managing director Ajit Mohan assured the company’s Muslim employees that we had gotten it wrong. Facebook removed hate speech “as soon as it became aware of it” and would never compromise its community standards for political purposes. “While we know there is more to do, we are making progress every day,” he wrote.</p> <p>It was after we published the story that Kiran (a pseudonym) reached out to me. They wanted to make clear that our story in the Journal had just scratched the surface. Das’s ties with the government were far tighter than we understood, they said, and Facebook India was protecting entities much more dangerous than Singh. <hr> “Hindus, come out. Die or kill,” one prominent activist had declared during a Facebook livestream, according to a later report by retired Indian civil servants. The ensuing violence left fifty-three people dead and swaths of northeastern Delhi burned. <hr> The researcher set up a dummy account while traveling. Because the platform factored a user’s geography into content recommendations, she and a colleague noted in a writeup of her findings, it was the only way to get a true read on what the platform was serving up to a new Indian user.</p> <p>Ominously, her summary of what Facebook had recommended to their notional twenty-one-year-old Indian woman began with a trigger warning for graphic violence. While Facebook’s push of American test users toward conspiracy theories had been concerning, the Indian version was dystopian.</p> <p>“In the 3 weeks since the account has been opened, by following just this recommended content, the test user’s News Feed has become a near constant barrage of polarizing nationalist content, misinformation, and violence and gore,” the note stated. The dummy account’s feed had turned especially dark after border skirmishes between Pakistan and India in early 2019. Amid a period of extreme military tensions, Facebook funneled the user toward groups filled with content promoting full-scale war and mocking images of corpses with laughing emojis.</p> <p>This wasn’t a case of bad posts slipping past Facebook’s defenses, or one Indian user going down a nationalistic rabbit hole. What Facebook was recommending to the young woman had been bad from the start. 
The platform had pushed her to join groups clogged with images of corpses, watch purported footage of fictional air strikes, and congratulate nonexistent fighter pilots on their bravery.</p> <p>“I’ve seen more images of dead people in the past three weeks than I’ve seen in my entire life, total,” the researcher wrote, noting that the platform had allowed falsehoods, dehumanizing rhetoric, and violence to “totally take over during a major crisis event.” Facebook needed to consider not only how its recommendation systems were affecting “users who are different from us,” she concluded, but rethink how it built its products for “non-US contexts.”</p> <p>India was not an outlier. Outside of English-speaking countries and Western Europe, users routinely saw more cruelty, engagement bait, and falsehoods. Perhaps differing cultural senses of propriety explained some of the gap, but a lot clearly stemmed from differences in investment and concern. <hr> This wasn’t supposed to be legal in the Gulf under the gray-market labor sponsorship system known as kafala, but the internet had removed the friction from buying people. Undercover reporters from BBC Arabic posed as a Kuwaiti couple and negotiated to buy a sixteen-year-old girl whose seller boasted about never allowing her to leave the house.</p> <p>Everyone told the BBC they were horrified. Kuwaiti police rescued the girl and sent her home. Apple and Google pledged to root out the abuse, and the bartering apps cited in the story deleted their “domestic help” sections. Facebook pledged to take action and deleted a popular hashtag used to advertise maids for sale.</p> <p>After that, the company largely dropped the matter. But Apple turned out to have a longer attention span. In October, after sending Facebook numerous examples of ongoing maid sales via Instagram, it threatened to remove Facebook’s products from its App Store.</p> <p>Unlike human trafficking, this, to Facebook, was a real crisis.</p> <p>“Removing our applications from Apple’s platforms would have had potentially severe consequences to the business, including depriving millions of users of access to IG &amp; FB,” an internal report on the incident stated.</p> <p>With alarm bells ringing at the highest levels, the company found and deleted an astonishing 133,000 posts, groups, and accounts related to the practice within days. It also performed a quick revamp of its policies, reversing a previous rule allowing the sale of maids through “brick and mortar” businesses. (To avoid upsetting the sensibilities of Gulf State “partners,” the company had previously permitted the advertising and sale of servants by businesses with a physical address.) Facebook also committed to “holistic enforcement against any and all content promoting domestic servitude,” according to the memo.</p> <p>Apple lifted its threat, but again Facebook wouldn’t live up to its pledges. Two years later, in late 2021, an Integrity staffer would write up an investigation titled “Domestic Servitude: This Shouldn’t Happen on FB and How We Can Fix It.” Focused on the Philippines, the memo described how fly-by-night employment agencies were recruiting women with “unrealistic promises” and then selling them into debt bondage overseas. If Instagram was where domestic servants were sold, Facebook was where they were recruited.</p> <p>Accessing the direct-messaging inboxes of the placing agencies, the staffer found Filipina domestic servants pleading for help. Some reported rape or sent pictures of bruises from being hit. 
Others hadn’t been paid in months. Still others reported being locked up and starved. The labor agencies didn’t help.</p> <p>The passionately worded memo, and others like it, listed numerous things the company could do to prevent the abuse. There were improvements to classifiers, policy changes, and public service announcements to run. Using machine learning, Facebook could identify Filipinas who were looking for overseas work and then inform them of how to spot red flags in job postings. In Persian Gulf countries, Instagram could run PSAs about workers’ rights.</p> <p>These things largely didn’t happen for a host of reasons. One memo noted a concern that, if worded too strongly, Arabic-language PSAs admonishing against the abuse of domestic servants might “alienate buyers” of them. But the main obstacle, according to people familiar with the team, was simply resources. The team devoted full-time to human trafficking—which included not just the smuggling of people for labor and sex but also the sale of human organs—amounted to a half-dozen people worldwide. The team simply wasn’t large enough to knock this stuff out. <hr> “We’re largely blind to problems on our site,” Leach’s presentation wrote of Ethiopia.</p> <p>Facebook employees produced a lot of internal work like this: declarations that the company had gotten in over its head, unable to provide even basic remediation to potentially horrific problems. Events on the platform could foreseeably lead to loss of life and almost certainly did, according to human rights groups monitoring Ethiopia. Meareg Amare, a university lecturer in Addis Ababa, was murdered outside his home one month after a post went viral, receiving 35,000 likes, listing his home address and calling for him to be attacked. Facebook failed to remove it. His family is now suing the company.</p> <p>As it so often did, the company was choosing growth over quality. Efforts to expand service to poorer and more isolated places would not wait for user protections to catch up, and, even in countries at “dire” risk of mass atrocities, the At Risk Countries team needed approval to do things that harmed engagement. <hr> Documents and transcripts of internal meetings among the company’s American staff show employees struggling to explain why Facebook wasn’t following its normal playbook when dealing with hate speech, the coordination of violence, and government manipulation in India. Employees in Menlo Park discussed the BJP’s promotion of the “Love Jihad” lie. They met with human rights organizations that documented the violence committed by the platform’s cow-protection vigilantes. And they tracked efforts by the Indian government and its allies to manipulate the platform via networks of accounts. Yet nothing changed.</p> <p>“We have a lot of business in India, yeah. 
And we have connections with the government, I guess, so there are some sensitivities around doing a mitigation in India,” one employee told another about the company’s protracted failure to address abusive behavior by an Indian intelligence service.</p> <p>During another meeting, a team working on what it called the problem of “politicized hate” informed colleagues that the BJP and its allies were coordinating both the “Love Jihad” slander and another hashtag, #CoronaJihad, premised on the idea that Muslims were infecting Hindus with COVID via halal food.</p> <p>The Rashtriya Swayamsevak Sangh, or RSS—the umbrella Hindu nationalist movement of which the BJP is the political arm—was promoting these slanders through 6,000 or 7,000 different entities on the platform, with the goal of portraying Indian Muslims as subhuman, the presenter explained. Some of the posts said that the Quran encouraged Muslim men to rape their female family members.</p> <p>“What they’re doing really permeates Indian society,” the presenter noted, calling it part of a “larger war.”</p> <p>A colleague at the meeting asked the obvious question. Given the company’s conclusive knowledge of the coordinated hate campaign, why hadn’t the posts or accounts been taken down?</p> <p>“Ummm, the answer that I’ve received for the past year and a half is that it’s too politically sensitive to take down RSS content as hate,” the presenter said.</p> <p>Nothing needed to be said in response.</p> <p>“I see your face,” the presenter said. “And I totally agree.” <hr> One incident in particular, involving a local political candidate, stuck out. As Kiran recalled it, the guy was a little fish, a Hindu nationalist activist who hadn’t achieved Raja Singh’s six-digit follower count but was still a provocateur. The man’s truly abhorrent behavior had been repeatedly flagged by lower-level moderators, but somehow the company always seemed to give it a pass.</p> <p>This time was different. The activist had streamed a video in which he and some accomplices kidnapped a man who, they informed the camera, had killed a cow. They took their captive to a construction site and assaulted him while Facebook users heartily cheered in the comments section. <hr> Zuckerberg launched an internal campaign against social media overenforcement. Ordering the creation of a team dedicated to preventing wrongful content takedowns, Zuckerberg demanded regular briefings on its progress from senior employees. He also suggested that, instead of rigidly enforcing platform rules on content in Groups, Facebook should defer more to the sensibilities of the users in them. In response, a staffer proposed entirely exempting private groups from enforcement for “low-tier hate speech.” <hr> The stuff was viscerally terrible—people clamoring for lynchings and civil war. One group was filled with “enthusiastic calls for violence every day.” Another top group claimed it was set up by Trump-supporting patriots but was actually run by “financially motivated Albanians” directing a million views daily to fake news stories and other provocative content.</p> <p>The comments were often worse than the posts themselves, and even this was by design. The content of the posts would be incendiary but fall just shy of Facebook’s boundaries for removal—it would be bad enough, however, to harvest user anger, classic “hate bait.” The administrators were professionals, and they understood the platform’s weaknesses every bit as well as Civic did. 
In News Feed, anger would rise like a hot-air balloon, and such comments could take a group to the top.</p> <p>Public Policy had previously refused to act on hate bait <hr> We have heavily overpromised regarding our ability to moderate content on the platform,” one data scientist wrote to Rosen in September. “We are breaking and will continue to break our recent promises.” <hr> The longstanding conflicts between Civic and Facebook’s Product, Policy, and leadership teams had boiled over in the wake of the “looting/shooting” furor, and executives—minus Chakrabarti—had privately begun discussing how to address what was now unquestionably viewed as a rogue Integrity operation. Civic, with its dedicated engineering staff, hefty research operation, and self-chosen mission statement, was on the chopping block. <hr> The group had grown to more than 360,000 members less than twenty-four hours later when Facebook took it down, citing “extraordinary measures.” Pushing false claims of election fraud to a mass audience at a time when armed men were calling for a halt to vote counting outside tabulation centers was an obvious problem, and one that the company knew was only going to get bigger. Stop the Steal had an additional 2.1 million users pending admission to the group when Facebook pulled the plug.</p> <p>Facebook’s leadership would describe Stop the Steal’s growth as unprecedented, though Civic staffers could be forgiven for not sharing their sense of surprise. <hr> Zuckerberg had accepted the deletion under emergency circumstances, but he didn’t want the Stop the Steal group’s removal to become a precedent for a backdoor ban on false election claims. During the run-up to Election Day, Facebook had removed only lies about the actual voting process—stuff like “Democrats vote on Wednesday” and “People with outstanding parking tickets can’t go to the polls.” Noting the thin distinction between the claim that votes wouldn’t be counted and that they wouldn’t be counted accurately, Chakrabarti had pushed to take at least some action against baseless election fraud claims.</p> <p>Civic hadn’t won that fight, but with the Stop the Steal group spawning dozens of similarly named copycats—some of which also accrued six-figure memberships—the threat of further organized election delegitimization efforts was obvious.</p> <p>Barred from shutting down the new entities, Civic assigned staff to at least study them. Staff also began tracking top delegitimization posts, which were earning tens of millions of views, for what one document described as “situational awareness.” A later analysis found that as much as 70 percent of Stop the Steal content was coming from known “low news ecosystem quality” pages, the commercially driven publishers that Facebook’s News Feed integrity staffers had been trying to fight for years. <hr> Zuckerberg overruled both Facebook’s Civic team and its head of counterterrorism. Shortly after the Associated Press called the presidential election for Joe Biden on November 7—the traditional marker for the race being definitively over—Molly Cutler assembled roughly fifteen executives that had been responsible for the company’s election preparation. Citing orders from Zuckerberg, she said the election delegitimization monitoring was to immediately stop. <hr> On December 17, a data scientist flagged that a system responsible for either deleting or restricting high-profile posts that violated Facebook’s rules had stopped doing so. 
Colleagues ignored it, assuming that the problem was just a “logging issue”—meaning the system still worked, it just wasn’t recording its actions. On the list of Facebook’s engineering priorities, fixing that didn’t rate.</p> <p>In fact, the system truly had failed, in early November. Between then and when engineers realized their error in mid-January, the system had given a pass to 3,100 highly viral posts that should have been deleted or labeled “disturbing.”</p> <p>Glitches like that happened all the time at Facebook. Unfortunately, this one produced an additional 8 billion “regrettable” views globally, instances in which Facebook had shown users content that it knew was trouble. The company would later say that only a small minority of the 8 billion “regrettable” content views touched on American politics, and that the mistake was immaterial to subsequent events. A later review of Facebook’s post-election work tartly described the flub as a “lowlight” of the platform’s 2020 election performance, though the company disputes that it had a meaningful impact. At least 7 billion of the bad content views were international, the company says, and of the American material only a portion dealt with politics. Overall, a spokeswoman said, the company remains proud of its pre- and post-election safety work. <hr> Zuckerberg vehemently disagreed with people who said that the COVID vaccine was unsafe, but he supported their right to say it, including on Facebook. ... Under Facebook’s policy, health misinformation about COVID was to be removed only if it posed an imminent risk of harm, such as a post telling infected people to drink bleach ... A researcher randomly sampled English-language comments containing phrases related to COVID and vaccines. A full two-thirds were anti-vax. The researcher’s memo compared that figure to public polling showing the prevalence of anti-vaccine sentiment in the U.S.—it was a full 40 points lower.</p> <p>Additional research found that a small number of “big whales” was behind a large portion of all anti-vaccine content on the platform. Of 150,000 posters in Facebook groups that were eventually disabled for COVID misinformation, just 5 percent were producing half of all posts. And just 1,400 users were responsible for inviting half of all members. “We found, like many problems at FB, this is a head-heavy problem with a relatively few number of actors creating a large percentage of the content and growth,” Facebook researchers would later note.</p> <p>One of the anti-vax brigade’s favored tactics was to piggyback on posts from entities like UNICEF and the World Health Organization encouraging vaccination, which Facebook was promoting free of charge. Anti-vax activists would respond with misinformation or derision in the comments section of these posts, then boost one another’s hostile comments toward the top slot <hr> Even as Facebook prepared for virally driven crises to become routine, the company’s leadership was becoming increasingly comfortable absolving its products of responsibility for feeding them. By the spring of 2021, it wasn’t just Boz arguing that January 6 was someone else’s problem. Sandberg suggested that January 6 was “largely organized on platforms that don’t have our abilities to stop hate.” Zuckerberg told Congress that they need not cast blame beyond Trump and the rioters themselves. 
“The country is deeply divided right now and that is not something that tech alone can fix,” he said.</p> <p>In some instances, the company appears to have publicly cited research in what its own staff had warned were inappropriate ways. A June 2020 review of both internal and external research had warned that the company should avoid arguing that higher rates of polarization among the elderly—the demographic that used social media least—was proof that Facebook wasn’t causing polarization.</p> <p>Though the argument was favorable to Facebook, researchers wrote, Nick Clegg should avoid citing it in an upcoming opinion piece because “internal research points to an opposite conclusion.” Facebook, it turned out, fed false information to senior citizens at such a massive rate that they consumed far more of it despite spending less time on the platform. Rather than vindicating Facebook, the researchers wrote, “the stronger growth of polarization for older users may be driven in part by Facebook use.”</p> <p>All the researchers wanted was for executives to avoid parroting a claim that Facebook knew to be wrong, but they didn’t get their wish. The company says the argument never reached Clegg. When he published a March 31, 2021, Medium essay titled “You and the Algorithm: It Takes Two to Tango,” he cited the internally debunked claim among the “credible recent studies” disproving that “we have simply been manipulated by machines all along.” (The company would later say that the appropriate takeaway from Clegg’s essay on polarization was that “research on the topic is mixed.”)</p> <p>Such bad-faith arguments sat poorly with researchers who had worked on polarization and analyses of Stop the Steal, but Clegg was a former politician hired to defend Facebook, after all. The real shock came from an internally published research review written by Chris Cox.</p> <p>Titled “What We Know About Polarization,” the April 2021 Workplace memo noted that the subject remained “an albatross public narrative,” with Facebook accused of “driving societies into contexts where they can’t trust each other, can’t share common ground, can’t have conversations about issues, and can’t share a common view on reality.”</p> <p>But Cox and his coauthor, Facebook Research head Pratiti Raychoudhury, were happy to report that a thorough review of the available evidence showed that this “media narrative” was unfounded. The evidence that social media played a contributing role in polarization, they wrote, was “mixed at best.” Though Facebook likely wasn’t at fault, Cox and Raychoudhury wrote, the company was still trying to help, in part by encouraging people to join Facebook groups. “We believe that groups are on balance a positive, depolarizing force,” the review stated.</p> <p>The writeup was remarkable for its choice of sources. Cox’s note cited stories by New York Times columnists David Brooks and Ezra Klein alongside early publicly released Facebook research that the company’s own staff had concluded was no longer accurate. 
At the same time, it omitted the company’s past conclusions, affirmed in another literature review just ten months before, that Facebook’s recommendation systems encouraged bombastic rhetoric from publishers and politicians, as well as previous work finding that seeing vicious posts made users report “more anger towards people with different social, political, or cultural beliefs.” While nobody could reliably say how Facebook altered users’ off-platform behavior, how the company shaped their social media activity was accepted fact. “The more misinformation a person is exposed to on Instagram the more trust they have in the information they see on Instagram,” company researchers had concluded in late 2020.</p> <p>In a statement, the company called the presentation “comprehensive” and noted that partisan divisions in society arose “long before platforms like Facebook even existed.” For staffers that Cox had once assigned to work on addressing known problems of polarization, his note was a punch to the gut. <hr> In 2016, the New York Times had reported that Facebook was quietly working on a censorship tool in an effort to gain entry to the Chinese market. While the story was a monster, it didn’t come as a surprise to many people inside the company. Four months earlier, an engineer had discovered that another team had modified a spam-fighting tool in a way that would allow an outside party control over content moderation in specific geographic regions. In response, he had resigned, leaving behind a badge post correctly surmising that the code was meant to loop in Chinese censors.</p> <p>With a literary mic drop, the post closed out with a quote on ethics from Charlotte Brontë’s Jane Eyre: “Laws and principles are not for the times when there is no temptation: they are for such moments as this, when body and soul rise in mutiny against their rigour; stringent are they; inviolate they shall be. If at my individual convenience I might break them, what would be their worth?”</p> <p>Garnering 1,100 reactions, 132 comments, and 57 shares, the post took the program from top secret to open secret. Its author had just pioneered a new template: the hard-hitting Facebook farewell.</p> <p>That particular farewell came during a time when Facebook’s employee satisfaction surveys were generally positive, before the time of endless crisis, when societal concerns became top of mind. In the intervening years, Facebook had hired a massive base of Integrity employees to work on those issues, and seriously pissed off a nontrivial portion of them.</p> <p>Consequently, some badge posts began to take on a more mutinous tone. Staffers who had done groundbreaking work on radicalization, human trafficking, and misinformation would summarize both their accomplishments and where they believed the company had come up short on technical and moral grounds. Some broadsides against the company ended on a hopeful note, including detailed, jargon-light instructions for how, in the future, their successors could resurrect the work.</p> <p>These posts were gold mines for Haugen, connecting product proposals, experimental results, and ideas in ways that would have been impossible for an outsider to re-create. She photographed not just the posts themselves but the material they linked to, following the threads to other topics and documents. A half dozen were truly incredible, unauthorized chronicles of Facebook’s dawning understanding of the way its design determined what its users consumed and shared. 
The authors of these documents hadn’t been trying to push Facebook toward social engineering—they had been warning that the company had already wandered into doing so and was now neck deep. <hr> The researchers’ best understanding was summarized this way: “We make body image issues worse for one in three teen girls.” <hr> In 2020, Instagram’s Well-Being team had run a study of massive scope, surveying 100,000 users in nine countries about negative social comparison on Instagram. The researchers then paired the answers with individualized data on how each user who took the survey had behaved on Instagram, including how and what they posted. They found that, for a sizable minority of users, especially those in Western countries, Instagram was a rough place. Ten percent reported that they “often or always” felt worse about themselves after using the platform, and a quarter believed Instagram made negative comparison worse.</p> <p>Their findings were incredibly granular. They found that fashion and beauty content produced negative feelings in ways that adjacent content like fitness did not. They found that “people feel worse when they see more celebrities in feed,” and that Kylie Jenner seemed to be unusually triggering, while Dwayne “The Rock” Johnson was no trouble at all. They found that people judged themselves far more harshly against friends than celebrities. A movie star’s post needed 10,000 likes before it caused social comparison, whereas, for a peer, the number was ten.</p> <p>In order to confront these findings, the Well-Being team suggested that the company cut back on recommending celebrities for people to follow, or reweight Instagram’s feed to include less celebrity and fashion content, or de-emphasize comments about people’s appearance. As a fellow employee noted in response to summaries of these proposals on Workplace, the Well-Being team was suggesting that Instagram become less like Instagram.</p> <p>“Isn’t that what IG is mostly about?” the man wrote. “Getting a peek at the (very photogenic) life of the top 0.1%? Isn’t that the reason why teens are on the platform?” <hr> “We are practically not doing anything,” the researchers had written, noting that Instagram wasn’t currently able to stop itself from promoting underweight influencers and aggressive dieting. A test account that signaled an interest in eating disorder content filled up with pictures of thigh gaps and emaciated limbs.</p> <p>The problem would be relatively easy for outsiders to document. Instagram was, the research warned, “getting away with it because no one has decided to dial into it.” <hr> He began the presentation by noting that 51 percent of Instagram users reported having a “bad or harmful” experience on the platform in the previous seven days. But only 1 percent of those users reported the objectionable content to the company, and Instagram took action in 2 percent of those cases. The math meant that the platform remediated only 0.02 percent of what upset users—just one bad experience out of every 5,000.</p> <p>“The numbers are probably similar on Facebook,” he noted, calling the statistics evidence of the company’s failure to understand the experiences of users such as his own daughter. Now sixteen, she had recently been told to “get back to the kitchen” after she posted about cars, Bejar said, and she continued receiving the unsolicited dick pics she had been getting since the age of fourteen. “I asked her why boys keep doing that? 
She said if the only thing that happens is they get blocked, why wouldn’t they?”</p> <p>Two years of research had confirmed that Joanna Bejar’s logic was sound. On a weekly basis, 24 percent of all Instagram users between the ages of thirteen and fifteen received unsolicited advances, Bejar informed the executives. Most of that abuse didn’t violate the company’s policies, and Instagram rarely caught the portion that did. <hr> nothing highlighted the costs better than a Twitter bot set up by New York Times reporter Kevin Roose. Using methodology created with the help of a CrowdTangle staffer, Roose found a clever way to put together a daily top ten of the platform’s highest-engagement content in the United States, producing a leaderboard that demonstrated how thoroughly partisan publishers and viral content aggregators dominated the engagement signals that Facebook valued most.</p> <p>The degree to which that single automated Twitter account got under the skin of Facebook’s leadership would be difficult to overstate. Alex Schultz, the VP who oversaw Facebook’s Growth team, was especially incensed—partly because he considered raw engagement counts to be misleading, but more because it was Facebook’s own tool reminding the world every morning at 9:00 a.m. Pacific that the platform’s content was trash.</p> <p>“The reaction was to prove the data wrong,” recalled Brian Boland. But efforts to employ other methodologies only produced top ten lists that were nearly as unflattering. Schultz began lobbying to kill off CrowdTangle altogether, replacing it with periodic top content reports of its own design. That would still be more transparency than any of Facebook’s rivals offered, Schultz noted.</p> <p>...</p> <p>Schultz handily won the fight. In April 2021, Silverman convened his staff on a conference call and told them that CrowdTangle’s team was being disbanded. ... “Boz would just say, ‘You’re completely off base,’ ” Boland said. “Data wins arguments at Facebook, except for this one.” <hr> When the company issued its response later in May, I read the document with a clenched jaw. Facebook had agreed to grant the board’s request for information about XCheck and “any exceptional processes that apply to influential users.”</p> <p>...</p> <p>“We want to make clear that we remove content from Facebook, no matter who posts it,” Facebook’s response to the Oversight Board read. “Cross check simply means that we give some content from certain Pages or Profiles additional review.”</p> <p>There was no mention of whitelisting, of C-suite interventions to protect famous athletes, of queues of likely violating posts from VIPs that never got reviewed. Although our documents showed that at least 7 million of the platform’s most prominent users were shielded by some form of XCheck, Facebook assured the board that it applied to only “a small number of decisions.” The only XCheck-related request that Facebook didn’t address was for data that might show whether XChecked users had received preferential treatment.</p> <p>“It is not feasible to track this information,” Facebook responded, neglecting to mention that it was exempting some users from enforcement entirely. <hr> “I’m sure many of you have found the recent coverage hard to read because it just doesn’t reflect the company we know,” he wrote in a note to employees that was also shared on Facebook.
The allegations didn’t even make sense, he wrote: “I don’t know any tech company that sets out to build products that make people angry or depressed.”</p> <p>Zuckerberg said he worried the leaks would discourage the tech industry at large from honestly assessing their products’ impact on the world, in order to avoid the risk that internal research might be used against them. But he assured his employees that their company’s internal research efforts would stand strong. “Even though it might be easier for us to follow that path, we’re going to keep doing research because it’s the right thing to do,” he wrote.</p> <p>By the time Zuckerberg made that pledge, research documents were already disappearing from the company’s internal systems. Had a curious employee wanted to double-check Zuckerberg’s claims about the company’s polarization work, for example, they would have found that key research and experimentation data had become inaccessible.</p> <p>The crackdown had begun. <hr> One memo required researchers to seek special approval before delving into anything on a list of topics requiring “mandatory oversight”—even as a manager acknowledged that the company did not maintain such a list. <hr> The “Narrative Excellence” memo and its accompanying notes and charts were a guide to producing documents that reporters like me wouldn’t be excited to see. Unfortunately, as a few bold user experience researchers noted in the replies, achieving Narrative Excellence was all but incompatible with succeeding at their jobs. Writing things that were “safer to be leaked” meant writing things that would have less impact.</p> </blockquote> <h2 id="appendix-non-statements">Appendix: non-statements</h2> <p>I really like the &quot;non-goals&quot; section of design docs. I think the analogous non-statements section of a doc like this is much less valuable because <a href="https://jakeseliger.com/2022/01/31/most-people-dont-read-carefully-or-for-comprehension/">the top-level non-statements can generally be inferred by reading this doc</a>, whereas top-level non-goals often add information, but I figured I'd try this out anyway.</p> <ul> <li>Facebook (or any other company named here, like Uber) is uniquely bad <ul> <li>As discussed, on the contrary, I think Facebook isn't very atypical, which is why this post discusses examples from so many companies other than Facebook</li> </ul></li> <li>Zuckerberg (or any other person named) is uniquely bad</li> <li>Big tech employees are bad people</li> <li>No big tech company employees are working hard or trying hard <ul> <li>For some reason, a common response to any criticism of a tech company foible or failure is &quot;people are working hard&quot;. This is almost never a response to a critique that nobody is working hard, and that is once again not the critique here</li> </ul></li> <li>Big tech companies should be broken up or otherwise have antitrust action taken against them <ul> <li>Maybe so, but this document doesn't make that case</li> </ul></li> <li>Bigger companies in the same industry are strictly worse than smaller companies <ul> <li>Discussed above, but I'll mention it again here</li> </ul></li> <li>The general bigness vs. smallness tradeoff as discussed here applies strictly across all areas and industries <ul> <li>Also mentioned above, but mentioned again here.
For example, the percentage of rides in which a taxi driver tries to scam the user seems much higher with traditional taxis than with Uber</li> </ul></li> <li>It's easy to do moderation and support at scale</li> <li>On average, large companies provide a worse experience for users <ul> <li>For example, I still use Amazon because it gives me the best overall experience. As noted above, cost and shipping are better with Amazon than with any other alternative. There are entire classes of items where most things I've bought are counterfeit, such as masks and respirators. When I bought these in January 2020, before these were something many people would buy, I got genuine 3M masks. Masks and filters were then hard to get for a while, and then when they became available again, the majority of 3M masks and filters I got were counterfeit (out of curiosity, I tried more than a few independent orders over the next few years). I try to avoid classes of items that have a high counterfeit rate (but a naive user who doesn't know to do this will buy a lot of low-quality counterfeits), and I know I'm rolling the dice every time I buy any expensive item (if I get a counterfeit or an empty box, Amazon might not accept the return or refund me unless I can make a viral post about the issue), and sometimes a class of item goes from being one where you can usually get good items to one where most items are counterfeit.</li> <li>Many objections are, implicitly or explicitly, about the average experience, but this is nonsensical when the discussion is about the experience in the tail; this is like the standard response you see when someone notes that a concurrency bug is a problem and someone else says it's fine because &quot;it works for me&quot;, which doesn't make sense for bugs that occur in the tail.</li> </ul></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:C">when Costco was smaller, I would've put Costco here instead of Best Buy, but as they've gotten bigger, I've noticed that their quality has gone down. It's really striking how (relatively) frequently I find sealed items like cheese that have gone bad long before their &quot;best by&quot; date, or items that are just totally broken. This doesn't appear to have anything to do with any particular location since I moved almost annually for close to a decade and observed this decline across many different locations (because I was moving, I at first thought that I'd gotten unlucky with where I'd moved to, but as I tried locations in various places, I realized that the decline wasn't specific to any location and seems to have impacted stores in both the U.S. and Canada). <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B"><p><a href="https://archive.is/1lEoA">when the WSJ looked at leaked internal Meta documents, they found, among other things, that Meta estimated that 100k minors per day &quot;received photos of adult genitalia or other sexually abusive content&quot;</a>. Of course, <a href="https://mastodon.social/@danluu/111042448496012974">smart contrarians</a> will argue that this is totally normal, e.g., two of the first few comments on HN were about how there's nothing particularly wrong with this. Sure, it's bad for children to get harassed, but &quot;it can happen on any street corner&quot;, &quot;what's the base rate to compare against&quot;, etc.</p> <p>Very loosely, if we're liberal, we might estimate that Meta had 2.5B DAU in early 2021 and 500M were minors, or if we're conservative, maybe we guess that 100M are minors.
So, we might guess that Meta estimated something like 0.1% to 0.02% of minors on Meta platforms received photos of genitals or similar each day (100k out of 100M is 0.1%; 100k out of 500M is 0.02%). Is this roughly the normal rate they would experience elsewhere? Compared to the real world, possibly, although I would be surprised if 0.1% of children are being exposed to people's genitals &quot;on any street corner&quot; each day. Compared to a well-moderated small forum, that seems highly implausible. The internet commenters' reaction was the same one that Arturo Bejar, who designed Facebook's reporting system and worked in the area, initially had. He dismissed reports about this kind of thing because it didn't seem plausible that it could really be that bad, but he quickly changed his mind once he started looking into it:</p> <blockquote> <p>Joanna’s account became moderately successful, and that’s when things got a little dark. Most of her followers were enthused about a [14-year-old] girl getting into car restoration, but some showed up with rank misogyny, like the guy who told Joanna she was getting attention “just because you have tits.”</p> <p>“Please don’t talk about my underage tits,” Joanna Bejar shot back before reporting the comment to Instagram. A few days later, Instagram notified her that the platform had reviewed the man’s comment. It didn’t violate the platform’s community standards.</p> <p>Bejar, who had designed the predecessor to the user-reporting system that had just shrugged off the sexual harassment of his daughter, told her the decision was a fluke. But a few months later, Joanna mentioned to Bejar that a kid from a high school in a neighboring town had sent her a picture of his penis via an Instagram direct message. Most of Joanna’s friends had already received similar pics, she told her dad, and they all just tried to ignore them.</p> <p>Bejar was floored. The teens exposing themselves to girls who they had never met were creeps, but they presumably weren’t whipping out their dicks when they passed a girl in a school parking lot or in the aisle of a convenience store. Why had Instagram become a place where it was accepted that these boys occasionally would—or that young women like his daughter would have to shrug it off?</p> </blockquote> <p>Much of the book, Broken Code, is about Bejar and others trying to get Meta to take problems like this seriously and making little progress and often having their progress undone (although PR issues for FB seem to force FB's hand and drive some progress towards the end of the book):</p> <blockquote> <p>six months prior, a team had redesigned Facebook’s reporting system with the specific goal of reducing the number of completed user reports so that Facebook wouldn’t have to bother with them, freeing up resources that could otherwise be invested in training its artificial intelligence–driven content moderation systems. In a memo about efforts to keep the costs of hate speech moderation under control, a manager acknowledged that Facebook might have overdone its effort to stanch the flow of user reports: “We may have moved the needle too far,” he wrote, suggesting that perhaps the company might not want to suppress them so thoroughly.</p> <p>The company would later say that it was trying to improve the quality of reports, not stifle them. But Bejar didn’t have to see that memo to recognize bad faith. The cheery blue button was enough. He put down his phone, stunned. This wasn’t how Facebook was supposed to work.
How could the platform care about its users if it didn’t care enough to listen to what they found upsetting?</p> <p>There was an arrogance here, an assumption that Facebook’s algorithms didn’t even need to hear about what users experienced to know what they wanted. And even if regular users couldn’t see that like Bejar could, they would end up getting the message. People like his daughter and her friends would report horrible things a few times before realizing that Facebook wasn’t interested. Then they would stop.</p> </blockquote> <p>If you're interested in the topic, I'd recommend reading the whole book, but if you just want to get a flavor for the kinds of things the book discusses, I've put a few relevant quotes into an appendix. After reading the book, I can't say that I'm very sure the number is correct because I'd have to look at the data to be strongly convinced, but it does seem plausible. And as for why Facebook might expose children to more of this kind of thing than another platform, the book makes the case that this falls out of a combination of optimizing for engagement, &quot;number go up&quot;, and neglecting &quot;trust and safety&quot; work:</p> <blockquote> <p>Only a few hours of poking around Instagram and a handful of phone calls were necessary to see that something had gone very wrong—the sort of people leaving vile comments on teenagers’ posts weren’t lone wolves. They were part of a large-scale pedophilic community fed by Instagram’s recommendation systems.</p> <p>Further reporting led to an initial three-thousand-word story headlined “Instagram Connects Vast Pedophile Network.” Co-written with Katherine Blunt, the story detailed how Instagram’s recommendation systems were helping to create a pedophilic community, matching users interested in underage sex content with each other and with accounts advertising “menus” of content for sale. Instagram’s search bar actively suggested terms associated with child sexual exploitation, and even glancing contact with accounts with names like Incest Toddlers was enough to trigger Instagram to begin pushing users to connect with them.</p> </blockquote> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:F">but, fortunately for Zuckerberg, his target audience seems to have little understanding of the tech industry, so it doesn't really matter that Zuckerberg's argument isn't plausible. In a future post, we might look at <a href="https://twitter.com/danluu/status/1478329866185437184">incorrect reasoning from regulators</a> and <a href="https://twitter.com/altluu/status/1484718002046070784">government officials</a>, but, for now, see <a href="https://twitter.com/garybernhardt/status/1340064481271898115">this example from Gary Bernhardt, where FB makes a claim that appears to be the opposite of correct to people who work in the area</a>. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:S">Another claim, rarer than &quot;it would cost too much to provide real support&quot;, is &quot;support can't be done because it's a social engineering attack vector&quot;.
This isn't as immediately implausible because it calls to mind all of the cases where people had their SMS-2FA'd accounts owned by someone calling up a phone company and getting a phone number transferred, but I don't find it all that plausible since bank and brokerage accounts are, in general, much higher value than FB accounts, and FB accounts are still compromised at a much higher rate, even for online-only accounts, accounts from before KYC requirements were in play, or whatever other factor people name as a reasonable-sounding reason for the difference. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:A"><p>Another reason, less reasonable, but the actual impetus for this post, is that when Zuckerberg made his comments that only the absolute largest companies in the world can handle issues like fraud and spam, it struck me as completely absurd and, because I enjoy absurdity, I started a doc where I recorded links I saw to large company spam, fraud, moderation, and support failures, much like the <a href="seo-spam/#appendix-google-knowledge-card-results">list of Google knowledge card results I kept track of for a while</a>. I didn't have a plan for what to do with that and just kept it going for years before I decided to publish the list, at which point I felt that I had to write something, since the bare list by itself isn't that interesting, so I started writing up summaries of each link (the original list was just a list of links), and here we are. When I sit down to write something, I generally have an idea of the approach I'm going to take, but I frequently end up changing my mind when I start looking at the data.</p> <p>For example, <a href="testing/">since going from hardware to software, I've had this feeling that conventional software testing is fairly low ROI</a>, so when I joined Twitter, I had this idea that I would look at the monetary impact of errors (e.g., serving up a 500 error to a user) and outages and use that to justify working on testing, in the same way that studies looking into the monetary impact of latency can often drive work on latency reduction. Unfortunately for my idea, a naive analysis showed a fairly low monetary impact and I <a href="metrics-analytics/">immediately found</a> a <a href="tracing-analytics/">number of</a> other <a href="cgroup-throttling/">projects that</a> were <a href="latency-pitfalls/">high impact</a>, so I wrote up a doc explaining that my findings were the opposite of what I needed to justify doing the work that I wanted to do, but I hoped to do a more in-depth follow-up that could overturn my original result, and then worked on projects that were supported by data.</p> <p>This also frequently happens when I write things up here, such as <a href="https://www.patreon.com/posts/60185075">this time I wanted to write up this really compelling sounding story, but, on digging into it, despite it being widely cited in tech circles, I found out that it wasn't true and there wasn't really anything interesting there</a>. It's quite often the case that when I look into something, I find that the angle I was thinking of doesn't work.
When I'm writing for work, I usually feel compelled to at least write up a short doc with evidence of the negative result but, for my personal blog, I don't really feel the same compulsion, so my drafts folder and home drive are littered with abandoned negative results.</p> <p>However, in this case, on digging into the stories in the links and talking to people at various companies about how these systems work, the problem actually seemed worse than I realized before I looked into it, so it felt worth writing up even if I'm writing up something most people in tech know to be true.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Why it's impossible to agree on what's allowed impossible-agree/ Wed, 07 Feb 2024 00:00:00 +0000 impossible-agree/ <p>On large platforms, it's impossible to have policies on things like moderation, spam, fraud, and sexual content that people agree on. David Turner made a simple game to illustrate how difficult this is even in a trivial case, <a href="https://novehiclesinthepark.com/">No Vehicles in the Park</a>. If you haven't played it yet, I recommend playing it now before continuing to read this document.</p> <p>The idea behind the site is that it's very difficult to get people to agree on what moderation rules should apply to a platform. Even if you take a much simpler example (what vehicles should be allowed in a park, given a rule and some instructions for how to interpret the rule) and then ask a small set of questions, people won't be able to agree. On doing the survey myself, one of the first reactions I had was that the questions aren't chosen to be particularly nettlesome and there are many edge cases Dave could've asked about if he wanted to make it a challenge. And yet, despite not making the survey particularly challenging, there isn't broad agreement on the questions. Comments on the survey also indicate another problem with rules, which is that it's much harder to get agreement than people think it will be. If you read comments on rule interpretation or moderation on lobsters, HN, reddit, etc., when people suggest a solution, the vast majority of people will suggest something that anyone who's done moderation or paid attention to how moderation works knows cannot work, the moderation equivalent of <a href="sounds-easy/">&quot;I could build that in a weekend&quot;</a><sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>. Of course we see this with Dave's game as well. The top HN comment, which is also the most agreed-upon comment and a very common sentiment elsewhere, is<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">2</a></sup>:</p> <blockquote> <p>I'm fascinated by the fact that my takeaway is the precise opposite of what the author intended.</p> <p>To me, the answer to all of the questions was crystal-clear. Yes, you can academically wonder whether an orbiting space station is a vehicle and whether it's in the park, but the obvious intent of the sign couldn't be clearer. Cars/trucks/motorcycles aren't allowed, and obviously police and ambulances (and fire trucks) doing their jobs don't have to follow the sign.</p> <p>So if this is supposed to be an example of how content moderation rules are unclear to follow, it's achieving precisely the opposite.</p> </blockquote> <p>And someone approvingly replies with:</p> <blockquote> <p>Exactly.
There is a clear majority in the answers.</p> </blockquote> <p>After going through the survey, you get a graph showing how many people answered yes and no to each question, which is where the &quot;clear majority&quot; comes from. First of all, I think it's not correct to say that there is a clear majority. But even supposing that there were, there's no reason to think that there being a majority means that most people agree with you even if you take the majority position in each vote. In fact, given how &quot;wiggly&quot; the per-question majority graph looks, it would be extraordinary if being in the majority for each question meant that most people agreed with you or that there's any set of positions that the majority of people agree on. Although you could construct a contrived dataset where this is true, it would be very surprising if this were true in a natural dataset.</p> <p>If you look at the data (which isn't available on the site, but Dave was happy to pass it along when I asked), as of when I pulled the data, there was no set of answers that the majority of users agreed on, and it was not even close. I pulled this data shortly after I posted the link to HN, when the vast majority of responses were from HN readers, who are more homogeneous than the population at large. Despite these factors making it easier to find agreement, the most popular set of answers was only selected by 11.7% of people. This is the position the top commenter says is &quot;obvious&quot;, but it's a minority position not only in the sense that only 11.7% of people agree and 88.3% of people disagree, but also in the sense that almost no one holds a position that's only a small amount of disagreement away from this allegedly obvious position. The 2nd and 3rd most common positions, representing 8.5% and 6.5% of the vote, respectively, are similar and only disagree on whether or not a non-functioning WW-II era tank that's part of a memorial violates the rule. Beyond that, approximately 1% of people hold each of the 4th, 5th, 6th, and 7th most popular positions, with every less popular position having less than 1% agreement and a fairly rapid drop-off from there as well. So, 27% of people find themselves in agreement with significantly more than 1% of other users (the median user agrees with 0.16% of other users). See below for a plot of what this looks like. The opinions are sorted from most popular to least popular, with the most popular on the left. A log scale is used because there's so little agreement on opinions that a linear scale plot looks like a few points above zero followed by a bunch of zeros.</p> <p><img src=/images/park-agreement.png alt="a plot illustrating the previous paragraph" width=2000 height=1236></p> <p>Another way to look at this data is that 36902 people expressed an opinion on what constitutes a vehicle in the park and they came up with 9432 distinct opinions, for an average of ~3.9 people per distinct expressed opinion, i.e., an average user agreement of ~0.01%. Although averages are, on average, overused, an average works as a summary for expressing the level of agreement because, while we do have a small handful of opinions with much higher than the average 0.01% agreement, to &quot;maintain&quot; the average, this must be balanced out by a ginormous number of people who have even less agreement with other users. There's no way to have a low average agreement along with high actual agreement unless that's balanced out by even higher disagreement, and vice versa.</p>
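<p>Computing these numbers from the raw responses is just a matter of counting identical answer sets. Here's a minimal sketch of that computation in Python; the file name and the one-answer-set-per-line format are assumptions for illustration (I don't know what format Dave's data actually comes in), but the same counting works however the answers are encoded.</p>
<pre><code>from collections import Counter

# One survey response per line, encoded as a comma-separated string of
# yes/no answers (hypothetical format; any canonical encoding works).
with open("park_responses.csv") as f:
    responses = [line.strip() for line in f if line.strip()]

counts = Counter(responses)   # distinct answer set -> number of respondents
n = len(responses)

print(f"{n} respondents, {len(counts)} distinct answer sets")
print(f"~{n / len(counts):.1f} people per distinct answer set")
print(f"average answer set is held by {1 / len(counts):.2%} of respondents")

# Vote share of the most popular answer sets
for answers, c in counts.most_common(3):
    print(f"{c / n:.1%} of respondents chose {answers}")

# Per-user agreement: fraction of other respondents with an identical answer set
per_user = sorted(counts[r] - 1 for r in responses)
print(f"median user agrees with {per_user[n // 2] / (n - 1):.2%} of other users")
</code></pre>
<p>The two summary numbers in the text come out of the last few lines: the ~0.01% figure is the average group size divided by the total number of respondents, while the median comes from looking at the size of each respondent's own group.</p>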
<p>On HN, in response to the same comment, Michael Chermside had the reasonable but not highly upvoted comment:</p> <blockquote> <p>&gt; To me, the answer to all of the questions was crystal-clear.</p> <p>That's not particularly surprising. But you may be asking the wrong question.</p> <p>If you want to know whether the rules are clear then I think that the right question to ask is not &quot;Are the answers crystal-clear to you?&quot; but &quot;Will different people produce the same answers?&quot;.</p> <p>If we had a sharp drop in the graph at one point then it would suggest that most everyone has the same cutoff; instead we see a very smooth curve as if different people read this VERY SIMPLE AND CLEAR rule and still didn't agree on when it applied.</p> </blockquote> <p>Many (and probably actually most) people are overconfident when predicting what other people think is obvious and often incorrectly assume that other people will think the same thoughts and find the same things obvious. This is more true of the highly-charged issues that result in bitter fights about moderation than the simple &quot;no vehicles in the park&quot; example, but even this simple example demonstrates not only the difficulty in reaching agreement, but the difficulty in understanding how difficult it is to reach agreement.</p> <p>To use an example from another context that's more charged, consider, in any sport, whether or not a player is considered to be playing fair or making dirty plays and should be censured. We could look at many different players from many different sports, so let's arbitrarily pick Draymond Green. If you ask any serious basketball fan who's not a Warriors fan to name the dirtiest player in the NBA today, you'll find general agreement that it's Draymond Green (although some people will argue for Dillon Brooks, so if you want near uniform agreement, you'll have to ask for the top two dirtiest players). And yet, if you ask a Warriors fan about Draymond, most have no problem explaining away every dirty play of his. So if you want to get uniform agreement to a question that's much more straightforward than the &quot;no vehicles in the park&quot; question, such as &quot;<a href="https://twitter.com/BrendenNunesNBA/status/1648180185811025926">is it ok to stomp on another player's chest and then use them as a springboard to leap into the air?</a>&quot; (on top of a hundred other dirty plays), you'll find that for many such seemingly obvious questions, a sizable group of people will have extremely strong disagreements with the &quot;obvious&quot; answer. When you move away from a contrived, abstract example like &quot;no vehicles in the park&quot; to a real-world issue that people have emotional attachments to, it generally becomes impossible to get agreement even in cases where disinterested third parties would all agree, which, as we observed above, is already impossible even without emotional attachment.
And when you move away from sports into issues people care even more strongly about, like politics, the disagreements get stronger.</p> <p>While people might be able to &quot;agree to disagree&quot; on whether or not a non-functioning WW-II era tank that's part of a memorial violates the &quot;no vehicles in the park&quot; rule (resulting in a pair of positions that accounts for 15% of the vote), in reality, people often have a hard time agreeing to disagree over what outsiders would consider very small differences of opinion. Charged issues are often <a href="https://en.wikipedia.org/wiki/Narcissism_of_small_differences">fractally contentious, causing disagreement among people who hold all but identical opinions</a>, making them significantly more difficult to agree on than our &quot;no vehicles in the park&quot; example.</p> <p>To pick a real-world example, consider <abbr title="among tech folks, probably best known as the author of The Tyranny of Structurelessness">Jo Freeman</abbr>, a feminist who, in 1976, wrote about her experience of <a href="https://www.jofreeman.com/joreen/trashing.htm">being canceled for minute differences in opinion and how this was unfortunately common in the Movement</a> (using the term &quot;trashed&quot; and not &quot;canceled&quot; because cancellation hadn't come into common usage yet and, in my opinion, &quot;trashed&quot; is the better term anyway). In the nearly fifty years since Jo Freeman wrote &quot;Trashing&quot;, the propensity of humans to pick on minute differences and attempt to destroy anyone who doesn't completely agree with them hasn't changed; for a recent, parallel example, see <a href="https://www.youtube.com/watch?v=OjMPJVmXxV8">Natalie Wynn's similar experience</a>.</p> <p>For people whose own opinions are far away in the space of commonly held opinions, the differences in opinion between Natalie and the people calling for her to be deplatformed are fairly small. But, not only did these &quot;small&quot; differences in opinion result in people calling for Natalie to be deplatformed, they called for her to be physically assaulted, doxed, etc., and they suggested the same treatment for her friends and associates as well as for people who didn't really associate with her, but publicly talked about similar topics and didn't cancel her. Even now, years later, she still gets calls to be deplatformed and I expect this will continue past the end of my life (when I wrote this, years after the event Natalie discussed, I did a Twitter search and found a long thread from someone ranting about what a horrible human being Natalie is for the alleged transgression discussed in the video, dated 10 days ago, and it's easy to find more of these rants). I'm not going to attempt to describe the difference in positions because the positions are close enough that to describe them would take something like 5k to 10k words (as opposed to, say, a left-wing vs. a right-wing politician, where the difference is blatant enough that you can describe it in a sentence or two); you can watch the hour in the 1h40m video that's dedicated to the topic if you want to know the full details.</p> <p>The point here is just that, if you look at almost any person who has public opinions on charged issues, the opinion space is fractally contentious. No large platform can satisfy user preferences because users will disagree over what content should be moderated off the platform and what content should be allowed.
And, of course, this problem scales up as the platform gets larger<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">3</a></sup>.</p> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl"><b>If you're looking for work, Freshpaint is hiring (US remote) in engineering, sales, and recruiting</b></a>. Disclaimer: I may be biased since I'm an investor, but they seem to have found product-market fit and are rapidly growing.</p> <p><i>Thanks to Peter Bhat Harkins, Dan Gackle, Laurence Tratt, Gary Bernhardt, David Turner, Kevin Burke, Sophia Wisdom, Justin Blank, and Bert Muthalaly for comments/corrections/discussion.</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M"><p>Something I've repeatedly seen on every forum I've been on is <a href="https://mastodon.social/@danluu/110561295139766945">the suggestion that we just don't need moderation after all and all our problems will be solved if we just stop this nasty censorship</a>. If you want a small forum that's basically 4chan, then no moderation can work fine, but even if you want a big platform that's like 4chan, no moderation doesn't actually work. If we go back to those Twitter numbers, 300M users and 1M bots removed a day, if you stop doing this kind of &quot;censorship&quot;, the platform will quickly fill up with bots to the point that everything you see will be spam/scam/phishing content or content from an account copying content from somewhere else or using LLM-generated content to post spam/scam/phishing content. Not only will most accounts be bots, bots will be a part of large engagement/voting rings that will drown out all human content.</p> <p>The next most naive suggestion is to stop downranking memes, dumb jokes, etc., often thrown in with a comment like &quot;doesn't anyone here have a sense of humor?&quot;. If you look at why forums with upvoting/ranking ban memes, it generally happens after the forum becomes totally dominated by memes/comics because people upvote those at a much higher rate than any kind of content with a bit of nuance, and not everyone wants a forum that's full of the lowest common denominator meme/comic content. And as for &quot;having a sense of humor&quot; in comments, if you look at forums that don't ban cheap humor, top comments will generally end up dominated by these, e.g., for maybe 3-6 months, one of the top comments on any kind of story about a man doing anything vaguely heroic on reddit forums that don't ban this kind of cheap humor was some variant of &quot;I'm surprised he can walk with balls that weigh 900 lbs.&quot;, often repeated multiple times by multiple users, amidst a sea of the other cheap humor that was trendy during that period.
Of course, some people actually want that kind of humor to dominate the comments; they actually want to see the same comment 150 times a day for months on end, but I suspect most people who grumpily claim &quot;no one has a sense of humor here&quot; when their cheap humor gets flagged don't actually want to read a forum that's full of other people's cheap humor.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:C">This particular commenter indicates that they understand that moderation is, in general, a hard problem; they just don't agree with the &quot;no vehicles in the park&quot; example, but many other people think that both the park example and moderation are easy. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:F"><p>Nowadays, it's trendy to use &quot;federation&quot; as a cure-all in the same way people used &quot;blockchain&quot; as a cure-all five years ago, but federation doesn't solve this problem for the typical user. I actually had a conversation with someone who notes in their social media bio that they're one of the creators of the ActivityPub spec, who claimed that federation does solve this problem and that Threads adding ActivityPub would create some kind of federating panacea. I noted that fragmentation is already a problem for many users on Mastodon and whether or not Threads will be blocked is contentious and will only increase fragmentation, and the ActivityPub guy replied with something like &quot;don't worry about that, most people won't block Threads, and it's their problem if they do.&quot;</p> <p>I noted that a problem many of my non-technical friends had when they tried Mastodon was that they'd pick a server and find that they couldn't follow someone they wanted to follow due to some kind of server blocking or ban. So then they'd try another server to follow this one person and then find that another person they wanted to follow is blocked. The fundamental problem is that users on different servers want different things to be allowed, which then results in no server giving you access to everything you want to see. The ActivityPub guy didn't have a response to this and deleted his comment.</p> <p>By the way, a problem that's much easier than moderation/spam/fraud/obscene content/etc. policy that the fediverse can't even solve is how to present content. Whenever I use Mastodon to interact with someone using &quot;honk&quot;, messages get mangled. For example, a <code>&quot;</code> in the subject (and content warning) field of a Mastodon message gets converted to <code>&amp;quot;</code> by the time the Mastodon user sees the reply from the honk user, so every reply from a honk user forks the discussion into a different subject. Here's something that can be fully specified without ambiguity, where people are much less emotionally attached to the subject than they are for moderation/spam/fraud/obscene content/etc., and the fediverse can't even solve this problem across two platforms.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> </ol> </div> Notes on Cruise's pedestrian accident cruise-report/ Mon, 29 Jan 2024 00:00:00 +0000 cruise-report/ <p>This is a set of notes on the Quinn Emanuel report on Cruise's handling of the 2023-10-02 accident where a Cruise autonomous vehicle (AV) hit a pedestrian, stopped, and then started moving again with the pedestrian stuck under the bottom of the AV, dragging the pedestrian 20 feet.
After seeing some comments about this report, I read five stories about it and then skimmed the report itself, and my feeling is that the authors of four of the stories probably didn't read the report, and that the people who were commenting had generally read stories by journalists who did not appear to have read the source material, so the comments were generally way off base. <a href="dunning-kruger/">As we previously discussed, it's common for summaries to be wildly wrong, even when they're summarizing a short paper that's easily read by laypeople</a>, so of course summaries of a 200-page report are likely to be misleading at best.</p> <p>On reading the entire report, I'd say that Cruise both looks better and worse than in the articles I saw, which is the same pattern we saw when we looked at the actual source for <a href="elon-twitter-texts/">Exhibits H and J from Twitter v. Musk</a>, the <a href="us-v-ms/">United States v. Microsoft Corp. docs</a>, etc.; just as some journalists seem to be pro/anti-Elon Musk and pro/anti-Microsoft, willing to push an inaccurate narrative to dunk on them to the maximum extent possible or exonerate them to the maximum extent possible, we see the same thing here with Cruise. And as we saw in those cases, despite some articles seemingly trying to paint Cruise in the best or worst light possible, the report itself has material that is more positive and more negative than we see in the most positive or negative stories.</p> <p>Aside from correcting misleading opinions on the report, I find the report interesting because it's rare to see any kind of investigation into what went wrong in tech at this level of detail, let alone a public one. We often see this kind of investigation in safety-critical systems and sometimes see it in sports as well as for historical events, but tech events are usually not covered like this. Of course companies do post-mortems of incidents, but you generally won't see a 200-page report on a single incident, nor will the focus of post-mortems be what the focus was here. In the past, <a href="wat/">we've noted that a lot can be learned by looking at the literature and incident reports on safety-critical systems</a>, so of course this is true here as well, where we see a safety-critical system that's more tech-adjacent than the ones we've looked at previously.</p> <p>The length and depth of the report here reflect a difference in culture between safety-critical systems and &quot;tech&quot;. The behavior that's described as unconscionable in the report is not only normal in tech, but probably more transparent and above board than you'd see at most major tech companies; I find the culture clash between tech and safety-critical systems interesting as well. I attempted to inject as little of my opinion as possible into these notes on the report, even in cases where knowledge of tech companies or engineering meant that I would've personally written something different. For more opinions, <a href="#back-to-danluu-com">see the section at the end</a>.</p> <p><b><a href="https://assets.ctfassets.net/95kuvdv8zn1v/1mb55pLYkkXVn0nXxEXz7w/9fb0e4938a89dc5cc09bf39e86ce5b9c/2024.01.24_Quinn_Emanuel_Report_re_Cruise.pdf">REPORT TO THE BOARDS OF DIRECTORS OF CRUISE LLC, GM CRUISE HOLDINGS LLC, AND GENERAL MOTORS HOLDINGS LLC REGARDING THE OCTOBER 2, 2023 ACCIDENT IN SAN FRANCISCO</a></b></p> <h2 id="i-introduction">I. Introduction</h2> <h3 id="a-overview">A.
Overview</h3> <ul> <li><strong>2023-10-24</strong>: California DMV suspended Cruise's driverless license</li> <li><strong>2023-10-02</strong>: human-driven Nissan hit a pedestrian, putting the pedestrian in the path of a Cruise autonomous vehicle (AV), which then dragged the pedestrian <code>20 feet</code> before stopping</li> <li>DMV claims <ul> <li>Cruise failed to disclose that the AV moved forward after its initial impact</li> <li>video Cruise played only shows a part of the accident and not the pedestrian dragging</li> <li>DMV only learned about dragging from another government agency, &quot;impeding its oversight&quot;</li> </ul></li> <li>NHTSA and CPUC also took action against Cruise and made similar claims</li> <li>Media outlets also complained they were misled by Cruise</li> <li>Cruise leadership and Cruise employees who talked with regulators admit they didn't explain the dragging, but they said they played the full video clip; in all but one of the meetings, internet issues may have prevented regulators from seeing the entire accident</li> <li>Cruise employees claim the NHTSA received the full video immediately after the <b>10-03</b> meeting and the CPUC declined the offer for the full video</li> <li>Cruise employees note they played the full video, with no internet issues, to the SF MTA, SFPD, and SFFD on <b>10-03</b> and had a full discussion with those agencies</li> <li>Cruise leadership concedes they never informed the media, but leadership believed that Cruise's obligations to the media are different from their obligations to regulators</li> </ul> <h3 id="b-scope-of-review">B. Scope of Review</h3> <ul> <li>[no notes]</li> </ul> <h3 id="c-review-plan-methodology-and-limitations">C. Review Plan Methodology and Limitations</h3> <ul> <li>205k &quot;documents&quot;, &quot;including e-mails, texts, Slack communications, and internal Cruise documents&quot;</li> <li>Interviewed 88 current and former employees and contractors</li> <li>Reviewed a report by Exponent Inc., a 3rd-party firm</li> <li>Internal review only; did not interview regulators and public officials</li> <li>A number of employees and contractors were not available &quot;due to personal circumstances and/or the wide-scale Reduction in Force&quot;, but these interviews were not deemed to be important</li> <li>Report doesn't address broader issues outside of mandate, &quot;such as the safety or safety processes of Cruise AVs or its operations, which are more appropriately evaluated by those with engineering and technical safety expertise&quot;</li> </ul> <h3 id="d-summary-of-principal-findings-and-conclusions">D. Summary of Principal Findings and Conclusions</h3> <ul> <li>By morning of <strong>10-03</strong>, leadership and 100+ employees knew the pedestrian had been dragged ~20ft.
by the Cruise AV during the secondary movement after the AV came to a stop.</li> <li>Plan was to disclose this happened by playing the full video, &quot;letting the 'video speak for itself.'&quot; <ul> <li>Cruise assumed that regulators and government officials would ask questions and Cruise would provide further info</li> </ul></li> <li>&quot;Weight of the evidence&quot; is that Cruise attempted to play the full video, but in 3 meetings, internet issues prevented this from happening and Cruise didn't point out that the pedestrian dragging happened</li> <li>On <strong>10-02</strong> and <strong>10-03</strong>, &quot;Cruise leadership was fixated on correcting the inaccurate media narrative&quot; that Cruise's AV had caused the accident <ul> <li>This led Cruise to convey information about the Nissan and omit &quot;other important information&quot; about the accident to &quot;the media, regulators, and other government officials&quot;</li> </ul></li> <li>&quot;The reasons for Cruise’s failings in this instance are numerous: poor leadership, mistakes in judgment, lack of coordination, an 'us versus them' mentality with regulators, and a fundamental misapprehension of Cruise’s obligations of accountability and transparency to the government and the public. Cruise must take decisive steps to address these issues in order to restore trust and credibility.&quot;</li> <li>&quot;the DMV Suspension Order is a direct result of a proverbial self-inflicted wound by certain senior Cruise leadership and employees who appear not to have fully appreciated how a regulated business should interact with its regulators ... it was a fundamentally flawed approach for Cruise or any other business to take the position that a video of an accident causing serious injury provides all necessary information to regulators and otherwise relieves them of the need to affirmatively and fully inform these regulators of all relevant facts. As one Cruise employee stated in a text message to another employee about this matter, our 'leaders have failed us.'&quot;</li> </ul> <h2 id="ii-the-facts-regarding-the-october-2-accident">II. THE FACTS REGARDING THE OCTOBER 2 ACCIDENT</h2> <h3 id="a-background-regarding-cruise-s-business-operations">A. Background Regarding Cruise’s Business Operations</h3> <ul> <li>Cruise founded in 2013, acquired by GM in 2016 (GM owns 79%)</li> <li>Cruise's stated goal: &quot;responsibly deploy the world’s most advanced driverless vehicle service&quot;</li> <li>&quot;Cruise’s stated mission is to make transportation cleaner, safer and more accessible&quot;</li> <li>Driverless ride-hail operation started 2021-09 in SF</li> <li>Started charging in 2022-06</li> <li>Has expanded to other areas, including overseas</li> <li><b>10-02</b> accident was first pedestrian injury in &gt; <code>5M mi</code> of driving</li> </ul> <h3 id="b-key-facts-regarding-the-accident">B. Key Facts Regarding the Accident</h3> <ul> <li><strong>10-02, <i>9:29pm</i></strong>: human-driven Nissan Sentra strikes pedestrian in crosswalk of 4-way intersection at 5th &amp; Market in SF</li> <li>Pedestrian entered crosswalk against a red light and &quot;Do Not Walk&quot; signal and then paused in Nissan's lane. 
Police report cited both driver and pedestrian for code violations and concluded that the driver was &quot;most at fault&quot;</li> <li>The impact launched the pedestrian into the path of the Cruise AV</li> <li>Cruise AV braked but still hit pedestrian</li> <li>After coming to a complete stop, the AV moved to find a safe place to stop, known as a &quot;'minimal risk condition' pullover maneuver (pullover maneuver) or 'secondary movement.'&quot;</li> <li>AV drove up to <code>7.7mph</code> for <code>20 feet</code>, dragging pedestrian with it</li> <li>Nissan driver fled the scene (hit and run)</li> </ul> <h3 id="c-timeline-of-key-events">C. Timeline of Key Events</h3> <ul> <li><strong>10-02, <i>9:29pm</i></strong>: accident occurs and Nissan driver flees; AV transmits low-res 3-second video (Offload 1) confirming collision to Cruise Remote Assistance Center</li> <li><strong><i>9:32pm</i></strong>: AV transmits medium-res 14-second video (Offload 2) of collision, but not pullover maneuver and pedestrian dragging</li> <li><strong><i>9:33pm</i></strong>: emergency responders arrive between 9:33pm and 9:38pm</li> <li><b><i>9:40pm</b></i>: SFFD uses heavy rescue tools to remove pedestrian from under AV</li> <li><b><i>9:49pm</b></i>: Cruise Incident Response team labels accident a &quot;Sev-1&quot;, which is for minor collisions. Team created a virtual &quot;war room&quot; on Google Meet and a dedicated slack channel (war room slack channel) with ~20 employees</li> <li><b><i>10:17pm</b></i>: Cruise contractors arrive at accident scene. One contractor takes &gt; 100 photos and videos and notes blood and skin patches on the ground, showing the AV moved from point-of-impact to final stopping place <ul> <li>Another contractor, with Cruise's authorization, gives SFPD 14-second video showing Nissan</li> </ul></li> <li><b><i>11:31pm</b></i>: Cruise raises incident to &quot;Sev-0&quot;, for &quot;major vehicle incident with moderate to major injury or fatality to any party&quot;. Maybe 200 additional employees are paged to the war room</li> <li><b>10-03, <i>12:15am</b></i>: incident management convenes virtual meeting to share updates about accident and discuss media strategy to rebut articles that AV caused the accident</li> <li><b><i>12:45am</b></i>: Cruise govt affairs team reaches out to govt officials</li> <li><b><i>12:53am</b></i>: Cruise issues a press release noting the Nissan caused the accident. CEO <abbr title="CEO, CTO, President, and co-founder">Kyle Vogt</abbr> and Communications VP <abbr title="Communications VP">Aaron McLear</abbr> heavily edit the press statement. No mention of pullover maneuver or dragging; Cruise employees claim they were not aware of those facts at the time</li> <li><b><i>1:30am</b></i>: AV back at Cruise facility; process of downloading collision report data from AV, including full video, begins</li> <li><b><i>2:14am</b></i>: 45-second video of accident, which depicts pullover maneuver and dragging, becomes available, but no Cruise employee receives a notification that it's ready until &gt; 4 hours later, when all data from AV is processed</li> <li><b><i>3:21am</b></i>: At the request of Cruise govt. Affairs, Director of Systems Integrity <abbr title="Director of Systems Integrity">Matt Wood</abbr> creates 12s video of accident showing Nissan hitting pedestrian and pedestrian landing in front of Cruise AV.
Video stops before AV hits pedestrian</li> <li><b><i>3:45am</b></i>: <abbr title="Director of Systems Integrity">Wood</abbr> posts first known communication within Cruise of pullover maneuver and pedestrian dragging to war room slack channel with 77 employees in the channel at the time. <abbr title="Director of Systems Integrity">Wood</abbr> says the AV moved 1-2 car lengths after initial collision</li> <li><b><i>6:00am</b></i>: Cruise holds virtual Crisis Management Team (CMT) meeting; pedestrian dragging is discussed. Subsequent slack messages (6:17am, 6:25am, 6:56am) confirm discussion on pullover maneuver and dragging</li> <li><b><i>6:28am</b></i>: Cruise posts 45s 9-pane video of pullover and dragging, the full video (offload 3) to war room slack channel</li> <li><b><i>6:45am</b></i>: Virtual Senior Leadership Team (SLT) meeting; <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="Communications VP">McLear</abbr> discuss whether or not to share full video with media or alter Cruise press statement and decide to do neither</li> <li><b><i>7:25am</b></i>: Cruise govt. affairs employee emails NHTSA and offers to meet</li> <li><b><i>7:45am</b></i>: Cruise eng and safety teams hold preliminary meetings to discuss collision and pullover maneuver</li> <li><b><i>9:05am</b></i>: Cruise regulator, legal, and systems integrity employees have pre-meeting to prepare for NHTSA briefing; they discuss pullover and dragging</li> <li><b><i>10:05am</b></i>: <abbr title="Director of Systems Integrity">Wood</abbr> and VP of Global Government Affairs Prashanthi <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> have virtual meeting with Mayor of SF's transpo advisor. <abbr title="Director of Systems Integrity">Wood</abbr> shows full video, &quot;reportedly with internet connectivity issues from his home computer&quot;; neither <abbr title="Director of Systems Integrity">Wood</abbr> nor <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> brings up or discusses pullover or dragging</li> <li><b><i>10:30am</b></i>: Virtual meeting with NHTSA. <abbr title="Director of Systems Integrity">Wood</abbr> shows full video, &quot;again having internet connectivity issues causing video to freeze and/or black-out in key places including after initial impact&quot; and again not bringing up or discussing pullover or dragging</li> <li><b><i>10:35am</b></i>: Cruise eng and safety teams have 2nd meeting to discuss collision</li> <li><b><i>11:05am</b></i>: Cruise regulatory, legal, and systems integrity employees have pre-meeting for DMV and California Highway Patrol (CHP) briefing; Cruise team doesn't discuss pullover and dragging</li> <li><b><i>11:30am</b></i>: hybrid in-person and virtual meeting with DMV and CHP. <abbr title="Director of Systems Integrity">Wood</abbr> shows full video, again with internet connectivity issues and again not bringing up or discussing pullover or dragging</li> <li><b><i>12:00pm</b></i>: virtual Cruise CMT meeting; engineers present findings, including chart detailing movement of AV during accident. Shows how AV collided with pedestrian and then moved forward again, dragging pedestrian ~20 ft. 
AV programmed to move as much as 100 ft, but internal AV systems flagged a failed wheel speed sensor because wheels were moving at different speeds (because one wheel was spinning on pedestrian's leg), stopping the car early</li> <li><b><i>12:30pm</b></i>: Cruise govt affairs employee calls CPUC to discuss <b>10-02</b> accident and video</li> <li><b><i>12:40pm</b></i>: Cruise virtual SLT meeting. Chart from CMT meeting presented. <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr>, COO <abbr title="COO">Gil West</abbr>, Chief Legal Officer <abbr title="Chief Legal Officer">Jeff Bleich</abbr>, and others present. Safety and eng teams raise question of grounding fleet; <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="COO">West</abbr> say no</li> <li><b><i>1:40pm</b></i>: full video uploaded to NHTSA</li> <li><b><i>2:37pm</b></i>: Cruise submits 1-day report to NHTSA; no mention of pullover or dragging</li> <li><b><i>3:30pm</b></i>: Cruise virtual meeting with SF MTA, SFPD, and SFFD. <abbr title="Director of Systems Integrity">Wood</abbr> &quot;shows full video several times&quot; without technical difficulties. Cruise doesn't bring up pullover maneuver or dragging, but officials see it and ask Cruise questions about it</li> <li><b><i>6:05pm</b></i>: Cruise CMT meeting. <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="COO">West</abbr> end Sev-0 war room. Some cruise employees later express concerns about this</li> <li><b>10-05, <i>10:46am</b></i>: Forbes asks Cruise for comment on AV dragging. Cruise declines to comment and stands by <b>10-03</b> press release</li> <li><b><i>1:07pm</b></i>: CPUC sends request for information with <b>10-19</b> response deadline</li> <li><b>10-06, <i>10:31am</b></i>: Forbes publishes &quot;Cruise Robotaxi Dragged Woman 20 Feet in Recent Accident, Local Politician Says&quot;</li> <li><b>10-10, <i>4:00pm</b></i>: DMV requests more complete video from Cruise. Cruise responds same day, offering to screenshare video</li> <li><b>10-11, <i>11am</i></b>: Cruise meeting with DMV on operational issues unrelated to accident. DMV's video request &quot;briefly discussed&quot;</li> <li><b><i>12:48pm</i></b>: Cruise paralegal submits 10-day report to NHTSA after checking for updates. Report doesn't mention pullover or dragging &quot;as no one told the paralegal these facts needed to be added&quot;</li> <li><b>10-12, <i>3pm</b></i>: NHTSA notifies Cruise that it intends to open Preliminary Evaluation (PE) for <b>10-02</b> accident and 3 other pedestrian-related events</li> <li><b>10-13, <i>10am</b></i>: Cruise meets with DMV and CHP to share 9m 6-pane video and DMV clarifies that it wants the 45s 9-pane video (&quot;full video&quot;)</li> <li><b><i>12:19pm</b></i>: Cruise uploads full video</li> <li><b><i>1:30pm</b></i>: Cruise meets with NHTSA and argues that PE is unwarranted</li> <li><b><i>10-16, 11:30am</b></i>: Cruise meets with DMV and CHP, who state they don't believe they were shown full video during <b>10-03</b> meeting</li> <li><b><i>10-16</b></i>: NHTSA officially opens PE</li> <li><b>10-18, <i>3:00pm</b></i>: Cruise holds standing monthly meeting with CPUC. 
Cruise says they'll meet CPUC's <b>10-19</b> deadline</li> <li><b>10-19, <i>1:40pm</b></i>: Cruise provides information and full video in response to <b>10-05</b> request</li> <li><b>10-23, <i>2:35pm</b></i>: Cruise learns of possible DMV suspension of driverless permit</li> <li><b>10-24, <i>10:28am</b></i>: DMV issues suspension of Cruise's driverless permit. Except for the few employees who heard on <b>10-23</b>, Cruise employees are surprised</li> <li><b><i>10:49am</b></i>: Cruise publishes blog post which states: &quot;Shortly after the incident, our team proactively shared information with the California Department of Motor Vehicles (DMV), California Public Utilities Commision [sic] (CPUC), and National Highway Traffic Safety Administration (NHTSA), including the full video, and have stayed in close contact with regulators to answer their questions&quot;</li> <li><b>11-02, <i>12:03pm</b></i>: Cruise submits 30-day NHTSA report, which includes discussion of pullover and dragging</li> <li><b>11-02</b>: Cruise recalls 950 systems as a result of <b>10-02</b> accident</li> <li><b>12-01</b>: CPUC issues Order to Show Cause &quot;for failing to provide complete information and for making misleading public comments regarding the <b>October 2, 2023</b> Cruise related incident and its subsequent interactions with the commission&quot;</li> </ul> <h3 id="d-video-footage-of-the-accident">D. Video Footage of the Accident</h3> <ul> <li>6 videos <ul> <li><strong>Offload 1; <i>9:29pm</i></strong>: low res, 3s, 4-pane. Captures 3s immediately after collision, including audio</li> <li><strong>Offload 2; <i>9:32pm</i></strong>*: 14s, 9-pane. No audio. Shows Nissan pedestrian collision and pedestrian being thrown into path of Cruise AV</li> <li><strong>Media Video; <i>10:04pm</i></strong>: 21s, 4-pane. Derived from offload 2, but slowed down</li> <li><strong><i>1:06am</i></strong>: 4s clip of offload 2, cut by <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and sent to SVP of Government Affairs <abbr title="SVP of Government Affairs">David Estrada</abbr> and Chief Legal Officer <abbr title="Chief Legal Officer">Jeff Bleich</abbr> with &quot;this is the cut I was thinking of&quot;. <abbr title="SVP of Government Affairs">Estrada</abbr> responds with &quot;yes, agree that should be the primary video that gets released if one is released&quot;. Video is a single pane from left-front of AV and only shows Nissan hitting pedestrian. <abbr title="SVP of Government Affairs">Estrada</abbr> says this should be shown in meetings with regulators first to show &quot;what happened in clarity so you can see how the event happened (establish clear fault of human driver)&quot;, but &quot;no evidence this shorter 4-second video was shown at any of the regulatory meetings.&quot;</li> <li><strong><i>3:21am</i></strong>: 12s 9-pane video derived from offload 2. Cruise VP of Global Government Affairs <abbr title="VP of Global Government Affairs Prashanthi">Prashanthi Raman</abbr> and <abbr title="SVP of Government Affairs">Estrada</abbr> asked <abbr title="Director of Systems Integrity">Wood</abbr> for shorter version of 14s video, &quot;given last night’s Sev 0 and our need to discuss with policymakers, can you please make us a usable video of this angle [link to Webviz]. We only need to show the impact and the person landing in front of us and then cut it there&quot;. <abbr title="Director of Systems Integrity">Wood</abbr> created the video. 
Cruise’s Senior Director of Federal Affairs <abbr title="Senior Director of Federal Affairs">Eric Danko</abbr> tells <abbr title="Director of Systems Integrity">Wood</abbr>, &quot;believe NHTSA will want video footage that captures moment of our impact as well&quot; and <abbr title="Director of Systems Integrity">Wood</abbr> replies, &quot;can create a NHTSA version video once the logs have been offloaded&quot;</li> <li><strong>&quot;full video&quot;; <i>6:28am</i></strong>: 45s, 9-pane, shows pullover and dragging. No audio. Link to full video posted to war room slack at 6:28am</li> </ul></li> </ul> <h3 id="e-the-facts-regarding-what-cruise-knew-and-when-about-the-october-2-accident">E. The Facts Regarding What Cruise Knew and When About the October 2 Accident</h3> <h4 id="1-facts-cruise-learned-the-evening-of-october-2">1. Facts Cruise Learned the Evening of October 2</h4> <h5 id="a-accident-scene">a. Accident Scene</h5> <ul> <li>Driverless Support Specialists (DSS) arrive at scene, 9:39pm to 9:44pm</li> <li>Another 2-person DSS team arrives with member of operations team and member of Safety Escalation team (SET), 10:00-10:30pm</li> <li>At least one contractor takes &gt; 100 photos and video and indicated an understanding of pedestrian dragging <ul> <li>Contractor noted blood and skin pieces, took long shots of trail of blood that indicated the AV traveled after impact; contractor was instructed to bring phone to Cruise instead of uploading video onto customary Slack channel; contractor believes this was to protect injured pedestrian's privacy</li> <li>Photos and video uploaded into &quot;RINO&quot; database at 2:23am, accessed by &gt; 100 employees starting at <b>10-03, <i>5:11am</i></b>; DB doesn't show which specific employees reviewed specific photos and videos</li> </ul></li> <li>Another person at the scene denied knowing about dragging</li> <li>In Cruise's internal review, prior to Quinn Emanuel, one Remote Assistance operator (RA) said they saw &quot;ped flung onto hood of AV. You could see and hear the bump&quot; and another saw AV &quot;was already pulling over to the side&quot;. Quinn Emanuel didn't find out about these until after the RIF on 12-14. On reaching out, one declined the interview and the other didn't respond <ul> <li>Two other interviewees reported discussion of secondary movement of AV on evening of <b>10-02</b> or early morning of <b>10-03</b> but &quot;this information has not been verified and appears contrary to the weight of the evidence&quot;</li> </ul></li> <li>No employees interviewed by Quinn Emanuel indicated they knew about dragging on <b>10-02</b></li> </ul> <h5 id="b-virtual-sev-0-war-room">b. Virtual &quot;Sev-0 War Room&quot;</h5> <ul> <li>20 people initially in war room</li> <li>200+ joined and left war room on <b>10-02</b> and <b>10-03</b></li> <li>2 interviewees recalled discussion of pedestrian dragging in Sev-0 war room on Meet. Neither could identify who was involved in the discussion or the timing of the discussion. One said it was after 4:00am</li> <li>Cruise incident response playbook outlines roles of Incident Commander, SLT, CMT as well as how to respond in the weeks after incident. Playbook was not followed, said to be &quot;aborted&quot; because &quot;too manually intensive&quot;</li> </ul> <h5 id="c-initial-media-narrative-about-the-october-2-accident">c.
Initial Media Narrative About the October 2 Accident</h5> <ul> <li>&quot;Although the War Room was supposed to address a variety of issues such as understanding how the accident happened and next steps, the focus quickly centered almost exclusively on correcting a false media narrative that the Cruise AV had caused the Accident&quot;</li> </ul> <h4 id="2-facts-cruise-learned-on-october-3">2. Facts Cruise Learned on October 3</h4> <h5 id="a-the-12-15-a-m-sev-0-collision-sfo-meeting">a. The 12:15 a.m. &quot;Sev-0 Collision SFO&quot; Meeting</h5> <ul> <li>CMT Incident Manager convened meeting with 140 invites</li> <li>Focus on sharing updates and media narrative strategy</li> <li>Slack communications show Cruise employees viewed risk that public could think that Cruise AV injured pedestrian as a crisis</li> <li><abbr title="SVP of Government Affairs">Estrada</abbr> says to <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>, &quot;feels like we are fighting with both arms tied behind our back if we are so afraid of releasing an exonerating video, very naïve if we think we won’t get walloped by media and enemies&quot; <ul> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> responds, &quot;we are under siege is my opinion, we have no fighting chance with these headlines/media stories…we are drowning — and we will lose every time&quot;</li> <li>above statement is said to have &quot;captured well the feeling within Cruise’s senior leadership&quot;</li> </ul></li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> attended meeting and wanted to reveal only 4s part of clip showing the Nissan hitting the pedestrian <ul> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> insisted he wanted to authorize any video or media statement before release, &quot;nothing would be shared or done&quot; without his approval</li> </ul></li> <li>In parallel, comms team drafted bullet points to share with media, including &quot;AV came to a complete stop immediately after impacting the struck pedestrian&quot;, which comms team did not know was inaccurate</li> </ul> <h5 id="b-engineer-s-3-45-a-m-slack-message">b. Engineer’s 3:45 a.m. Slack Message</h5> <ul> <li>Slack communication <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr>: I have not seen this mentioned yet, but in the 1st RA Session the AV is stopped nearly right next to the adjacent vehicle but drives forward another 1-2 car lengths before coming to it's [sic] final position.</li> <li>Unnamed employee: ACP, I can’t access the link but is the PED under the vehicle while it continues to move? Am I understanding that correctly</li> <li><abbr title="Director of Systems Integrity">Wood</abbr>: I believe so and the AV video can be seen moving vertically</li> </ul></li> <li><abbr title="Director of Systems Integrity">Wood</abbr> determined this by looking at data from RA center, which implied AV movement of 1-2 car lengths with dragged pedestrian</li> </ul> <h5 id="c-the-6-00-a-m-crisis-management-team-cmt-meeting">c. The 6:00 a.m. 
Crisis Management Team (CMT) Meeting</h5> <ul> <li>CMT discussed pullover and dragging</li> <li>100+ in meeting, including &quot;COO Gil <abbr title="COO">West</abbr>, co-founder and Chief Product Officer Dan Kan, VP of Communications, Senior Director of Federal Affairs, and members of the communications, legal, engineering, safety, regulatory, and government affairs teams&quot;</li> <li>6:17am, engineer slacks <abbr title="Director of Systems Integrity">Wood</abbr>, &quot;have they raised the issue that the AV moved post-event? On this call. I joined late&quot; and <abbr title="Director of Systems Integrity">Wood</abbr> responds &quot;Not yet. I will raise&quot;</li> <li>Slack conversation during the meeting, from <abbr title="COO">West</abbr> to 6 other senior leaders: <ul> <li><abbr title="COO">West</abbr>: ACP- For awareness it was reported at the CMT meeting that the AV moved 1-2 vehicle lengths before RA connection (low collision and looking for pull over before e-stop hit)</li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr>: Should we run road to sim and see what the AV would have done if it was in the other vehicles position? I think that might be quite powerful</li> <li><abbr title="COO">West</abbr>: Good idea- I suspect the AV would have stopped and never hit the Ped in the first place</li> </ul></li> <li>engineer summarized CMT meeting in war room slack, &quot;in the CMT meeting this morning, there was discussion of releasing/sharing video at some point. <abbr title="Director of Systems Integrity">Matt Wood</abbr> also noted that the AV travels at a slow speed after the collision, with the pedestrian underneath the car (it's about 7 meters). It wasn't discussed, but i wanted to point out that someone who has access to our AV video above up until the point of the collision could see later that the AV traveled about this distance post-collision, because there is a video on social media which shows the AV stopped with the pedestrian underneath, and there are some markers in the scene.&quot; <ul> <li>engineer also pointed out that non-engineers should be able to deduce the pedestrian dragging from the video from before the AV impact plus the social media video showing the AV's final position. After the DMV suspension order, the engineer also said &quot;I pointed out in the channel that it was not hard to conclude there was movement after the initial stop…it seems the DMV fully understanding the entire details was predictable&quot;</li> </ul></li> </ul> <h5 id="d-the-6-45-a-m-senior-leadership-team-slt-meeting">d. The 6:45 a.m. Senior Leadership Team (SLT) Meeting</h5> <ul> <li>Dragging discussed in SLT meeting</li> <li>SLT discussed amending media statement, &quot;the outcome [of these discussions] was whatever statement was published on social we would stick with because the decision was we would lose credibility by editing a previously agreed upon statement&quot;</li> <li>At this point, senior members of comms team knew that &quot;AV came to a complete stop immediately after impacting the struck pedestrian&quot; statement was inaccurate, but comms team continued giving inaccurate statement to press after SLT meeting, resulting in publications with incorrect statements in Forbes, CNBC, ABC News Digital, Engadget, Jalopnik, and The Register <ul> <li>&quot;complete stop&quot; removed on <b>10-13</b> after comms employee flagged statement to legal, which said &quot;I don’t think we can say this&quot;</li> </ul></li> </ul> <h5 id="e-the-7-45-a-m-and-10-35-a-m-engineering-and-safety-team">e. The 7:45 a.m.
and 10:35 a.m. Engineering and Safety Team Meetings</h5> <ul> <li>[no notes]</li> </ul> <h5 id="f-the-12-05-p-m-cmt-meeting">f. The 12:05 p.m. CMT Meeting</h5> <ul> <li>[no notes]</li> </ul> <h5 id="g-the-12-40-p-m-slt-meeting">g. The 12:40 p.m. SLT Meeting</h5> <ul> <li>&quot;<abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> is said to have stated that it was good the AV stopped after 20 feet when it detected interference with its tire rather than continuing, as AVs are programmed to do, to look for a safe place to pull over for up to 100 feet or one full block&quot;</li> <li>Safety and eng teams raised question of grounding fleet until fix deployed <ul> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="COO">West</abbr> rejected idea</li> </ul></li> </ul> <h5 id="h-the-6-05-p-m-cmt-meeting">h. The 6:05 p.m. CMT Meeting</h5> <ul> <li>CMT leaders learn SLT is disbanding Sev-0 war room</li> <li>Some interviewees expressed concern that no future CMT meetings scheduled for biggest incident in Cruise history <ul> <li>Some suggested to Chief Legal Officer <abbr title="Chief Legal Officer">Jeff Bleich</abbr> that &quot;miniature CMT&quot; should continue to meet; <abbr title="Chief Legal Officer">Bleich</abbr> and others supportive, but this wasn't done</li> </ul></li> </ul> <h4 id="3-cruise-s-response-to-the-forbes-article">3. Cruise’s Response to the Forbes Article</h4> <ul> <li>Forbes reached out about pedestrian dragging <ul> <li>Cruise decides not to respond to avoid triggering a new media cycle <ul> <li>Cruise stops sharing video with media</li> </ul></li> </ul></li> </ul> <h2 id="iii-cruise-s-communications-with-regulators-city-officials-and-other-stakeholders">III. CRUISE’S COMMUNICATIONS WITH REGULATORS, CITY OFFICIALS, AND OTHER STAKEHOLDERS</h2> <h3 id="a-overview-of-cruise-s-initial-outreach-and-meetings-with-regulators">A. Overview of Cruise’s Initial Outreach and Meetings with Regulators</h3> <ul> <li>&quot;initial blurb&quot; drafted by 12:24am; Cruise not aware of dragging at the time</li> </ul> <h3 id="b-the-mayor-s-office-meeting-on-october-3">B. The Mayor’s Office Meeting on October 3</h3> <ul> <li>Meeting with Mayor’s Transportation Advisor Alexandra Sweet</li> <li>Cruise employee gave overview, then <abbr title="Director of Systems Integrity">Wood</abbr> played full video <ul> <li>This approach became the standard presentation from Cruise</li> </ul></li> <li>Full video was played twice by <abbr title="Director of Systems Integrity">Wood</abbr>, but there were connectivity issues</li> <li>Sweet apparently noticed that the vehicle moved again, but didn't ask about dragging or why vehicle moved again</li> </ul> <h3 id="c-cruise-s-disclosures-to-the-national-highway-traffic-safety-administration-nhtsa">C. Cruise’s Disclosures to the National Highway Traffic Safety Administration (NHTSA)</h3> <h4 id="1-cruise-s-initial-outreach-on-october-3">1.
Cruise’s Initial Outreach on October 3</h4> <ul> <li><b>10-03, <i>7:25am</i></b>: Cruise’s Head of Regulatory Engagement emailed NHTSA <ul> <li>NHTSA issues they wanted addressed included &quot;Whether the Cruise ADS or remote assistant could ascertain that the pedestrian was trapped under the vehicle or the location of a pedestrian on the ground&quot; and &quot;vehicle control dynamics (lateral and longitudinal) leading to the incident and following impact including ADS predicted path of the pedestrian and whether any crash avoidance or mitigation took place&quot; and video of accident</li> </ul></li> </ul> <h4 id="2-cruise-s-nhtsa-pre-meeting">2. Cruise’s NHTSA Pre-Meeting</h4> <ul> <li>Talking points for anticipated questions <ul> <li>Did you stop the fleet? <ul> <li><abbr title="Deputy General Counsel">Alicia Fenrick</abbr>: We have not changed the posture of the fleet.</li> <li>We have not identified a fault in AV response</li> </ul></li> <li>Why did the vehicle move after it had initially stopped? <ul> <li>[Not discussed] <abbr title="Director of Systems Integrity">Matthew Wood</abbr>: The impact triggered a collision detection and the vehicle is designed to pull over out of lane in</li> </ul></li> <li>Why didn't the vehicle brake in anticipation of the pedestrian in the road? <ul> <li><abbr title="Director of Systems Integrity">Matthew Wood</abbr>: I think the video speaks for itself, the pedestrian is well past our lane of travel into the other lane</li> <li><abbr title="Deputy General Counsel">Alicia Fenrick</abbr>: The pedestrian was clearly well past lane of travel of the AV. It would not be reasonable to expect that the other vehicle would speed up and proceed to hit the pedestrian, and then for the pedestrian to flip over the adjacent car and wind up in our lane.</li> </ul></li> </ul></li> <li>Excerpt of notes from an employee: <ul> <li>They requested a video-wait until the meeting at least. Then another question-where we end the video.</li> <li>Alicia: Biggest issue candidly. That we moved, and why, is something we are going to need to explain. The facts are what they are.</li> <li>Matt: why we moved, it is a collision response. Detected as a minor collision, so design response is a permissible lane pullover.</li> <li>How to reference this: triggered a collision detection and was designed to pull over out of lane. Do not qualify as minor collision, rather as a collision detection.</li> <li>Questions will be: it stopped and then proceeded forward.</li> <li>General premise: we are looking into this, we are doing a deep dive, we have done some preliminary analysis and this is what we have but it is only preliminary.</li> <li>Buckets: before impact, impact, after impact.</li> </ul></li> <li>Slack messages show discussion of when video should be sent and which video should be sent; decided to play full video to avoid being &quot;accused of hiding the ball&quot;</li> </ul> <h4 id="3-cruise-s-meeting-with-nhtsa-on-october-3">3.
Cruise’s Meeting with NHTSA on October 3</h4> <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr> played the full video 2 or 3 times, &quot;but it kept stopping or blacking- or whiting out because his home computer was having connectivity issues&quot;</li> <li>&quot;NHTSA did not see the Full Video clearly or in its entirety&quot;</li> <li>No discussion of pullover or dragging <ul> <li>Pre-meeting notes edited after meeting, adding &quot;[Not discussed]&quot; to this item</li> </ul></li> <li>Meeting notes of some questions asked by NHTSA <ul> <li>Could RA detect that pedestrian trapped? <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr>: Yes</li> </ul></li> <li>Sensors too? <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr>: Yes</li> </ul></li> <li>The statement &quot;the last thing you would want to do is move when a pedestrian is underneath&quot; appears to have been said, but recollections disagree on who said it. Some believe <abbr title="Director of Systems Integrity">Wood</abbr> said this and NHTSA concurred, some believe <abbr title="Director of Systems Integrity">Wood</abbr> said this and NHTSA repeated the statement, and some believe that NHTSA said this and <abbr title="Director of Systems Integrity">Wood</abbr> concurred</li> </ul></li> <li>Post-meeting slack discussion <ul> <li>Employee: &quot;I think we might need to mention the comment Matt made during the NHTSA call that the last thing you would want to do is move with a pedestrian under the car. From my notes and recollection Matt said 'As pedestrian is under vehicle last thing want to do is maneuver' and [the NHTSA regulator] agreed&quot;</li> <li>Another employee: &quot;lets see where the conversation goes. if it’s relevant, we should share it. That’s not the main point here though&quot;</li> <li>In other discussions, other employees and execs express varying levels of concern on the non-disclosure of pullover and dragging, from moderate to none (e.g., Senior Director of Federal Affairs says he &quot;stands by it ... [Cruise's employees] have gone beyond their regulatory requirements&quot;)</li> </ul></li> </ul> <h4 id="4-cruise-s-nhtsa-post-meeting-on-october-3">4. Cruise’s NHTSA Post-Meeting on October 3</h4> <ul> <li>NHTSA sent a request for video and Cruise uploaded full video</li> </ul> <h4 id="5-cruise-s-interactions-with-nhtsa-on-october-12-13-and-16">5. Cruise’s Interactions with NHTSA on October 12, 13, and 16</h4> <h5 id="a-october-12-call">a. October 12 Call</h5> <ul> <li>NHTSA regulator called Cruise employee and informed them that NHTSA was planning Preliminary Evaluation; employee sent the following to Cruise NHTSA team: <ul> <li>&quot;She shared that there was a lot of consternation in the front office about last week's incident. It is going to be a pretty broad investigation into how vehicles react to pedestrians out in the street and people in the roadway. But questions about last week's incident will be included in the IR questions and analysis. I offered an additional briefing about last week's incident, but she said that we were quite upfront and shared the video and told them everything they need to know.&quot;</li> <li>&quot;it [is] difficult to believe that they could find fault with our reaction to the pedestrian in Panini [Panini is the name of the specific AV] that would extend beyond asking us additional questions in a follow-up…&quot;</li> </ul></li> </ul> <h5 id="b-october-13-meeting">b.
October 13 Meeting</h5> <ul> <li>&quot;Despite the severe consequences that could result from a PE, including a recall, Cruise’s Chief Legal Officer and Senior Vice President of Government Affairs did not attend&quot;</li> <li>From meeting agenda: &quot;We’re just a little confused by this. We met with you about the Panini incident last week, and the team didn’t express any remaining concerns about it, even when asked if you had any additional concerns. Was there really remaining concern about AV behavior regarding Panini? If yes, why did they not request another briefing? We’ve been extremely cooperative with the Agency and have always provided information that the agency requested. What will be gained by this escalation that we are not already providing? Offer briefing on any of these topics in lieu of PE.&quot;</li> <li>Also planned to state: &quot;Regarding last week’s incident we briefed the agency within hours of the event, provided video, and offered repeatedly to share additional information, including around the topic of pedestrian safety broadly. None was requested, which makes us question the motivations behind opening a PE. PEs are punitive means to gather information, and are reputationally harmful, particularly in a nascent industry.&quot;</li> </ul> <h5 id="c-october-16-pe">c. October 16 PE</h5> <ul> <li>[no notes]</li> </ul> <h4 id="6-cruise-s-nhtsa-reports-regarding-the-october-2-accident">6. Cruise’s NHTSA Reports Regarding the October 2 Accident</h4> <ul> <li>NHTSA's SGO requires three written reports, including &quot;a written description of the pre-crash, crash, and post-crash details&quot;</li> <li>Cruise's first two reports did not mention pullover and dragging; after consultation with GM, third report did mention pullover and dragging</li> </ul> <h5 id="a-nhtsa-1-day-report">a. NHTSA 1-Day Report</h5> <ul> <li>Original draft forwarded from paralegal to Deputy General Counsel <abbr title="Deputy General Counsel">Alicia Fenrick</abbr>, Director of Communications Erik Moser, and Managing Legal Counsel <abbr title="Managing Legal Counsel">Andrew Rubenstein</abbr>: &quot;A Cruise autonomous vehicle (&quot;AV&quot;), operating in driverless autonomous mode, was at a complete stop in response to a red light on southbound Cyril Magnin Street at the intersection with Market Street. A dark colored Nissan Sentra was also stopped in the adjacent lane to the left of the AV. As the Nissan Sentra and the AV proceeded through the intersection after the light turned green, a pedestrian entered the crosswalk on the opposite side of Market Street across from the vehicles and proceeded through the intersection against a red light. The pedestrian passed through the AV's lane of travel but stopped mid-crosswalk in the adjacent lane. Shortly thereafter, the Nissan Sentra made contact with the pedestrian, launching the pedestrian in front of the AV. The AV braked aggressively but, shortly thereafter, made contact with the pedestrian. This caused no damage to the AV. The driver of the Nissan Sentra left the scene shortly after the collision. Police and Emergency Medical Services (EMS) were called to the scene. 
The pedestrian was transported by EMS.&quot; <ul> <li>LGTM'd [approved] by <abbr title="Deputy General Counsel">Fenrick</abbr> and Moser; <abbr title="Managing Legal Counsel">Rubenstein</abbr> said &quot;the GA folks have suggested some additional edits&quot;, which included adding that the pedestrian &quot;completely&quot; passed through the AV's lane of travel, changing &quot;launching&quot; to &quot;deflecting&quot;, and removing &quot;this caused no damage to the AV&quot;; no discussion of possible inclusion of pullover and dragging</li> </ul></li> <li>Cruise employee who established NHTSA reporting system believed that full details, including pullover and dragging, should've been included, but they were on vacation at the time</li> <li>In a later (<b>10-24</b>) employee Q&amp;A on the DMV suspension order, an employee asked &quot;Why was the decision made not to include the post-collision pull-over in the written report to the NHTSA? At least, this seems like it must have been an intentional decision, not an accidental oversight.&quot; <ul> <li><abbr title="Managing Legal Counsel">Rubenstein</abbr> drafted this prepared response for <abbr title="Deputy General Counsel">Fenrick</abbr>: &quot;The purpose of the NHTSA reporting requirement is to notify the agency of the occurrence of crashes. Consistent with that objective and our usual practice, our report notified NHTSA that the crash had occurred. Additionally, we had already met with NHTSA, including showing the full video to them, prior to submission of the report. That meeting was the result of our proactive outreach: we immediately reached out to NHTSA after the incident to set up a meeting to discuss with them. Our team met with NHTSA in the morning following the incident, including showing the full video to NHTSA. We then submitted the report and sent a copy of the full video later that afternoon.&quot;</li> <li><abbr title="Deputy General Counsel">Fenrick</abbr> LGTM'd the above, but the response ended up not being given</li> </ul></li> <li>Quinn Emanuel notes, &quot;It is difficult to square this rationale with the plain language of the NHTSA regulation itself, which requires “a written description of the pre-crash, crash, <i>and post-crash details….</i>” (emphasis added)&quot;</li> </ul> <h5 id="b-nhtsa-10-day-report">b. NHTSA 10-Day Report</h5> <ul> <li>Paralegal had full authority to determine if any new info or updates were necessary</li> <li>Paralegal asked three employees on slack, &quot;hi, checking in to see if there have been any updates to this incident? In particular, any status on the ped&quot; <ul> <li>An employee who interacts with law enforcement responded &quot;Unfortunately no. I’ve reached out to the investigating sergeant but have not received a response. This is probably due to other investigations he may be involved in&quot;</li> <li>This employee said that they were referring only to the pedestrian's medical condition, but the paralegal took the response more broadly</li> </ul></li> <li>The paralegal also checked the RINO database for updates and saw none, then filed the 10-day report, which states &quot;There are no updates related to this incident since the original submission on October 3, 2023&quot; and then repeats the narrative in the 1-day report, omitting discussion of pullover and dragging</li> </ul> <h5 id="c-nhtsa-30-day-report">c. NHTSA 30-Day Report</h5> <ul> <li>GM urged Cruise to be more comprehensive with 30-day report, so CLO <abbr title="Chief Legal Officer">Bleich</abbr> got involved.
<ul> <li><abbr title="Chief Legal Officer">Bleich</abbr> reviewed the 1-day and 10-day reports, and then followed up with &quot;[t]he most important thing now is simply to be complete and accurate in our reporting of this event to our regulators&quot;, says to include the pullover and dragging</li> <li><abbr title="Managing Legal Counsel">Rubenstein</abbr> objected to including dragging in 30-day report</li> </ul></li> </ul> <h4 id="7-conclusions-regarding-cruise-s-interactions-with-nhtsa">7. Conclusions Regarding Cruise’s Interactions with NHTSA</h4> <ul> <li>[no notes]</li> </ul> <h3 id="d-cruise-s-disclosures-to-the-department-of-motor-vehicles-dmv">D. Cruise’s Disclosures to the Department of Motor Vehicles (DMV)</h3> <h4 id="1-cruise-s-initial-outreach-to-the-dmv-and-internal-discussion-of-which-video-to-show">1. Cruise’s Initial Outreach to the DMV and Internal Discussion of Which Video to Show</h4> <ul> <li>&quot;<abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> wanted to focus solely on the Nissan’s role in causing the Accident and avoid showing the pedestrian’s injuries&quot;</li> <li><abbr title="SVP of Government Affairs">Estrada</abbr> to <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>, apparently concurring: &quot;Think we should get the clip of the video as <abbr title="CEO, CTO, President, and co-founder">Kyle</abbr> described to prepare to show it to policymakers ... show the impact and the person landing in front of us. Cut it there. That's all that is needed.&quot;</li> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> and <abbr title="Senior Director of Federal Affairs">Danko</abbr> disagreed and pushed for showing most complete video available</li> </ul> <h4 id="2-dmv-s-response-to-cruise-s-outreach">2. DMV’s Response to Cruise’s Outreach</h4> <ul> <li>[no notes]</li> </ul> <h4 id="3-cruise-s-dmv-pre-meeting">3. Cruise’s DMV Pre-Meeting</h4> <ul> <li>&quot;While Deputy General Counsel <abbr title="Deputy General Counsel">Fenrick</abbr> said she did not typically attend DMV meetings, she opted to attend this meeting in order to have some overlapping attendees between the NHTSA and DMV meetings. Notably, neither <abbr title="Chief Legal Officer">Bleich</abbr> nor <abbr title="SVP of Government Affairs">Estrada</abbr> attended the pre-meeting despite planning to meet in-person with the DMV Director to discuss the Accident.&quot;</li> </ul> <h4 id="4-cruise-s-october-3-meeting-with-the-dmv">4. Cruise’s October 3 Meeting with the DMV</h4> <h5 id="a-dmv-meeting-discussions">a. DMV Meeting Discussions</h5> <ul> <li>DMV regulators do not believe full video was played</li> <li>Cruise employees have different recollections, but many believe full video was played, likely with bad connectivity problems</li> <li>No discussion of pullover or dragging</li> </ul> <h5 id="b-cruise-s-post-dmv-meeting-reflections">b. Cruise’s Post-DMV Meeting Reflections</h5> <ul> <li>Slack discussion <ul> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>: thoughts?</li> <li><abbr title="Deputy General Counsel">Fenrick</abbr>: You mean DMV call? More aggressive than NHTSA . . .</li> <li>ACP - Not overly so but seemed a bit more critical and a little unrealistic. Like really we should predict another vehicle will hit and run and brake accordingly. 
I think they think they're expectations of anticipatory response is to other road users collisions was a bit off.</li> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>: They tend to ask insane hypotheticals. I was about to interrupt and say we can go through any number of hypos.. this is what happened but I was waiting for them to ask a followup question before I did it.</li> <li><abbr title="Deputy General Counsel">Fenrick</abbr>: insane hypothetical is absolutely right</li> <li>ACP - Bigger concern is that no regulator has really clued in that we moved after rolling over the pedestrian</li> </ul></li> <li>In another slack discussion, an employee stated &quot;the car moved and they didn’t ask and we’re kind of lucky they didn’t ask&quot; <ul> <li>Some employees indicate that this was the general consensus about the meeting</li> </ul></li> </ul> <h4 id="5-cruise-s-october-10-communications-with-dmv">5. Cruise’s October 10 Communications with DMV</h4> <ul> <li>DMV asked for video by <b>10-11</b>. Cruise did not do this, but showed a video in a meeting on <b>10-13</b></li> </ul> <h4 id="6-cruise-s-october-11-meeting-with-the-dmv">6. Cruise’s October 11 Meeting with the DMV</h4> <ul> <li>[no notes]</li> </ul> <h4 id="7-cruise-s-october-13-meeting-with-the-dmv">7. Cruise’s October 13 Meeting with the DMV</h4> <ul> <li>Cruise shared 9-minute 6-pane video created by <abbr title="Director of Systems Integrity">Wood</abbr> <ul> <li>&quot;Notably, the camera angles did not include the lower frontal camera angles that most clearly showed the AV’s impact with the pedestrian and pullover maneuver&quot;</li> </ul></li> <li>&quot;Interviewees said that the DMV’s tone in the meeting 'felt very mistrustful' and that it 'felt like something was not right here.'&quot; <ul> <li>DMV had questions about what appeared to be missing or misleading video</li> <li>In response to DMV's concerns and request, Cruise uploaded full video to DMV online portal</li> </ul></li> </ul> <h4 id="8-cruise-s-october-16-meeting-with-the-dmv">8. Cruise’s October 16 Meeting with the DMV</h4> <ul> <li>Meeting was scheduled for a different topic, but meeting moved to topic of DMV being misled about the accident; &quot;Cruise interviewees recalled that the DMV and CHP attendees were angry about the October 3 presentation, saying their collective memory was that they were not shown the Full Video&quot;</li> </ul> <h4 id="9-cruise-s-october-23-communications-with-the-dmv">9. Cruise’s October 23 Communications with the DMV</h4> <ul> <li>Cruise calls political consultant to have them find out why DMV has been silent on expansion of SF autonomous fleet <ul> <li>Consultant says DMV is &quot;pissed&quot; and considering revocation of Cruise's license to operate</li> </ul></li> <li>Internal disagreement on whether this could happen. &quot;<abbr title="SVP of Government Affairs">Estrada</abbr> then sent CLO <abbr title="Chief Legal Officer">Bleich</abbr> a Slack message indicating that he had talked to the DMV Director and there was '[n]o indication whatsoever that they are considering revoking.'&quot; <ul> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> checks with political consultant again, who repeats that DMV is very angry and may revoke</li> </ul></li> </ul> <h4 id="10-dmv-s-october-24-suspension-order">10. 
DMV’s October 24 Suspension Order</h4> <ul> <li><abbr title="SVP of Government Affairs">Estrada</abbr> calls DMV director Gordon to ask about suspension and is stonewalled</li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> joins the call and makes personal appeal, saying he's &quot;been committed to this since he was 13 to try and improve driver safety&quot;</li> <li>Appeal fails and suspension order is issued shortly afterwards</li> <li>Slack conversation <ul> <li><abbr title="SVP of Government Affairs">Estrada</abbr>: <abbr title="CEO, CTO, President, and co-founder">Kyle</abbr> leading our response that we provided &quot;full&quot; video and we will stand by that if it's a fight.</li> <li><abbr title="Chief Legal Officer">Bleich</abbr>: ACP- This will be a difficult fight to win. DMV and CHP have credibility and Steve Gordon seems to swear that he did not see the end of the video. The word of Cruise employees won't be trusted. I think we should bring in an outside firm to review the sequence of events and do an internal report since otherwise there is no basis for people to believe us. We should consider doing this and how to message it.</li> <li><abbr title="SVP of Government Affairs">Estrada</abbr>: Yes agree difficult and that we need to do it because we have facts, we can have sworn statements and data analytics on our side. Not a he said she said. We have proof. If we prove with facts a false statement that is important reputation saving.</li> <li>Steve stopped even trying to make this claim. He resorted to arguing we should have highlighted the pullover attempt. This is a big overreach by them to make a claim like this we have the ability to prove false.</li> </ul></li> </ul> <h4 id="11-post-october-24-dmv-communications">11. Post-October 24 DMV Communications</h4> <ul> <li><a href="https://web.archive.org/web/20231024184429/https://getcruise.com/news/blog/2023/a-detailed-review-of-the-recent-sf-hit-and-run-incident/">Vogt posted this blog post</a>, titled &quot;A detailed review of the recent SF hit-and-run incident&quot; <ul> <li>[The report only has an excerpt from the blog post, but for the same reason I think it's worth looking at the report in detail, I think it's worth looking at the blog post linked above; my read of the now-deleted blog post is that it attempts to place the blame on the &quot;hit and run&quot; driver, which is repeatedly emphasized; the blog post also appears to include a video of the simulation discussed above, where <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> says &quot;Should we run road to sim and see what the AV would have done if it was in the other vehicles position? I think that might be quite powerful&quot;]</li> <li>[The blog post does discuss the pullover and dragging, saying &quot;The AV detected a collision, bringing the vehicle to a stop; then attempted to pull over to avoid causing further road safety issues, pulling the individual forward approximately 20 feet&quot;]</li> </ul></li> </ul> <h4 id="12-conclusions-regarding-cruise-s-communications-with-the-dmv">12. Conclusions Regarding Cruise’s Communications with the DMV</h4> <ul> <li><abbr title="Chief Legal Officer">Bleich</abbr>: &quot;[T]he main concern from DMV was that our vehicle did not distinguish between a person and another object under its carriage originally, and so went into an MRC. Second, they felt that we should have emphasized the AV’s second movement right away in our first meeting. 
In fact, in the first meeting -- although we showed them the full video -- they (and we) were focused on confirming that we were not operating unsafely before the collision and we did not cause the initial contact with the pedestrian. They did not focus on the end of the video and -- because they did not raise it -- our team did not actively address it&quot;</li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr>: &quot;I am very much struggling with the fact that our GA team did not volunteer the info about the secondary movement with the DMV, and that during the handling of the event I remember getting inconsistent reports as to what was shared. At some point bad judgment call must have been made, and I want to know how that happened.&quot;</li> <li><abbr title="Chief Legal Officer">Bleich</abbr>: &quot;ACP -- I share your concern that the second movement wasn’t part of the discussion. I don’t know that there was a deliberate decision by the team that was doing the briefings. I believe they were still in the mode from the previous evening where they were pushing back against an assumption that we either were responsible for hitting the pedestrian or that we did not react fast enough when the pedestrian fell into our path. But as I’ve probed for basic information about what we shared and when I’ve had the same frustration that dates get pushed together or details are left out. I don’t know if this is deliberate, or people are simply having difficulty recalling exactly what they did or said during the immediate aftermath of that event.&quot;</li> <li>&quot;these Slacks convey that the three senior leaders of the company – the CEO, CLO, and COO – were not actively engaged in the regulatory response for the worst accident in Cruise’s history. Instead, they were trying to piece together what happened after the fact.&quot;</li> </ul> <h3 id="e-cruise-s-disclosures-to-the-sf-mta-sf-fire-department-and-sf-police">E. Cruise’s Disclosures to the SF MTA, SF Fire Department, and SF Police Department</h3> <ul> <li>After playing video, a government official asks &quot;this car moves with the woman underneath it, is that what we are seeing?&quot;, which results in a series of discussions about this topic</li> <li>Two of the four Cruise employees in the meeting report being shocked to see the pullover and dragging, apparently not realizing that this had happened</li> </ul> <h3 id="f-cruise-s-disclosures-to-the-california-public-utilities-commission-cpuc">F. Cruise’s Disclosures to the California Public Utilities Commission (CPUC)</h3> <h4 id="1-cruise-s-october-3-communications-with-the-cpuc">1. Cruise’s October 3 Communications with the CPUC</h4> <ul> <li>CPUC and Cruise disagree on whether or not there was an offer to play the full video</li> </ul> <h4 id="2-cpuc-s-october-5-data-request">2. CPUC’s October 5 Data Request</h4> <ul> <li>CPUC requests video by <b>10-19</b>; Cruise's standard policy was to respond on the last day, so video was sent on <b>10-19</b></li> </ul> <h4 id="3-cruise-s-october-19-response-to-cpuc-s-data-request">3. Cruise’s October 19 Response to CPUC’s Data Request</h4> <ul> <li>Video, along with the following summary: &quot;[T]he Nissan Sentra made contact with the pedestrian, deflecting the pedestrian in front of the AV. The AV biased rightward before braking aggressively but, shortly thereafter, made contact with the pedestrian.
The AV then attempted to achieve a minimal risk condition (MRC) by pulling out of the lane before coming to its final stop position. The driver of the Nissan Sentra left the scene shortly after the collision.&quot;</li> </ul> <h4 id="4-conclusions-regarding-cruise-s-disclosures-to-the-cpuc">4. Conclusions Regarding Cruise’s Disclosures to the CPUC</h4> <ul> <li>[no notes]</li> </ul> <h3 id="g-cruise-s-disclosures-to-other-federal-officials">G. Cruise’s Disclosures to Other Federal Officials</h3> <ul> <li>Cruise's initial outreach focused on conveying the accident had been caused by the hit-and-run Nissan driver</li> <li>After the DMV suspension on <b>10-24</b>, &quot;outreach focused on conveying the message that it believed it had worked closely with regulatory agencies such as the California DMV, CPUC, and NHTSA following the October 2 Accident&quot;</li> </ul> <h2 id="iv-the-aftermath-of-the-october-2-accident">IV. THE AFTERMATH OF THE OCTOBER 2 ACCIDENT</h2> <h3 id="a-the-cruise-license-suspension-by-the-dmv-in-california">A. The Cruise License Suspension by the DMV in California</h3> <ul> <li>Operating with human driver behind the wheel still allowed</li> </ul> <h3 id="b-the-nhtsa-pe-investigation-and-safety-recall">B. The NHTSA PE Investigation and Safety Recall</h3> <ul> <li>[no notes]</li> </ul> <h3 id="c-the-cpuc-s-show-cause-ruling">C. The CPUC’s “Show Cause Ruling”</h3> <ul> <li>[no notes]</li> </ul> <h3 id="d-new-senior-management-of-cruise-and-the-downsizing-of-cruise">D. New Senior Management of Cruise and the Downsizing of Cruise</h3> <ul> <li>[no notes]</li> </ul> <h2 id="v-summary-of-findings-and-conclusions">V. SUMMARY OF FINDINGS AND CONCLUSIONS</h2> <ul> <li>&quot;By the time Cruise employees from legal, government affairs, operations, and systems integrity met with regulators and other government officials on October 3, they knew or should have known that the Cruise AV had engaged in a pullover maneuver and dragged the pedestrian underneath the vehicle for approximately 20 feet&quot; <ul> <li>&quot;Cruise’s passive, non-transparent approach to its disclosure obligations to its regulators reflects a basic misunderstanding of what regulatory authorities need to know and when they need to know it&quot;</li> </ul></li> <li>&quot;Although neither Cruise nor Quinn Emanuel can definitively establish that NHTSA or DMV were shown the entirety of the Full Video, including the pullover maneuver and dragging, the weight of the evidence indicates that Cruise attempted to play the Full Video in these meetings; however, internet connectivity issues impeded or prevented these regulators from seeing the video clearly or fully.&quot; <ul> <li>&quot;in the face of these internet connectivity issues that caused the video to freeze or black- or white-out, Cruise employees remained silent, failing to ensure that the regulators understood what they likely could not see – that the Cruise AV had moved forward again after the initial impact, dragging the pedestrian underneath the vehicle&quot;</li> </ul></li> <li>&quot;Even if, as some Cruise employees stated, they were unaware of the pullover maneuver and pedestrian dragging at the time of certain regulatory briefings (which itself raises other concerns), Cruise leadership and other personnel were informed about the full details of the October 2 Accident during the day on October 3 and should have taken corrective action.&quot;</li> <li>&quot;While Cruise employees clearly demonstrated mistakes of judgment and failure to appreciate the 
importance of transparency and accountability, based on Quinn Emanuel’s review to date, the evidence does not establish that Cruise employees sought to intentionally mislead government regulators about the October 2 Accident, including the pullover maneuver and pedestrian dragging&quot;</li> <li>&quot;Cruise’s senior leadership repeatedly failed to understand the importance of public trust and accountability&quot;</li> <li>&quot;Cruise’s response to the October 2 Accident reflects deficient leadership at the highest levels of the Company—including among some members of the C-Suite, legal, governmental affairs, systems integrity, and communications teams—that led to a lack of coordination, mistakes of judgment, misapprehension of regulatory requirements and expectations, and inconsistent disclosures and discussions of material facts at critical meetings with regulators and other government officials. The end result has been a profound loss of public and governmental trust and a suspension of Cruise’s business in California&quot; <ul> <li>&quot;There was no captain of the ship. No single person or team within Cruise appears to have taken responsibility to ensure a coordinated and fully transparent disclosure of all material facts regarding the October 2 Accident to the DMV, NHTSA, and other governmental officials. Various members of the SLT who had the responsibility for managing the response to this Accident were missing-in-action for key meetings, both preparatory and/or with the regulators. This left each Cruise team to prepare for the meetings independently, with different employees attending different regulatory meetings, and with no senior Cruise official providing overall direction to ensure consistency in approach and disclosure of all material facts.&quot;</li> <li>&quot;There was no demonstrated understanding of regulatory expectations by certain senior Cruise management or line employees&quot;</li> <li>&quot;Cruise’s deficient regulatory response to the October 2 Accident reflects preexisting weaknesses in the Company, including ineffectual Cruise leadership with respect to certain senior leaders. Two out of many examples illustrate these weaknesses.&quot; <ul> <li>No coordinated or rigorous process for what needed to be discussed with DMV, NHTSA, etc., nor did leadership or employees in meetings take steps to ensure they were informed of what had happened before the meetings (such as asking their direct reports for updates); &quot;To underscore Cruise’s lack of coordination in its briefings to regulators and other government officials on October 3, senior leadership never convened a meeting of the various teams to discuss and learn how these meetings went, what questions were asked, and what discussions took place. Had they done so, they should have realized that in only one of the four meetings did government officials ask questions about the pullover maneuver and pedestrian dragging, requiring corrective action&quot;</li> <li>&quot;Cruise lawyers displayed a lack of understanding of what information must be communicated to NHTSA in these reports, and misapprehended the NHTSA requirement ... Cruise leadership gave a paralegal the primary responsibility for preparing and filing such reports with the Cruise legal department exercising little oversight&quot;</li> </ul></li> </ul></li> </ul> <h2 id="vi-recommendations">VI. 
RECOMMENDATIONS</h2> <ul> <li>New senior leadership</li> <li>Consider creating a dedicated, cross-disciplinary Regulatory Team which understands regulations, has experience dealing with regulators, and proactively improves Cruise's regulatory reporting processes and systems, reporting directly to CEO with board oversight</li> <li>Training for remaining senior leadership</li> <li>Create a streamlined Crisis Management Team <ul> <li>200 people in a war room can't manage a crisis; also need to have a &quot;captain&quot; or someone in charge</li> </ul></li> <li>Review incident response protocol and ensure that it is followed</li> <li>&quot;There is a need to reform the governmental affairs, legal, and public communications functions within Cruise&quot;</li> <li>&quot;Cruise should file its reports about any accident involving a Cruise vehicle with regulators by having a designated Chief Safety Officer or senior engineer, as well as a regulatory lawyer, within Cruise review and approve the filing of each report&quot;</li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li>The report by Exponent, mentioned above, is included in the Appendix. It is mostly redacted, although there is a lot of interesting non-redacted content, such as &quot;the collision detection system incorrectly identified the pedestrian as being located on the side of the AV at the time of impact instead of in front of the AV and thus determined the collision to be a side impact ... The determination by the ADS that a side collision occurred, and not a frontal collision, led to a less severe collision response being executed and resulted in the AV performing the subsequent outermost lane stop maneuver instead of an emergency stop ... The root cause of the AV’s post-collision movement, after the initial brief stop, was the inaccurate determination by the ADS that a side collision had occurred ... the inaccuracy of the object track considered by the collision detection system and the resulting disparity between this track and the pedestrian’s actual position, the ADS failed to accurately determine the location of the pedestrian at the time of impact and while the pedestrian was underneath the vehicle&quot;</li> </ul> <h2 id="back-to-danluu-com">back to danluu.com</h2> <p>I don't have much to add to this. I certainly have opinions, but I don't work in automotive and haven't dug into it enough to feel informed enough to add my own thoughts. In one discussion I had with a retired exec who used to work on autonomous vehicles, on incident management at Cruise vs. tech companies like Twitter or Slack, the former exec said:</p> <blockquote> <p>You get good at incidents given a steady stream of incidents of varying severity if you have to handle the many small ones. You get terrible at incidents if you can cover up the small ones until a big one happens.
So it's not only funny but natural for internet companies to do it better than AV companies I think</p> </blockquote> <p>On the &quot;minimal risk condition&quot; pullover maneuver, this exec said:</p> <blockquote> <p>These pullover maneuvers are magic pixie dust making AVs safe: if something happens, we'll do a safety pullover maneuver</p> </blockquote> <p>And on <a href="https://web.archive.org/web/20231024184429/https://getcruise.com/news/blog/2023/a-detailed-review-of-the-recent-sf-hit-and-run-incident/">the now-deleted blog post, &quot;A detailed review of the recent SF hit-and-run incident&quot;</a>, the exec said:</p> <blockquote> <p>Their mentioning of regulatory ADAS test cases does not inspire confidence; these tests are shit. But it's a bit unfair on my part since of course they would mention these tests, it doesn't mean they don't have better ones</p> </blockquote> <p>On how regulations and processes make safety-critical industries safer, and what you'd do if you cared about safety vs. the recommendations in the report, this exec said:</p> <blockquote> <p>[Dan,] you care about things being done right. People in these industries care about compliance. Anything &quot;above the state of the art&quot; buys you zero brownie points. eg for [X], any [Y] ATM are not required at all. [We] are better at [X] than most and it does nothing for compliance ... OTOH if a terrible tool or process exists that does nothing good but is considered &quot;the state of the art&quot; / is mandated by a standard, you sure as hell are going to use it</p> </blockquote> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <b><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl"> If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer</a></b>. I'm an investor, so you should consider my potential bias, but they seem to have found product-market fit and are growing extremely quickly (revenue-wise)</p> <p><i> Thanks to an anonymous former AV exec, Justin Blank, and 5d22b for comments/corrections/discussion.</i></p> <h2 id="appendix-a-physical-hardware-curiosity">Appendix: a physical hardware curiosity</h2> <p>One question I had for the exec mentioned above, which wasn't relevant to this case, but is something I've wondered about for a while, is why the AVs that I see driving don't have upgraded tires and brakes. You can get much shorter stopping distances from cars that aren't super heavy by upgrading their tires and brakes, but the AVs I've seen have not had this done.</p> <p>In this case, we can't do the exact comparison from an upgraded vehicle to the base vehicle because the vehicle dynamics data was redacted from section 3.3.3, table 9, and figure 40 of the appendix, but it's common knowledge that the simplest safety upgrade you can make on a car is upgrading the tires (and, if relevant, the brakes).
One could argue that this isn't worth the extra running cost, or the effort (for the low-performance cars that I tend to see converted into AVs, getting stopping distances equivalent to a sporty vehicle would generally require modifying the wheel well so that wider tires don't rub) but, as an outsider, I'd be curious to know what the cost-benefit trade-off on shorter stopping distances is.</p> <p>They hadn't considered it before, but thought that better tires and brakes would make a difference in a lot of other cases and prevent accidents, and explained the lack of this upgrade by:</p> <blockquote> <p>I think if you have a combination of &quot;we want to base AV on commodity cars&quot; and &quot;I am an algorithms guy&quot; mindset you will not go look at what the car should be.</p> </blockquote> <p>And, to be clear, upgraded tires and brakes would not have changed the outcome in this case. The timeline from the Exponent report has</p> <ul> <li><strong>-2.9s</strong>: contact between Nissan and Pedestrian</li> <li><strong>-2s</strong>: Pedestrian track dropped</li> <li><strong>-1.17s</strong>: Pedestrian begins separating from Nissan</li> <li><strong>-0.81s</strong>: [redacted]</li> <li><strong>-0.78s</strong>: Pedestrian lands in AV's travel lane</li> <li><strong>-0.41s</strong>: Collision checker predicts collision</li> <li><strong>-0.25s</strong>: AV starts sending braking and steering commands (<code>19.1 mph</code>)</li> <li><strong>0s</strong>: collision (<code>18.6 mph</code>)</li> </ul> <p>Looking at actual accelerometer data from a car with upgraded tires and brakes, stopping time from <code>19.1mph</code> for that car was around <code>0.8s</code>, so this wouldn't have made much difference in this case. If brakes aren't pre-charged before attempting to brake, there's significant latency when initially braking, such that <code>0.25s</code> isn't enough for almost any braking to have occurred, which we can see from the speed only being <code>0.5mph</code> slower in this case.</p> <p>Another comment from the exec is that, while a human might react to the collision at <code>-2.9s</code> and slow down or stop, &quot;scene understanding&quot; as a human might do it is non-existent in most or perhaps all AVs, so it's unsurprising that the AV doesn't react until the pedestrian is in the AV's path, whereas a human, if they noticed the accident in the adjacent lane, would likely drastically slow down or stop (the exec guessed that most humans would come to a complete stop, whereas I guessed that most humans would slow down). The exec was also not surprised by the <code>530ms</code> latency between the pedestrian landing in the AV's path and the AV starting to attempt to apply the brakes although, as a layperson, I found <code>530ms</code> surprising.</p> <p>On the advantage of AVs and ADAS, as implemented today, compared to a human who's looking in the right place, paying attention, etc., the exec said:</p> <blockquote> <p>They mainly never get tired or drink and hopefully also run in that terrible driver's car in the next lane. For [current systems], it's reliability and not peak performance that makes it useful. Peak performance is definitely not superhuman but subhuman</p> </blockquote> Why do people post on [bad platform] instead of [good platform]?
why-video/ Thu, 25 Jan 2024 00:00:00 +0000 why-video/ <p>There's a class of comment you often see when someone makes a popular thread on Mastodon/Twitter/Threads/etc., and that you also see on videos, which is basically &quot;Why make a Twitter thread? This would be better as a blog post&quot; or &quot;Why make a video? This would be better as a blog post&quot;. But, these comments are often stronger in form, such as:</p> <blockquote> <p><a href="https://news.ycombinator.com/item?id=31311904">I can't read those tweets that span pages because the users puts 5 words in each reply. I find common internet completely stupid: Twitter, tiktok, Instagram, etc. What a huge waste of energy</a>.</p> </blockquote> <p>or</p> <blockquote> <p><a href="https://news.ycombinator.com/item?id=31312967">When someone chooses to blog on twitter you know it's facile at best, and more likely simply stupid (as in this case)</a></p> </blockquote> <p>These kinds of comments are fairly common, e.g., I pulled up Foone's last 10 Twitter threads that scored 200 points or more on HN and 9 out of 10 had comments like this, complaining about the use of Twitter.</p> <p>People often express bafflement that anyone could have a reason for using [bad platform], such as in &quot;<a href="https://news.ycombinator.com/item?id=33234817">how many tweets are there just to make his point? 200? nobody thinks 'maybe this will be more coherent on a single page'? I don't get social media</a>&quot; or &quot;<a href="https://news.ycombinator.com/item?id=31723037">Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there? ... objectively speaking it is more work</a>&quot;.</p> <p>Personally, I don't really like video as a format and, for 95% of YouTube videos that I see, I'd rather get the information as a blog post than a video (and this will be even more true if Google really cracks down on ad blocking) and I think that, for a reader who's interested in the information, long-form blog posts are basically strictly better than long threads on [bad platform]. But I also recognize that much of the content that I want to read wouldn't exist at all if it wasn't for things like [bad platform].</p> <p>Stepping back and looking at the big picture, there are four main reasons I've seen that people use [bad platform], which are that it gets more engagement, it's where their friends are, it's lower friction, and it monetizes better.</p> <h3 id="engagement">Engagement</h3> <p>The engagement reason is the simplest, so let's look at that first. Just looking at where people spend their time, short-form platforms like Twitter, Instagram, etc., completely dominate longer-form platforms like Medium, Blogspot, etc.; you can see this in the valuations of these companies, in survey data, etc. Substack is the hottest platform for long-form content and its last valuation was ~$600M, basically a rounding error compared to the value of short-form platforms (I'm not including things like Wordpress or Squarespace, which derive a lot of their valuation from things other than articles and posts). The money is following the people and people have mostly moved on from long-form content.
And if you talk to folks using Substack about where their readers and growth come from, the answer is platforms like Twitter, so people doing long-form content who optimize for engagement or revenue will still produce a lot of short-form content<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">1</a></sup>.</p> <h3 id="friends">Friends</h3> <p>The friends reason is probably the next simplest. A lot of people are going to use whatever people around them are using. Realistically, if I were ten years younger and started doing something online in 2023 instead of 2013, more likely than not, I would've tried streaming before I tried blogging. But, as an old, out-of-touch person, I tried starting a blog in 2013 even knowing that blogging was a dying medium relative to video. It seems to have worked well enough for me, so I've stuck with it, but this seems generational. While there are people older than me who do video and people younger than me who write blogs, looking at the distribution of ages, I'm not all that far from the age where people overwhelmingly moved to video and if I were really planning to do something long-term instead of just doing <a href="writing-non-advice/#friction">the lowest friction thing when I started</a>, I would've started with video. Today, doing video is natural for folks who are starting to put their thoughts online.</p> <h3 id="friction">Friction</h3> <p>When [bad platform] is a microblogging platform like Twitter, Mastodon, Threads, etc., the friends reason still often applies — people on these platforms are frequently part of a community they interact with, and it makes more sense for them to keep their content on the platform full of community members than to put content elsewhere. But the bigger reason for people whose content is widely read is that a lot of people find these platforms are much lower friction than writing blog posts. When people point this out, [bad platform] haters are often baffled, responding with things like</p> <blockquote> <p>Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there? ... objectively speaking it is more work</p> </blockquote> <p>For one thing, most widely read programmer/tech bloggers that I'm in touch with use platforms that are actually higher friction (e.g., <a href="https://twitter.com/danluu/status/1204576631890698240">Jekyll</a> <a href="https://twitter.com/danluu/status/1244050309648797697">friction</a> and <a href="https://mastodon.social/@danluu/109542037575670321">Hugo</a> <a href="https://twitter.com/danluu/status/1244050309648797697">fric</a><a href="https://commaok.xyz/post/on_hugo/">tion</a>). But, in principle, they could use Substack, hosted Wordpress, or another platform that this commenter considers &quot;objectively&quot; lower friction; the complaint fundamentally misunderstands where the friction comes from. When people talk about [bad platform] being lower friction, it's usually about the emotional barriers to writing and publishing something, not the literal number of clicks it takes to publish something.
We can argue about whether or not this is rational, whether this &quot;objectively&quot; makes sense, etc., but at the end of the day, it is simply true that many people find it mentally easier to write on a platform where you write short chunks of text instead of a single large chunk of text.</p> <p>I <a href="https://mastodon.social/@danluu/">sometimes write things on Mastodon</a> because it feels like the right platform for some kinds of content for me. Of course, since the issue is not the number of clicks it takes and there's some underlying emotional motivation, other people have different reasons. For example, <a href="https://nitter.net/Foone/status/1066548390517854208">Foone says</a>:</p> <blockquote> <p>Not to humblebrag or anything, but my favorite part of getting posted on hackernews or reddit is that EVERY SINGLE TIME there's one highly-ranked reply that's &quot;jesus man, this could have been a blog post! why make 20 tweets when you can make one blog post?&quot;</p> <p>CAUSE I CAN'T MAKE A BLOG POST, GOD DAMN IT. I have ADHD. I have bad ADHD that is being treated, and the treatment is NOT WORKING TERRIBLY WELL. I cannot focus on writing blog posts. it will not happen</p> <p>if I try to make a blog post, it'll end up being abandoned and unfinished, as I am unable to edit it into something readable and postable. so if I went 100% to blogs: You would get: no content I would get: lots of unfinished drafts and a feeling of being a useless waste</p> <p>but I can do rambly tweet threads. they don't require a lot of attention for a long time, they don't have the endless editing I get into with blog posts, I can do them. I do them a bunch! They're just rambly and twitter, which some people don't like</p> </blockquote> <p>The issue Foone is referring to isn't even uncommon — three of my favorite bloggers have mentioned that they can really only write things in one sitting, so either they have enough momentum to write an entire blog post or they don't. There's a difference in scale between only being able to get yourself to write a tweet at a time and only being able to write what you can fit into a single writing session, but these are differences in degree, not differences in kind.</p> <h3 id="revenue">Revenue</h3> <p>And whatever the reason someone has for finding [bad platform] lower friction than [good platform], allowing people to use a platform that works for them means we get more content. When it comes to video, the same thing also applies because <a href="https://twitter.com/danluu/status/1588318512258650112">video monetizes so much better than text</a> and there's a lot of content that monetizes well on video that probably wouldn't monetize well in text.</p> <p>To pick an arbitrary example, automotive content is one of these areas. If you're buying a car and you want detailed, practical reviews about a car, as well as comparisons to other cars you might consider if you're looking at a particular car, before YouTube, AFAIK, no one was doing anything close to the depth of what <a href="https://www.youtube.com/@AAutoBuyersGuide">Alex Dykes does on Alex on Autos</a>. If you open up a car magazine from the heyday of car magazines, something like Car and Driver or Road and Track from 1997, there's nothing that goes into even 1/10th of the depth that Alex does and this is still true today of modern car magazines.
The same goes for quite a few sub-categories of automotive content as well, such as <a href="https://www.youtube.com/@tyrereviews">Jonathan Benson's work on Tyre Reviews</a>. Before Jonathan, no one was testing tires with the same breadth and depth and writing it up (engineers at tire companies did this kind of testing and much more, but you had to talk to them directly to get the info)<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">2</a></sup>. You can find similar patterns in a lot of areas outside of automotive content as well. While this depends on the area, in many cases, the content wouldn't exist if it weren't for video. Not only do people, in general, have more willingness to watch videos than to read text, but video also monetizes much better than text does, which allows people to make providing in-depth information their job in a way that wouldn't be possible in text. In some areas, you can make good money with a paywalled newsletter, but this is essentially what car magazines are and they were never able to support anything resembling what Alex Dykes does, nor does it seem plausible that you could support something like what Jonathan Benson does on YouTube.</p> <p>Or, to pick an example from the tech world, shortly after Lucy Wang created her YouTube channel, <a href="https://www.youtube.com/@TechwithLucy">Tech With Lucy</a>, when she had 50k subscribers and her typical videos had thousands to tens of thousands of views with the occasional video with a hundred thousand views, <a href="https://twitter.com/danluu/status/1589775359243091969">she noted that she was making more than she did working for AWS (with most of the money presumably coming in from sponsorships)</a>. By comparison, my blog posts all get well over a million hits and I definitely don't make anywhere near what Lucy made at AWS; instead, my blog barely covers my rent. It's possible to monetize some text decently well if you put most of it behind a paywall, e.g., <a href="https://blog.pragmaticengineer.com/">Gergely Orosz does this with his newsletter</a>, but if you want to have mostly or exclusively freely available content, video generally dominates text.</p> <h3 id="non-conclusion">Non-conclusion</h3> <p>While I would prefer that most content that I see on YouTube/Twitter/Threads/Mastodon/etc. were hosted on a text blog, the reality is that most of that content wouldn't exist at all if it had to be written up as long-form text instead of as chunked-up short-form text or video. Maybe in a few years, summary tools will get good enough that I can consume the translations but, today, all the tools I've tried often get key details badly wrong, so we just have to live with the content in the form it's created in.</p> <p><b><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl">If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer.
I'm an investor in the company, so you should take this with the usual grain of salt, but if you're looking to join a fast-growing early-stage startup, they seem to have found product-market fit and have been growing extremely quickly (revenue-wise).</a></b></p> <p><i>Thanks to Heath Borders, Peter Bhat Harkins, James Young, Sophia Wisdom, and David Kok for comments/corrections/discussion.</i></p> <h3 id="appendix-elsewhere">Appendix: Elsewhere</h3> <ul> <li><a href="https://www.ftrain.com/wwic">Paul Ford's WWIC (Why Wasn't I Consulted)</a> is a more general version of this post</li> </ul> <p>Here's a comment from David Kok, from a discussion about a rant by an 80-year-old bridge player about why bridge is declining, where the 80-year-old claimed that the main reason is that IQ has declined and young people (as in, people who are 60 and below) are too stupid to play intellectual games like bridge; many other bridge players concurred:</p> <blockquote> <p>Rather than some wrong but meaningful statement about age groups I always just interpret statements like &quot;IQ has gone down&quot; as &quot;I am unhappy and have difficulty expressing that&quot; and everybody else going &quot;Yes so am I&quot; when they concur.</p> </blockquote> <p>If you adapt David Kok's comment to complaints about why something isn't a blog post, that's a meta reason that the reasons I gave in this post are irrelevant (to some people) — these reasons only matter to people who care about the reasons; if someone is just venting their feelings and the reasons they're giving are an expression of their feelings and not meant to be legitimate reasons, the reasons someone might not write a blog post are irrelevant.</p> <p>Anyway, the topic of why post there instead of here is a common enough topic that I'm sure other people have written things about it that I'd be interested in reading. Please feel free to <a href="https://mastodon.social/@danluu/">forward other articles you see on the topic to me</a>.</p> <h3 id="appendix-hn-comments-on-foone-s-last-10-twitter-threads">Appendix: HN comments on Foone's last 10 Twitter threads.</h3> <p>I looked up Foone's last 10 Twitter threads that made it to HN with 200+ points, and 9 out of 10 have complaints about why Foone used Twitter and how it would be better as a blog post. [This is not including comments of the form &quot;For those who hate Twitter threads as much as I do: <a href="https://threadreaderapp.com/thread/1014267515696922624.html">https://threadreaderapp.com/thread/1014267515696922624.html</a>&quot;, of which there are more than comments like the ones below, which have a complaint but also have some potentially useful content, like a link to another version of the thread.]</p> <h4 id="never-trust-a-system-that-seems-to-be-working-https-news-ycombinator-com-item-id-33233720"><a href="https://news.ycombinator.com/item?id=33233720">Never trust a system that seems to be working</a></h4> <p>One of the first comments was a complaint that it was on Twitter, which was followed not too long after by</p> <blockquote> <p>how many tweets are there just to make his point? 200? nobody thinks &quot;maybe this will be more coherent on a single page&quot;?
I don't get social media</p> </blockquote> <h4 id="someday-aliens-will-land-and-all-will-be-fine-until-we-explain-our-calendar-https-news-ycombinator-com-item-id-32975173"><a href="https://news.ycombinator.com/item?id=32975173">Someday aliens will land and all will be fine until we explain our calendar</a></h4> <blockquote> <p>This would be better written in a short story format but I digress.</p> <p>shit like this is too good and entertaining to be on twitter [one of the few positive comments complaining about this]</p> <p>This person hates it so much whenever there is a link to their content on this site, they go on huge massive rants about it with threads spamming as much as the OP, it's hilarious.</p> </blockquote> <h4 id="you-want-to-know-something-about-how-bullshit-insane-our-brains-are-https-news-ycombinator-com-item-id-32303786"><a href="https://news.ycombinator.com/item?id=32303786">You want to know something about how bullshit insane our brains are?</a></h4> <blockquote> <p>They'll tolerate reading it on twitter?</p> <p>Serious question : why do publishers break down their blog posts into umpteen tweeted microblogs? Do the engagement web algorithms give preference to the number of tweets in a thread? I see this is becoming more of a trend</p> <p>This is a very interesting submission. But, boy, is Twitter's character limit poisonous.</p> <p>IMO Foone's web presence is toxic. Rather than write a cogent article posted on their blog and then summarize a pointer to that post in a single tweet, they did the opposite writing dozens of tweets as a thread and then summarizing those tweets in a blog post. This is not a web trend I would like to encourage but alas it is catching on.</p> <p>Oh, I don't care how the author writes it, or whether there's a graph relationship below (or anything else). It's just that Twitter makes the experience of reading content like that a real chore.</p> </blockquote> <h4 id="reverse-engineering-skifree-https-news-ycombinator-com-item-id-31718756"><a href="https://news.ycombinator.com/item?id=31718756">Reverse engineering Skifree</a></h4> <blockquote> <p>This should have been a blog or a livestream.</p> <p>Even in this format?</p> <p><rant> I genuinely don't get it. It's a pain in the ass for them to publish it like that and it's a pain in the ass for us to read it like that. I hope Musk takes over Twitter and runs it the ground so we can get actual blog posts back. </rant></p> </blockquote> <p>Someone points out that Foone has noted that they find writing long-form stuff impossible and can write in short-form media, to which the response is the following:</p> <blockquote> <p>Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there?</p> <p>Obviously that's their prerogative and they can do whatever they want but objectively speaking it is more work and I sincerely hope the trend will die.</p> </blockquote> <h4 id="everything-with-a-battery-should-have-an-off-switch-https-news-ycombinator-com-item-id-31287112"><a href="https://news.ycombinator.com/item?id=31287112">Everything with a battery should have an off switch</a></h4> <blockquote> <p>You forgot, foone isn't going to change from streams of Twitter posts to long form blogging. [actually a meta comment on how people always complain about this and not a complaint, I think]</p> <p>I can't read those tweets that span pages because the users puts 5 words in each reply. 
I find common internet completely stupid: Twitter, tiktok, Instagram, etc. What a huge waste of energy.</p> <p>He clearly knows [posting long threads on Twitter] is a problem, he should fix it.</p> </blockquote> <p>Someone points out that Foone has said that they're unable to write long-form blog posts, to which the person replies:</p> <blockquote> <p>You can append to a blog post as you go the same way you can append to a Twitter feed. It's functionally the same, the medium just isn't a threaded hierarchy. There's no reason it has to be posted fully formed as he declares.</p> <p>My own blog posts often have 10+ revisions after I've posted them.</p> <p>It doesn't work well for thousands of people, which is why there are always complaints ... When something is suboptimal, you're well within your rights to complain about it. Posting long rants as Twitter threads is suboptimal for the consumers of said threads</p> <p>I kind of appreciate the signal: When someone chooses to blog on twitter you know it's facile at best, and more likely simply stupid (as in this case)</p> </blockquote> <h4 id="there-s-an-arm-cortex-m4-with-bluetooth-inside-a-covid-test-kit-https-news-ycombinator-com-item-id-29698887"><a href="https://news.ycombinator.com/item?id=29698887">There's an ARM Cortex-M4 with Bluetooth inside a Covid test kit</a></h4> <p>Amazingly, no complaint that I could see, although one comment was edited to be &quot;.&quot;</p> <h4 id="taking-apart-the-2010-fisher-price-re-released-music-box-record-player-https-news-ycombinator-com-item-id-29048819"><a href="https://news.ycombinator.com/item?id=29048819">Taking apart the 2010 Fisher Price re-released Music Box Record Player</a></h4> <blockquote> <p>why is this a twitter thread? why not a blog?</p> </blockquote> <p>Followed by</p> <blockquote> <p>I love that absolutely no one got the joke ... Foone is a sociopath who doesn't feel certain words should be used to refer to Foone because they don't like them. In fact no one should talk about Foone ever.</p> </blockquote> <h4 id="while-posting-to-tumblr-e-and-w-keys-just-stopped-working-https-news-ycombinator-com-item-id-28601684"><a href="https://news.ycombinator.com/item?id=28601684">While posting to Tumblr, E and W keys just stopped working</a></h4> <blockquote> <p>Just hotkey detection gone wrong. Not that big of a surprise because implementing hotkeys on a website is a complete minefield. I don't think you can conclude that Tumblr is badly written from this. Badly tested maybe.</p> </blockquote> <p>Because that comment reads like nonsense to anyone who read the link, someone asks &quot;did you read the whole thread?&quot;, to which the commenter responds:</p> <blockquote> <p>No because Twitter makes it completely unreadable.</p> </blockquote> <h4 id="my-mouse-driver-is-asking-for-a-firewall-exemption-https-news-ycombinator-com-item-id-28274305"><a href="https://news.ycombinator.com/item?id=28274305">My mouse driver is asking for a firewall exemption</a></h4> <blockquote> <p>Can we have twitter banned from being posted here? On all UI clicks, a nagging window comes up. 
You can click it away, but it <em>reverts your click</em>, so any kind of navigation becomes really cumbersome.</p> <p>or twitter urls being replaced with some twitter2readable converter</p> </blockquote> <h4 id="duke-nukem-3d-mirror-universe-https-news-ycombinator-com-item-id-26513877"><a href="https://news.ycombinator.com/item?id=26513877">Duke Nukem 3D Mirror Universe</a></h4> <blockquote> <p>This is remarkable, but Twitter is such an awful medium for this kind of text. I wish this was posted on a normal platform so I could easily share it.</p> <p>If this were a blog post instead of a pile of tweets, we wouldn't have to expand multiple replies to see all of the content</p> <p>Uh why isn't this a blog, or a youtube video? <em>specifically to annoy foone</em></p> <p>Yes, long form Twitter is THE WORST. However foone is awesome, so maybe they cancel each other out?</p> <p>I hate twitter. It's slowly ruining the internet.</p> </blockquote> <h4 id="non-foone-posts">Non-foone posts</h4> <p>Of course this kind of thing isn't unique to Foone. For example, on the last Twitter thread I saw on HN, two of the first five comments were:</p> <blockquote> <p>Has this guy got a blog?</p> </blockquote> <p>and</p> <blockquote> <p>That's kind of why the answer to &quot;posting something to X&quot; should be &quot;just say no&quot;. It's impossible to say anything there that is subtle in the slightest or that requires background to understand but unfortunately people who are under the spell of X just can't begin to see something they do the way somebody else might see it.</p> </blockquote> <p>I just pulled up Foone's threads because I know that they tend to post to short-form platforms and looking at 10 Foone threads is more interesting than looking at 10 random threads.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:O"><p>Of course, almost no one optimizes for revenue because most people don't make money off of the content they put out on the internet. And I suspect only a tiny fraction of people are consciously optimizing for engagement, but <a href="https://www.patreon.com/posts/25835707">just like we saw</a> <a href="https://twitter.com/danluu/status/1594798257485795328">with prestige</a>, there seems to be a lot of nonconscious optimization for engagement. A place where you can see this within a platform is (and I've looked at hundreds of examples of this) when people start using a platform like Mastodon or Threads. They'll post a lot of different kinds of things. Most things won't get a lot of traction and a few will. They could continue posting the same things, but they'll often, instead, post less low-engagement content over time and more high-engagement content over time. Platforms have a variety of ways of trying to make it rewarding when other people engage with your content and, on average, this seems to work on people. This is an intra-platform and not an inter-platform example, but if this works on people, it seems like the inter-platform reasoning should hold as well.</p> <p>Personally, I'm not optimizing for engagement or revenue, but I've been paying my rent from <a href="https://patreon.com/danluu">Patreon earnings</a>, so it would probably make sense to do so. But, at least at the moment, looking into what interests me feels like a higher priority even if that's sort of a revenue- and engagement-minimizing move.
For example, <a href="https://mastodon.social/@danluu/111729303313181364">wc has the source of my last post at 20k words, which means that doing two passes of writing over the post might've been something like 7h40m</a>. To compare that to short-form content: a while back, I did an experiment where I tried tweeting daily for a few months, which increased my Twitter followers by ~50% (from ~20k to ~30k). The Twitter experiment probably took about as much time as typing up my last post (which doesn't include the time spent doing the work for the last post, which involved, among other things, reading five books and 15 or so papers about tire and vehicle dynamics), so from an engagement or revenue standpoint, posting to short-form platforms totally dominates the kind of writing I'm doing and anyone who cares almost at all about engagement or revenue would do the short-form posting instead of long-form writing that takes time to create. As for me, right now, I have two drafts I'm in the middle of which are more like my last post. <a href="diseconomies-scale/">For one draft, the two major things I need to finish up are writing up a summary of ~500 articles/comments for an appendix and reading a 400-page book I want to quote a few things from</a>, and <a href="why-video/">for the other, I need to finish writing up notes for ~350 pages of FTC memos</a>. Each of these drafts will turn into a blog post that's long enough that it could be a standalone book. In terms of the revenue this drives to <a href="https://www.patreon.com/danluu">my Patreon</a>, I'd be lucky if I make minimum wage from doing this, <a href="https://www.patreon.com/posts/60185075">not even including the time spent on things I research but don't publish because the result is uninteresting</a>. But I'm also a total weirdo. On average, people are going to produce content that gets eyeballs, so of course a lot more people are going to create more hastily written long [bad platform] threads than blog posts.</p> <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:V"><p>For German-language content, there was one magazine that was doing work that's not as thorough in some ways, but semi-decently close, though no one was translating that into English. Jonathan Benson not only does unprecedented-for-English reviews of tires, he also translates the German reviews into English!</p> <p>On the broader topic, unfortunately, despite video making <a href="why-benchmark/">more benchmarking</a> financially viable, there's still plenty of stuff where there's no good way to figure out what's better other than by talking to people who work in the industry, such as for <a href="https://mastodon.social/@danluu/109667084437206655">ADAS systems</a>, where the public testing is cursory at best.</p> <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> How bad are search results? Let's compare Google, Bing, Marginalia, Kagi, Mwmbl, and ChatGPT seo-spam/ Sat, 30 Dec 2023 00:00:00 +0000 seo-spam/ <p>In <a href="https://xeiaso.net/blog/birth-death-seo/">The birth &amp; death of search engine optimization</a>, Xe suggests</p> <blockquote> <p>Here's a fun experiment to try. Take an open source project such as <code>yt-dlp</code> and try to find it from a very generic term like &quot;youtube downloader&quot;. You won't be able to find it because of all of the content farms that try to rank at the top for that term.
Even though <code>yt-dlp</code> is probably actually what you want for a tool to download video from YouTube.</p> </blockquote> <p>More generally, most tech folks I'm connected to seem to think that Google search results are significantly worse than they were ten years ago (<a href="https://mastodon.social/@danluu/111506788692079608">Mastodon poll</a>, <a href="https://twitter.com/danluu/status/1730705885037801686">Twitter poll</a>, <a href="https://www.threads.net/@danluu.danluu/post/C0UnBr9vEeY">Threads poll</a>). However, there's a sizable group of vocal folks who claim that search results are still great. E.g., a bluesky thought leader who gets high engagement says:</p> <blockquote> <p>i think the rending of garments about how even google search is terrible now is pretty overblown<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup></p> </blockquote> <p>I suspect what's going on here is that some people have gotten so used to working around bad software that they don't even know they're doing it, reflexively doing the modern equivalent of <a href="https://twitter.com/danluu/status/1525988886119186432">hitting ctrl+s all the time in editors, or ctrl+a; ctrl+c when composing anything in a text box</a>. Every adept user of the modern web has a bag of tricks they use to get decent results from queries. From having watched quite a few users interact with computers, that doesn't appear to be normal, even among people who are quite competent in various technical fields, e.g., mechanical engineering<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>. However, it could be that people who are complaining about bad search result quality are <a href="https://mastodon.social/@danluu/111666646509250420">just hopping on the &quot;everything sucks&quot; bandwagon and making totally unsubstantiated comments</a> about search quality.</p> <p>Since it's fairly easy to try out straightforward, naive, queries, let's try some queries. We'll look at three kinds of queries with five search engines plus ChatGPT and we'll turn off our ad blocker to get <a href="https://twitter.com/danluu/status/1564292487744978946">the non-expert browsing experience</a>. I once had a computer get owned from browsing to a website with a shady ad, so I hope that doesn't happen here (in that case, I was lucky that I could tell that it happened because the malware was doing so much stuff to my computer that it was impossible to not notice).</p> <p>One kind of query is a selected set of representative queries a friend of mine used to set up her new computer. My friend is a highly competent engineer outside of tech and wanted help learning &quot;how to use computers&quot;, so I watched her try to set up a computer and pointed out holes in her mental model of how to interact with websites and software<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">3</a></sup>.</p> <p>The second kind of query is queries for the kinds of things I wanted to know in high school where I couldn't find the answer because everyone I asked (teachers, etc.) gave me obviously incorrect answers and I didn't know how to find the right answer.
I was able to get the right answer from various textbooks once I got to college and had access to university libraries, but the questions are simple enough that there's no particular reason a high school student shouldn't be able to understand the answers; it's just an issue of finding the answer, so we'll take a look at how easy these answers are to find. The third kind of query is a local query for information I happened to want to get as I was writing this post.</p> <p>In grading the queries, there's going to be some subjectivity here because, for example, it's not objectively clear if it's better to have moderately relevant results with no scams or very <a href="https://mastodon.social/@danluu/109967416372133687">relevant results mixed interspersed with scams that try to install badware or trick you into giving up your credit card info to pay for something you shouldn't pay for</a>. For the purposes of this post, I'm considering scams to be fairly bad, so in that specific example, I'd rate the moderately relevant results above the very relevant results that have scams mixed in. As with my <a href="car-safety/">other</a> <a href="web-bloat/">posts</a> <a href="futurist-predictions/">that</a> have some kind of subjective ranking, there's both a short summary as well as a detailed description of results, so you can rank services yourself, if you like.</p> <p>In the table below, each column is a query and each row is a search engine or ChatGPT. Results are rated (from worst to best) Terrible, Very Bad, Bad, Ok, Good, and Great, with worse results being more red and better results being more blue.</p> <p>The queries are:</p> <ul> <li>download youtube videos</li> <li>ad blocker</li> <li>download firefox</li> <li>Why do wider tires have better grip?</li> <li>Why do they keep making cpu transistors smaller?</li> <li>vancouver snow forecast winter 2023</li> </ul> <p><style>table {border-collapse:collapse;margin:1px 10px;}table,th,td {text-align:center;}</style> <table style=float:left> <tr> <th></th><th>YouTube</th><th>Adblock</th><th>Firefox</th><th>Tire</th><th>CPU</th><th>Snow</th></tr> <tr> <th>Marginalia</th><td bgcolor=#d1e5f0><font color=black>Ok</font></td><td bgcolor=#4393c3><font color=black>Good</font></td><td bgcolor=#d1e5f0><font color=black>Ok</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td></tr> <tr> <th>ChatGPT</th><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#2166ac><font color=white>Great</font></td><td bgcolor=#4393c3><font color=black>Good</font></td><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td></tr> <tr> <th>Mwmbl</th><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td></tr> <tr> <th>Kagi</th><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#d6604d><font color=black>V. 
Bad</font></td><td bgcolor=#2166ac><font color=white>Great</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td></tr> <tr> <th>Google</th><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td></tr> <tr> <th>Bing</th><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#2166ac><font color=white>Great</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#d1e5f0><font color=black>Ok</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td></tr> <tr> </tr> </table></p> <p>Marginalia does relatively well by sometimes providing decent but not great answers and then providing no answers or very obviously irrelevant answers to the questions it can't answer, with a relatively low rate of scams, lower than any other search engine (although, for these queries, ChatGPT returns zero scams and Marginalia returns some).</p> <p>Interestingly, Mwmbl lets users directly edit search result rankings. I did this for one query, which would score &quot;Great&quot; if it was scored after my edit, but <a href="car-safety/">it's easy to do well on a benchmark when you optimize specifically for the benchmark</a>, so Mwmbl's scores are without my edits to the ranking criteria.</p> <p>One thing I found interesting about the Google results was that, in addition to Google's noted propensity to return recent results, there was a strong propensity to return recent youtube videos. This caused us to get videos that seem quite useless for anybody, except perhaps the maker of the video, who appears to be attempting to get ad revenue from the video. For example, when searching for &quot;ad blocker&quot;, one of the youtube results was a video where the person rambles for 93 seconds about how you should use an ad blocker and then googles &quot;ad blocker extension&quot;. They then click on the first result and incorrectly say that &quot;it's officially from Google&quot;, i.e., the ad blocker is either made by Google or has some kind of official Google seal of approval, because it's the first result. They then ramble for another 40 seconds as they install the ad blocker. After it's installed, they incorrectly state &quot;this is basically one of the most effective ad blocker [sic] on Google Chrome&quot;. The video has 14k views. For reference, Steve Yegge spent a year making high-effort videos and his most viewed video has 8k views, with a typical view count below 2k. This person who's gaming the algorithm by making low quality videos on topics they know nothing about, who's part of the cottage industry of people making videos taking advantage of Google's algorithm prioritizing recent content regardless of quality, is dominating Steve Yegge's videos because they've found search terms that you can rank for if you put anything up. 
We'll discuss other Google quirks in more detail below.</p> <p>ChatGPT does its usual thing and impressively outperforms its more traditional competitors in one case, does an ok job in another case, refuses to really answer the question in another case, and &quot;hallucinates&quot; nonsense for a number of queries (as usual for ChatGPT, random perturbations can significantly change the results<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">4</a></sup>). It's common to criticize ChatGPT for its hallucinations and, while I don't think that's unfair, <a href="customer-service/">as we noted in this 2015, pre-LLM post on AI</a>, I find this general class of criticism to be overrated in that humans and traditional computer systems make the exact same mistakes.</p> <p>In this case, search engines return various kinds of hallucinated results. In the snow forecast example, we got deliberately fabricated results, one intended to drive ad revenue through shady ads on a fake forecast site, and another intended to trick the user into thinking that the forecast indicates a cold, snowy, winter (the opposite of the actual forecast), seemingly in order to get the user to sign up for unnecessary snow removal services. Other deliberately fabricated results include a site that's intended to look like an objective review site that's actually a fake site designed to funnel you into installing a specific ad blocker, where the ad blocker they funnel you to appears to be a scammy one that tries to get you to pay for ad blocking and doesn't let you unsubscribe, a fake &quot;organic&quot; blog post trying to get you to install a chrome extension that exposes all of your shopping to some service (in many cases, it's not possible to tell if a blog post is a fake or shill post, but in this case, they hosted the fake blog post on the domain for the product and, although it's designed to look like there's an entire blog on the topic, there isn't — it's just this one fake blog post), etc.</p> <p>There were also many results which don't appear to be deliberately fraudulent and are just run-of-the-mill SEO garbage designed to farm ad clicks. These seem to mostly be pre-LLM sites, so they don't read quite like ChatGPT hallucinations, but they're not fundamentally different. Sometimes the goal of these sites is to get users to click on ads that actually scam the user, and sometimes the goal appears to be to generate clicks to non-scam ads. Search engines also returned many seemingly non-deliberate human hallucinations, where people confidently stated incorrect answers in places where user content is highlighted, like quora, reddit, and stack exchange.</p> <p>On these queries, even ignoring anything that looks like LLM-generated text, I'd rate the major search engines (Google and Bing) as somewhat worse than ChatGPT in terms of returning various kinds of hallucinated or hallucination-adjacent results. While I don't think concerns about LLM hallucinations are illegitimate, <a href="https://twitter.com/danluu/status/1557970249948884992">the traditional ecosystem has the problem that the system highly incentivizes putting whatever is most profitable for the software supply chain in front of the user</a> which is, in general, quite different from the best result.</p> <p>For example, if your app store allows &quot;you might also like&quot; recommendations, the most valuable ad slot for apps about gambling addiction management will be gambling apps. 
Allowing gambling ads on an addiction management app is too blatantly user-hostile for any company to deliberately allow today, but of course companies that make gambling apps will try to game the system to break through the filtering and <a href="https://twitter.com/danluu/status/1588289840629833729">they sometimes succeed</a>. And for web search, I just tried this again on the web and one of the two major search engines returned, as a top result, ad-laden SEO blogspam for addiction management. At the top of the page is a multi-part ad, with the top two links being &quot;GAMES THAT PAY REAL MONEY&quot; and &quot;GAMES THAT PAY REAL CASH&quot;. In general, I was getting localized results (lots of .ca domains since I'm in Canada), so you may get somewhat different results if you try this yourself.</p> <p>Similarly, if the best result is a good, free, ad blocker like ublock origin, the top ad slot is worth a lot more to a company that makes an ad blocker designed to trick you into paying for a lower quality ad blocker with a nearly-uncancellable subscription, so the scam ad blocker is going to outbid the free ad blocker for the top ad slots. These kinds of companies also have a lot more resources to spend on direct SEO, as well as indirect SEO activities like marketing so, unless search engines mount a more effective effort to combat the profit motive, the top results will go to paid ad blockers even though the paid ad blockers are generally significantly worse for users than free ad blockers. If you talk to people who work on ranking, a lot of the biggest ranking signals are derived from clicks and engagement, but <a href="nothing-works/">this will only drive users to the best results when users are sophisticated enough to know what the best results are, which they generally aren't</a>. Human raters also rate page quality, but this has the exact same problem.</p> <p>Many Google employees have told me that ads are actually good because they inform the user about options the user wouldn't have otherwise known about, but anyone who tries browsing without an ad blocker will see ads that are misleading in various ways, ads that try to trick or entrap the user by, for example, pretending to be a window, or advertising &quot;GAMES THAT PAY REAL CASH&quot; at the top of a page on battling gambling addiction, which has managed to SEO itself to a high ranking on gambling addiction searches. In principle, these problems could be mitigated with enough resources, but we can observe that trillion dollar companies have chosen not to invest enough resources in combating SEO, spam, etc., to make these kinds of scam ads rare. Instead, a number of top results are actually ads that direct you to scams.</p> <p>In their original PageRank paper, Sergey Brin and Larry Page noted that ad-based search is inherently not incentive aligned with providing good results:</p> <blockquote> <p>Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is &quot;The Effect of Cellular Phone Use Upon Driver Attention&quot;, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving.
This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the Consumers.</p> <p>Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who &quot;deserves&quot; to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias are likely to be tolerated by the market. ... This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline’s homepage when the airline’s name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines ... we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.</p> </blockquote> <p>Of course, Google is now dominated by ads and, despite specifically calling out the insidiousness of users conflating real results with paid results, <a href="https://twitter.com/danluu/status/1557970259142791168">both Google and Bing have made ads look more and more like real search results</a>, to the point that most users usually won't know that they're clicking on ads and not real search results. By the way, this propensity for users to think that everything is an &quot;organic&quot; search result is the reason that, in this post, results are ordered by the order they appear on the page, so if four ads appear above the first organic result, the four ads will be rank 1-4 and the organic result will be ranked 5. I've heard Google employees say that AMP didn't impact search ranking because it &quot;only&quot; controlled what results went into the &quot;carousel&quot; that appeared above search results, as if inserting a carousel and then a bunch of ads above results, pushing results down below the fold, has no impact on how the user interacts with results.
It's also common to see search engines ransoming the top slot for companies, so that companies that don't buy the ad for their own name end up with searches for that company putting their competitors at the top, which is also said to not impact search result ranking, a technically correct claim that's basically meaningless to the median user.</p> <p>When I tried running the query from the paper, &quot;cellular phone&quot; (no quotes), the top result was a Google Store link to buy Google's own Pixel 7, with the rest of the top results being various Android phones sold on Amazon. That's followed by the Wikipedia page for Mobile Phone, and then a series of commercial results all trying to sell you phones or SEO-spam trying to get you to click on ads or buy phones via their links (the next 7 results were commercial, with the next result after that being an ad-laden SEO blogspam page for the definition of a cell phone with ads of cell phones on it, followed by 3 more commercial results, followed by another ad-laden definition of a phone). The commercial links seem very low quality, e.g., the top link below the carousel after wikipedia is Best Buy's Canadian mobile phone page. The first two products there are ad slots for eufy's version of the AirTag. The next result is for a monthly financed iPhone that's tied to Rogers, the next for a monthly financed Samsung phone that's tied to TELUS, then we have Samsung's AirTag, a monthly financed iPhone tied to Freedom Mobile, a monthly financed iPhone tied to Freedom Mobile in a different color, a monthly financed iPhone tied to Rogers, a screen protector for the iPhone 13, another Samsung AirTag product, an unlocked iPhone 12, a Samsung wall charger, etc.; it's an extremely low quality result with products that people shouldn't be buying (and, based on the number of reviews, aren't buying — the modal number of reviews of the top products is 0 and the median is 1 or 2 even though there are plenty of things people do actually buy from Best Buy Canada and plenty of products that have lots of reviews). The other commercial results that show up are also generally extremely low quality results. The result that Sergey and Larry suggested was a great top result, &quot;The Effect of Cellular Phone Use Upon Driver Attention&quot;, is nowhere to be seen, buried beneath an avalanche of commercial results. On the other side of things, Google has also gotten into the action by buying ads that trick users, <a href="https://twitter.com/danluu/status/887724695558205440">such as paying for an installer to try to trick users into installing Chrome over Firefox</a>.</p> <p>Anyway, after looking at the results of our test queries, some questions that come to mind are:</p> <ul> <li>How is Marginalia, a search engine built by a single person, so good?</li> <li>Can Marginalia or another small search engine displace Google for mainstream users?</li> <li>Can a collection of small search engines provide better results than Google?</li> <li>Will Mwmbl's user-curation approach work?</li> <li>Would a search engine like 1996-Metacrawler, which aggregates results from multiple search engines, ChatGPT, Bard, etc., significantly outperform Google?</li> </ul> <p>The first question could easily be its own post and this post is already 17000 words, so maybe we'll examine it another time.
We've previously noted that <a href="people-matter/">some</a> <a href="why-benchmark/">individuals</a> <a href="https://twitter.com/danluu/status/1586508706774388736">can</a> be very productive, but of course the details vary in each case.</p> <p>On the second question, <a href="sounds-easy/">we looked at a similar question in 2016</a>, both the general version, &quot;I could reproduce this billion dollar company in a weekend&quot;, as well as specific comments about how open source software would make it trivial to surpass Google any day now, such as</p> <blockquote> <p>Nowadays, most any technology you need is indeed available in OSS and in state of the art. Allow me to plug meta64.com (my own company) as an example. I am using Lucene to index large numbers of news articles, and provide search into them, by searching a Lucene index generated by simple scraping of RSS-crawled content. I would claim that the Lucene technology is near optimal, and this search approach I'm using is nearly identical to what a Google would need to employ. The only true technology advantage Google has is in the sheer number of servers they can put online, which is prohibitively expensive for us small guys. But from a software standpoint, Google will be overtaken by technologies like mine over the next 10 years I predict.</p> </blockquote> <p>and</p> <blockquote> <p>Scaling things is always a challenge but as long as Lucene keeps getting better and better there is going to be a point where Google's advantage becomes irrelevant and we can cluster Lucene nodes and distribute search related computations on top and then use something like Hadoop to implement our own open source ranking algorithms. We're not there yet but technology only gets better over time and the choices we as developers make also matter. Even though Amazon and Google look like unbeatable giants now don't discount what incremental improvements can accomplish over a long stretch of time and in technology it's not even that long a stretch. It wasn't very long ago when Windows was the reigning champion. Where is Windows now?</p> </blockquote> <p>In that 2016 post, we saw that people who thought that open source solutions were set to surpass Google any day now appeared to have no idea how many hard problems must be solved to make a mainstream competitor to Google, including real-time indexing of rapidly-updated sites, like Twitter, newspapers, etc., as well as table-stakes level NLP, which is extremely non-trivial. Since 2016, these problems have gotten significantly harder as there's more real-time content to index and users expect much better NLP. The number of things people expect out of their search engine has increased as well, making the problem harder still, so it still appears to be quite difficult to displace Google as a mainstream search engine for, say, a billion users.</p> <p>On the other hand, if you want to make a useful search engine for a small number of users, that seems easier than ever because Google returns worse results than it used to for many queries. In our test queries, we saw a number of queries where many or most top results were filled with SEO garbage, a problem that was significantly worse than it was a decade ago, even before the rise of LLMs and that continues to get worse. 
I typically use search engines in a way that doesn't run into this, but when I look at what &quot;normal&quot; users query or if I try naive queries myself, as I did in this post, most results are quite poor, which didn't use to be true.</p> <p>Another place Google now falls over for me is when finding non-popular pages. I often find that, when I want to find a web page and I correctly remember the contents of the page, even if I do an exact string search, Google won't return the page. Either the page isn't indexed, or the page is effectively not indexed because it lives in some slow corner of the index that doesn't return in time. In order to find the page, I have to remember some text in a page that links to the page (often many clicks removed from the actual page, not just one, so I'm really remembering a page that links to a page that links to a page that links to a page that links to a page and then using archive.org to traverse the links that are now dead), search for that, and then manually navigate the link graph to get to the page. This basically never happened when I searched for something in 2005 and rarely happened in 2015, but this now happens a large fraction of the time I'm looking for something. Even in 2015, Google wasn't actually comprehensive. Just for example, Google search didn't index every tweet. But, at the time, I found Google search better at searching for tweets than Twitter search and I basically never ran across a tweet I wanted to find that wasn't indexed by Google. But now, most of the tweets I want to find aren't returned by Google search<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">5</a></sup>, even when I search for &quot;[exact string from tweet] site:twitter.com&quot;. In the original PageRank paper, Sergey and Larry said &quot;Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now.&quot; (and that, while machines can generate an effectively infinite amount of content, just indexing human-generated content seems very useful). Pre-LLM, Google certainly had the resources to index every tweet as well as every human generated utterance on every public website, but they seem to have chosen to devote their resources elsewhere and, relative to its size, the public web appears less indexed than ever, or at least less indexed than it's been since the very early days of web search.</p> <p>Back when Google returned decent results for simple queries and indexed almost any public page I'd want to find, it would've been very difficult for an independent search engine to return results that I find better than Google's. Marginalia in 2016 would've been nothing more than a curiosity for me since Google would give good-enough results for basically anything where Marginalia returns decent results, and Google would give me the correct result in queries for every obscure page I searched for, something that would be extremely difficult for a small engine.
But now that Google effectively doesn't index many pages I want to search for, the relatively small indices that independent search engines have don't make them non-starters for me and some of them return less SEO garbage than Google, making them better for my use since I generally don't care about real-time results, don't need fancy NLP (and find that much of it actually makes search results worse for me), don't need shopping integrated into my search results, rarely need image search with understanding of images, etc.</p> <p>On the question of whether or not a collection of small search engines can provide better results than Google for a lot of users, I don't think this is much of a question because the answer has been a resounding &quot;yes&quot; for years. However, many people don't believe this is so. For example, a Google TLM replied to the bluesky thought leader at the top of this post with</p> <blockquote> <p>Somebody tried argue that if the search space were more competitive, with lots of little providers instead of like three big ones, then somehow it would be *more* resistant to ML-based SEO abuse.</p> <p>And... look, if *google* can't currently keep up with it, how will Little Mr. 5% Market Share do it?</p> </blockquote> <p>presumably referring to arguments like <a href="https://buttondown.email/hillelwayne/archive/algorithm-monocultures/">Hillel Wayne's &quot;Algorithm Monocultures&quot;</a>, to which our bluesky thought leader replied</p> <blockquote> <p>like 95% of the time, when someone claims that some small, independent company can do something hard better than the market leader can, it’s just cope. economies of scale work pretty well!</p> </blockquote> <p>In the past, <a href="nothing-works/">we looked at some examples where the market leader provides a poor product and various other players, often tiny, provide better products</a> and in a future post, we'll look at how economies of scale and diseconomies of scale interact in various areas for tech but, for this post, suffice it to say that, despite the common &quot;econ 101&quot; <a href="cocktail-ideas/">cocktail party idea</a> that economies of scale should be the dominant factor in search quality, that doesn't appear to be the case when we look at actual results.</p> <p>On the question of whether or not Mwmbl's user-curated results can work, I would guess no, or at least not without a lot more moderation. Just browsing to Mwmbl shows the last edit to ranking was by user &quot;betest&quot;, who added some kind of blogspam as the top entry for &quot;RSS&quot;. It appears to be possible to revert the change, but there's no easily findable way to report the change or the user as spammy.</p> <p>On the question of whether or not something like Metacrawler, which aggregated results from multiple search engines, would produce superior results today, that's arguably irrelevant since it would either be impossible to legally run as a commercial service or require prohibitive licensing fees, but it seems plausible that, from a technical standpoint, a modern metacrawler would be fairly good today.
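To make the aggregation idea concrete, here's a minimal sketch of the merge step such a metacrawler could use, reciprocal rank fusion over each engine's ranked list (this isn't a description of how Metacrawler actually worked, and the engine names, URLs, and result lists below are made-up placeholders; a real system would fetch live results): <pre><code>from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Merge several ranked lists of URLs into one list. Each URL's score is
    # the sum of 1 / (k + rank) over every list it appears in; k=60 is the
    # constant commonly used with reciprocal rank fusion.
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder result lists for an "ad blocker" query (illustrative only).
engine_a = ["https://ublockorigin.com", "https://seo-farm.example", "https://ghostery.com"]
engine_b = ["https://scam-blocker.example", "https://ublockorigin.com"]
engine_c = ["https://ublockorigin.com", "https://ghostery.com"]

print(reciprocal_rank_fusion([engine_a, engine_b, engine_c]))
# A result that several engines rank highly floats to the top, which is the
# basic appeal of aggregation.
</code></pre>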
Metacrawler quickly became irrelevant because Google returned significantly better results than you would get by aggregating results from other search engines, but it doesn't seem like that's the case today.</p> <p>Going back to the debate between folks like Xe, who believe that straightforward search queries are inundated with crap, and our thought leader, who believes that &quot;the rending of garments about how even google search is terrible now is pretty overblown&quot;, it appears that Xe is correct. Although Google doesn't publicly provide the ability to see what was historically returned for queries, many people remember when straightforward queries generally returned good results. One of the reasons Google took off so quickly in the 90s, even among expert users of AltaVista, who'd become very adept at adding all sorts of qualifiers to queries to get good results, was that you didn't have to do that with Google. But we've now come full circle and we need to add qualifiers, restrict our search to specific sites, etc., to get good results from Google on what used to be simple queries. If anything, we've gone well past full circle since the contortions we need to get good results are a lot more involved than they were in the AltaVista days.</p> <p><i><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl">If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer. I'm an investor in the company, so you should take this with the usual grain of salt, but if you're looking to join a fast growing early-stage startup, they seem to have found product-market fit and have been growing extremely quickly (revenue-wise).</a></i></p> <p><i>Thanks to Laurence Tratt, Heath Borders, Justin Blank, Brian Swetland, Viktor Lofgren (who, BTW, I didn't know before writing this post — I only reached out to him to discuss the Marginalia search results after running the queries), Misha Yagudin, @hpincket@fosstodon.org, Jeremy Kun, and Yossi Kreinin for comments/corrections/discussion</i></p> <h2 id="appendix-other-search-engines">Appendix: Other search engines</h2> <ul> <li>DuckDuckGo: in the past, when I've compared DDG to Bing while using an ad blocker, the results have been very similar. I also tried DDG here and, removing the Bing ads, the results aren't as similar as they used to be, but they were still similar enough that it didn't seem worth listing DDG results. I use DDG as my default search engine and I think, like Google, it works fine if you know how to query but, for the kinds of naive queries in this post, it doesn't fare particularly well.</li> <li>wiby.me: Like Marginalia, this is another search engine made for finding relatively obscure results. I tried four of the above queries on wiby and the results were interesting, in that they were really different from what I got from any other search engine, but wiby didn't return relevant results for the queries I tried.</li> <li>searchmysite.net: Somewhat relevant results for some queries, but not as relevant as Marginalia. Many fewer scams and ad-laden pages than Google, Bing, and Kagi.</li> <li>indieweb-search.jamesg.blog: seemed to be having an outage. &quot;Your request could not be processed due to a server error.&quot; for every query.</li> <li>Teclis: The search box is still there, but any query results in &quot;Teclis.com is closed due to bot abuse.
Teclis results are still available through Kagi's search results, explicitly through the 'Non-commercial Web' lens and also as an API.&quot;. A note on the front page reads &quot;Teclis results are disabled on the site due to insane amount of bot traffic (99.9% traffic were bots).&quot;</li> </ul> <h2 id="appendix-queries-that-return-good-results">Appendix: queries that return good results</h2> <p>I think that most programmers are likely to be able to get good results to every query, except perhaps the tire width vs. grip query, so here's how I found an ok answer to the tire query:</p> <p>I tried a youtube search, since <a href="https://mastodon.social/@danluu/111441790762754806">a lot of the best car-related content is now on youtube</a>. A youtube video whose title claims to answer the question (the video doesn't actually answer the question) has a comment recommending Carroll Smith's book &quot;Tune To Win&quot;. The comment claims that chapter 1 explains why wider tires have more grip, but I couldn't find an explanation anywhere in the book. Chapter 1 does note that race cars typically run wider tires than passenger cars and that passenger cars are moving towards having wider tires and it makes some comments about slip angle that give a sketch of an intuitive reason for why you'd end up with better cornering with a wider contact patch, but I couldn't find a comment that explains differences in braking. Also, the book notes that the primary reason for the wider contact patch is that it (indirectly) allows for less heat buildup, which then lets you design tires that operate over a narrower temperature range, which allows for softer rubber. That may be true, but it doesn't explain much of the observed behavior one might wonder about.</p> <p>Tune to Win recommends Kummer's The Unified Theory of Tire and Rubber Friction and Hays and Brooke's (actually Browne, but Smith incorrectly says Brooke) The Physics of Tire Traction. Neither of these really explained what's happening either, but looking for similar books turned up <a href="https://mastodon.social/@danluu/111572194243940891">Milliken and Milliken's Race Car Vehicle Dynamics</a>, which also didn't really explain why but seemed closer to having an explanation. Looking for books similar to Race Car Vehicle Dynamics turned up Guiggiani's The Science of Vehicle Dynamics, which did get at how to think about and model a number of related factors. The last chapter of Guiggiani's book refers to something called the &quot;brush model&quot; (of tires) and searching for &quot;brush model tire width&quot; turned up a reference to Pacejka's Tire and Vehicle Dynamics, which does start to explain why wider tires have better grip and what kind of modeling of tire and vehicle dynamics you need to do to explain easily observed tire behavior.</p> <p>As we've noted, people have different tricks for getting good results so, if you have a better way of getting a good result here, I'd be interested in hearing about it. But note that, basically every time I have a post that notes that something doesn't work, the most common suggestion will be to do something that's commonly suggested that doesn't work, even though the post explicitly notes that the commonly suggested thing doesn't work.
For example, the most common comment I receive about <a href="file-consistency/">this post on filesystem correctness</a> is that you can get around all of this stuff by doing the rename trick, even though the post explicitly notes that this doesn't work, explains why it doesn't work, and references a paper which discusses why it doesn't work. A few years later, <a href="deconstruct-files/">I gave an expanded talk on the subject</a>, where I noted that people kept suggesting this thing that doesn't work and the most common comment I get on the talk is that you don't need to bother with all of this stuff because you can just do the rename trick (and no, ext4 having <code>auto_da_alloc</code> doesn't mean that this works since you can only do it if you check that you're on a compatible filesystem which automatically replaces the incorrect code with correct code, at which point it's simpler to just write the correct code). If you have a suggestion for the reason wider tires have better grip or for a search which turns up an explanation, please consider making sure that the explanation is not <a href="#why-do-wider-tires-have-better-grip">one of the standard incorrect explanations noted in this post and that the explanation can account for all of the behavior that one must be able to account for if one is explaining this phenomenon</a>.</p> <p>On how to get good results for other queries, since this post is already 17000 words, I'll leave that for a future post on how expert vs. non-expert computer users interact with computers.</p> <h2 id="appendix-summary-of-query-results">Appendix: summary of query results</h2> <p>For each question, answers are ordered from best to worst, with the metric being my subjective impression of how good the result is. These queries were mostly run in November 2023, although a couple were run in mid-December. When I'm running queries, I very rarely write natural language queries myself. However, normal users often write natural language queries, so I arbitrarily did the &quot;Tire&quot; and &quot;Snow&quot; queries as natural queries. Continuing with the theme of running simple, naive, queries, we used the free version of ChatGPT for this post, which means the queries were run through ChatGPT 3.5. Ideally, we'd run the full matrix of queries using keyword and natural language queries for each query, run a lot more queries, etc., but this post is already 17000 words (converting to pages of a standard length book, that would be something like 70 pages), so running the full matrix of queries with a few more queries would pretty quickly turn this into a book-length post. For work and for certain kinds of data analysis, I'll sometimes do projects that are that comprehensive or more comprehensive, but here, we can't cover anything resembling a comprehensive set of queries and the best we can do is to just try a handful of queries that seem representative and use our judgment to decide if this matches the kind of behavior we and other people generally see, so I don't think it's worth doing something like 4x the work to cover marginally more ground.</p> <p>For the search engines, all queries were run in a fresh incognito window with cleared cookies, with the exception of Kagi, which doesn't allow logged-out searches. 
For Kagi, the queries were done with a fresh account with no custom personalization or filters, although they were done in sequence with the same account, so it's possible some kind of personalized ranking was applied to the later queries based on the clicks in the earlier queries. These queries were done in Vancouver, BC, which seems to have applied some kind of localized ranking on some search engines.</p> <ul> <li>download youtube videos <ul> <li>Ideally, the top hit would be <code>yt-dlp</code> or a thin, graphical, wrapper around <code>yt-dlp</code>. Links to <code>youtube-dl</code> or other less frequently updated projects would also be ok.</li> <li>Great results (<code>yt-dlp</code> as a top hit, maybe with <code>youtube-dl</code> in there somewhere, and no scams): none</li> <li>Good results (<code>youtube-dl</code> as a top hit, maybe with <code>yt-dlp</code> in there somewhere, and no scams): none</li> <li>Ok results (<code>youtube-dl</code> as a top hit, maybe with <code>yt-dlp</code> in there somewhere, and fewer scams than other search engines): <ul> <li>Marginalia: Top link is for <code>youtube-dl</code>. Most links aren't relevant. Many fewer scams than the big search engines</li> </ul></li> <li>Bad results (has some useful links, but also links to a lot of scams): <ul> <li>Mwmbl: Some links to bad sites and scams, but fewer than the big search engines. Also has one indirect link to <code>youtube-dl</code> in the top 10 and one for a GUI for <code>youtube-dl</code></li> <li>Kagi: Mostly links to scammy sites but does have, a couple pages down, a web.archive.org link to the 2010 version of <code>youtube-dl</code></li> </ul></li> <li>Very bad results (fails to return any kind of useful result): <ul> <li>ChatGPT: basically refuses to answer the question, although you can probably prompt engineer your way to an answer if you don't just naively ask the question you want answered</li> </ul></li> <li>Terrible results (fails to return any kind of useful result and is full of scams): <ul> <li>Google: Mostly links to sites that try to scam you or charge you for a worse version of free software. Some links to ad-laden listicles which don't have good suggestions. Zero links to good results. Also links to various youtube videos that are the youtube equivalent of blogspam.</li> <li>Bing: Mostly links to sites that try to scam you or charge you for a worse version of free software. Some links to ad-laden listicles which don't have good suggestions. Arguably zero links to good results (although one could make a case that result #10 is an ok result despite seeming to be malware).</li> </ul></li> </ul></li> <li>ad blocker <ul> <li>Ideally, the top link would be to ublock origin. Failing that, having any link to ublock origin would be good</li> <li>Great results (ublock origin is top result, no scams): <ul> <li>ChatGPT: First suggestion is ublock origin</li> </ul></li> <li>Good results (ublock origin is high up, but not the top result; results above ublock origin are either obviously not ad blockers or basically work without payment even if they're not as good as ublock origin; no links that directly try to scam you): none</li> <li>Ok results (ublock origin is in there somewhere, not many scams, and fewer than other search engines): <ul> <li>Marginalia: 3rd and 4th results get you to ublock origin and 8th result is ublock origin.
Nothing that appears to try to scam you directly and &quot;only&quot; one link to some kind of SEO ad farm scam (which is much better than the major search engines)</li> </ul></li> <li>Bad results (no links to ublock origin and mostly links to things that paywall good features or ad blockers that deliberately let ads through by default): <ul> <li>Mwmbl: Lots of irrelevant links and some links to ghostery. One scam link, so fewer scams than commercial search engines</li> </ul></li> <li>Very bad results (exclusively or almost exclusively link to ad blockers that paywall good features or, by default, deliberately let through ads): <ul> <li>Google: lots of links to ad blockers that &quot;participate in the Acceptable Ads program, where publishers agree to ensure their ads meet certain criteria&quot; (this isn't mentioned in the text, but if you look into it, the main revenue source for companies that do this is advertisers paying the &quot;ad blocker&quot; company to not block their ads, making the &quot;ad blocker&quot; not only not an ad blocker, but very much not incentive aligned with users). Some links to things that appear to be scams. Zero links to ublock origin. Also links to various youtube videos that are the youtube equivalent of blogspam.</li> <li>Kagi: similar to Google, but with more scams, though fewer than Bing</li> </ul></li> <li>Terrible results (exclusively or almost exclusively link to ad blockers that paywall good features or, by default, deliberately let through ads and has a significant number of scams): <ul> <li>Bing: similar to Google, but with more scams and without youtube videospam</li> </ul></li> </ul></li> <li>download Firefox <ul> <li>Ideally, we'd get links to download firefox with no fake or scam links</li> <li>Great results (links to download firefox; no scams): <ul> <li>Bing: links to download Firefox</li> <li>Mwmbl: links to download firefox</li> <li>Kagi: links to download firefox</li> </ul></li> <li>Good: <ul> <li>ChatGPT: this is a bit funny to categorize, since these are technically incorrect instructions, but a human should easily be able to decode the instructions and download firefox</li> </ul></li> <li>Ok results (some kind of indirect links to download firefox; no scams): <ul> <li>Marginalia: indirect links to Firefox download instructions that get you to a firefox download</li> </ul></li> <li>Bad results (links to download firefox, with scams): <ul> <li>Google: top links are all legitimate, but the #7 result is a scam that tries to get you to install badware and the #10 result is an ad that appears to be some kind of scam that wants your credit card info.</li> </ul></li> </ul></li> <li>Why do wider tires have better grip?
<ul> <li>Ideally, would link to an explanation that clearly explains why and doesn't have an incomplete explanation that can't explain a lot of commonly observed behavior</li> <li>Great / Good / Ok results: none</li> <li>Bad results (no results or a very small number of obviously incorrect results): <ul> <li>Mwmbl: one obviously incorrect result and no other results</li> <li>Marginalia: two obviously incorrect results and no other results</li> </ul></li> <li>Very bad results: (a very small number of semi-plausible incorrect results) <ul> <li>ChatGPT: standard ChatGPT &quot;hallucination&quot; that's probably plausible to a lot of people (it sounds like a lot of incorrect internet comments on the topic, but better written)</li> </ul></li> <li>Terrible results (lots of semi-plausible incorrect results, often on ad farms): <ul> <li>Google / Bing / Kagi: incorrect ad-laden results with the usual rate of scammy ads</li> </ul></li> </ul></li> <li>Why do they keep making cpu transistors smaller? <ul> <li>Ideally, would link to an explanation that clearly explains why. The best explanations I've seen are in <a href="https://en.wikipedia.org/wiki/Very_Large_Scale_Integration">VLSI</a> textbooks, but I've also seen very good explanations in lecture notes and slides</li> <li>Great results (links to a very good explanation, no scams): none</li> <li>Good results (links to an ok explanation, no scams): none</li> <li>Ok results (links to something you can then search on further and get a good explanation if you're good at searching and doesn't rank bad or misleading explanations above the ok explanation): <ul> <li>Bing: top set of links had a partial answer that could easily be turned into links to correct answers via more searching. Also had a lot of irrelevant answers and ad-laden SEO'd garbage</li> </ul></li> <li>Bad results (no results or a small number of obviously irrelevant results or lots of semi-plausible wrong results with an ok result somewhere): <ul> <li>Marginalia: no answers</li> <li>Mwmbl: one obviously irrelevant answer</li> <li>Google: 5th link has the right keywords to maybe find the right answer with further searches. Most links have misleading or incorrect partial answers. Lots of links to Quora, which don't answer the question. Also lots of links to other bad SEO'd answers</li> <li>Kagi: 10th link has a fairly direct path to getting the correct answer, if you scroll down far enough on the 10th link. Other links aren't good.</li> </ul></li> <li>Very bad results: <ul> <li>ChatGPT: doesn't really answer the question. 
Asking ChatGPT to explain its answers further causes it to &quot;hallucinate&quot; incorrect reasons.</li> </ul></li> </ul></li> <li>vancouver snow forecast winter 2023 <ul> <li>I'm not sure what the ideal answer is, but a pretty good one would be a link to Environment Canada's snow forecast, predicting significantly below normal snow (and above normal temperatures)</li> <li>Great results (links to Environment Canada winter 2023 multi-month snow forecast as top result or something equivalently good): none</li> <li>Good results: none</li> <li>Ok results (links to some kind of semi-plausible winter snow forecast that isn't just made-up garbage to drive ad clicks): none</li> <li>Bad results (no results or obviously irrelevant results): <ul> <li>Marginalia: no results</li> <li>ChatGPT: incorrect results, but when I accidentally prepended my question with &quot;User\n&quot;, then it returned a link to the right website (but in a way that would make it quite difficult to navigate to a decent result), so perhaps a slightly different prompt would pseudo-randomly cause an ok result here?</li> <li>Mwmbl: a bunch of obviously irrelevant results</li> </ul></li> <li>Very bad results: none</li> <li>Terrible results (links to deliberately faked forecast results): <ul> <li>Bing: mostly irrelevant results. The top seemingly-relevant result is the 5th link, but it appears to be some kind of scam site that fabricates fake weather forecasts and makes money by serving ads on the heavily SEO'd site</li> <li>Kagi: top 4 results are from the scam forecast site that's Bing's 5th link</li> <li>Google: mostly irrelevant results and the #1 result is a fake answer from a local snow removal company that projects significant snow and cold weather in an attempt to get you to unnecessarily buy snow removal service for the year. Other results are SEO'd garbage that's full of ads <br /></li> </ul></li> </ul></li> </ul> <h2 id="appendix-detailed-query-results">Appendix: detailed query results</h2> <h3 id="download-youtube-videos">Download youtube videos</h3> <p>For our first query, we'll search &quot;download youtube videos&quot; (Xe's suggested search term, &quot;youtube downloader&quot;, returns very similar results). The ideal result is <code>yt-dlp</code> or a thin, free, wrapper around <code>yt-dlp</code>. <code>yt-dlp</code> is a fork of <code>youtube-dlc</code>, which is a now defunct fork of <code>youtube-dl</code>, which seems to have very few updates nowadays. A link to one of these older downloaders also seems ok if they still work.</p> <h4 id="google">Google</h4> <ol> <li>Some youtube downloader site. Has lots of assurances that the website and the tool are safe because they've been checked by &quot;Norton SafeWeb&quot;. Interacting with the site at all prompts you to install a browser extension and enable notifications. Trying to download any video gives you a full page pop-over for extension installation for something called CyberShield. There appears to be no way to dismiss the popover without clicking on something to try to install it. After going through the links but then choosing not to install CyberShield, no video downloads. Googling &quot;cybershield chrome extension&quot; returns a knowledge card with &quot;Cyber Shield is a browser extension that claims to be a popup blocker but instead displays advertisements in the browser.
When installed, this extension will open new tabs in the browser that display advertisements trying to sell software, push fake software updates, and tech support scams.&quot;, so CyberShield appears to be badware.</li> <li>Some youtube downloader site. Interacting with the site causes a pop-up prompting you to download their browser extension. Putting a video URL in causes a pop-up to some scam site but does also cause the video to download, so it seems to be possible to download youtube videos here if you're careful not to engage with the scams the site tries to trick you into interacting with</li> <li>PC Magazine listicle on ways to download videos from youtube. Top recommendations are paying for youtube downloads, VLC (which they note didn't work when they tried it), some $15/yr software, some $26/yr software, &quot;FlixGrab&quot;, then a warning about how the downloader websites are often scammy and they don't recommend any downloader website. The article has more than one ad per suggestion.</li> <li>Some youtube downloader site with shady pop-overs that try to trick you into clicking on ads before you even interact with the page</li> <li>Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads</li> <li>Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads, e.g., &quot;Samantha 24, vancouver | I want sex, write to WhatsApp | Close / Continue&quot;. Clicking anything (any button, or anywhere else on the site) tries to get you to install something called &quot;Adblock Ultimate&quot;</li> <li>ZDNet listicle. First suggestion is ClipGrab, which apparently bundles a bunch of malware/adware/junkware with the installer: <a href="https://www.reddit.com/r/software/comments/w9o1by/warning_about_clipgrab/">https://www.reddit.com/r/software/comments/w9o1by/warning_about_clipgrab/</a>. The listicle is full of ads and has an autoplay video</li> <li>[YouTube video] Over 2 minutes of ads followed by a video on how to buy youtube premium (2M views on video)</li> <li>[YouTube video] Video that starts off by asking users to watch the whole video (some monetization thing?). The video tries to funnel you to some kind of software to download videos that costs money</li> <li>[YouTube video] PC Magazine video saying that you probably don't &quot;have to&quot; download videos since you can use the share button, and then suggests reading their story (the one in result #3) on how to download videos</li> <li>Some youtube downloader site with scam ads. Interacting with the site at all tries to get you to install &quot;Adblock Ultimate&quot;</li> <li>Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads</li> <li>Some youtube downloader site with scam ads</li> </ol> <p>Out of 10 &quot;normal&quot; results, we have 9 that, in one way or another, try to get you to install badware or are linked to some other kind of ad scam. One page doesn't do this, but it also doesn't suggest the good, free, option for downloading youtube videos and instead suggests a number of paid solutions. We also had three youtube videos, all of which seem to be the video equivalent of SEO blogspam. Interestingly, we didn't get a lot of ads from Google itself <a href="https://twitter.com/danluu/status/823623691246239744">despite that happening the last time I tried turning off my ad blocker to do some Google test queries</a>.</p> <h4 id="bing">Bing</h4> <ol> <li>Some youtube downloader site.
This is google (2), which has ads for scam sites</li> <li>[EXPLORE FURTHER ... &quot;Recommended to you based on what's popular&quot;] Some youtube download site, not one we saw from google. Site has multiple pulsing ads and bills itself as &quot;50% off&quot; for Christmas (this search was done in mid-November). Trying to download any video pulls up a fake progress bar with a &quot;too slow? Try [our program] link&quot;. After a while, a link to download the video appears, but it's a trick, and when you click it, it tries to install &quot;oWebster Search extension&quot;. Googling &quot;oWebster Search extension&quot; indicates that it's badware that hijacks your browser to show ads. Two of the top three hits are how to install the extension and the rest of the top hits are how to remove this badware. Many of the removal links are themselves scams that install other badware. After not installing this badware, clicking the download link again results in a pop-over that tries to get you to install the site's software. If you dismiss the pop-over and click the download link again, you just get the pop-over link again, so this site appears to be a pure scam that doesn't let you download videos</li> <li>[EXPLORE FURTHER]. Interacting with the site pops up fake ads with photos of attractive women who allegedly want to chat with you. Clicking the video download button tries to get you to install a <a href="https://github.com/gorhill/uBlock/issues/3027#issuecomment-330077439">copycat ad blocker</a> that <a href="https://www.reddit.com/r/assholedesign/comments/h8pdar/this_ad_blocker_that_gives_you_ads/">displays extra pop-over ads</a>. The site does seem to actually give you a video download, though</li> <li>[EXPLORE FURTHER] Same as (3)</li> <li>[EXPLORE FURTHER] Same as Google (1) (that NortonSafeWeb youtube downloader site that tries to scam you)</li> <li>[EXPLORE FURTHER] A site that converts videos to MP4. I didn't check to see if the site works or is just a scam as the site doesn't even claim to let you download youtube videos</li> <li>Google (1), again. That NortonSafeWeb youtube downloader site that tries to scam you.</li> <li>[EXPLORE FURTHER] A link to youtube.com (the main page)</li> <li>[EXPLORE FURTHER] Some youtube downloader site with a popover that tries to trick you into clicking on an ad. Closing that reveals 12 more ads. There's a scam ad that's made to look like a youtube downloader button. If you scroll past that, there's a text box and a button for trying to download a youtube video. Entering a valid URL results in an error saying there's no video that URL.</li> <li>Gigantic card that actually has a download button. The download button is fake and just takes you to the site. The site loudly proclaims that the software is not adware, spyware, etc.. Quite a few internet commenters note that their antivirus software tags this software as malware. A lot of comments also indicate that the software doesn't work very well but sometimes works. The site for the software has a an embedded youtube video, which displays &quot;This video has been removed for violating YouTube's Terms of Service&quot;. Oddly, the download links for mac and Linux are not for this software and in fact don't download anything at all and are installation instructions for <code>youtube-dl</code>; perhaps this makes sense if the windows version is actually malware. The windows download button takes you to a page that lets you download a windows executable. 
There's also a link to some kind of ad-laden page that tries to trick you into clicking on ads that look like normal buttons</li> <li>PC magazine listicle</li> <li>An ad for some youtube downloader program that claims &quot;345,764,132 downloads today&quot;; searching the name of this product on reddit seems to indicate that it's malware</li> <li>Ad for some kind of paid downloader software</li> </ol> <p>That's the end of the first page.</p> <p>Like Google, no good results and a lot of scams and software that may not be a scam but is some kind of lightweight skin around an open source project that charges you instead of letting you use the software for free.</p> <h4 id="marginalia">Marginalia</h4> <ol> <li>12-year old answer suggesting youtube-dl, which links to a URL which has been taken down and replaced with &quot;Due to a ruling of the Hamburg Regional Court, access to this website is blocked.&quot;</li> <li>Some SEO'd article, like you see on normal search engines</li> <li>Leawo YouTube Downloader (I don't know what this is, but a quick search at least doesn't make it immediately obvious that this is some kind of badware, unlike the Google and Bing results)</li> <li>Some SEO'd listicle, like you see on normal search engines</li> <li>Bug report for some random software</li> <li>Some random blogger's recommendation for &quot;4K Video Downloader&quot;. A quick search seems to indicate that this isn't a scam or badware, but it does lock some features behind a paywall, and is therefore worse than <code>yt-dlp</code> or some free wrapper around <code>yt-dlp</code></li> <li>A blog post on how to install and use <code>yt-dlp</code>. The blogpost notes that it used to be about <code>youtube-dl</code>, but has been updated to <code>yt-dlp</code>.</li> <li>More software that charges you for something you can get for free, although searching for this software on reddit turns up cracks for it</li> <li>A listicle with bizarrely outdated recommendations, like RealPlayer. The entire blog seems to be full of garbage-quality listicles.</li> <li>A script to download youtube videos for something called &quot;keyboard maestro&quot;, which seems useful if you already use that software, but seems like a poor solution to this problem if you don't already use this software.</li> </ol> <p>The best results by a large margin. The first link doesn't work, but you can easily get to <code>youtube-dl</code> from the first link. I certainly wouldn't try Leawo YouTube Downloader, but at least it's not so scammy that searching for the name of the project mostly returns results about how the project is some kind of badware or a scam, which is better than we got from Google or Bing. And we do get a recommendation with <code>yt-dlp</code>, with instructions in the results that's just a blog post from someone who wants to help people who are trying to download youtube videos.</p> <h4 id="kagi">Kagi</h4> <ul style="list-style: none;"> <li>1. That NortonSafeWeb youtube downloader site. Interacting with the site at all prompts you to install a browser extension and enable notifications. Trying to download any video gives you a full page pop-over for extension installation for something called CyberShield. There appears to be no way to dismiss the popover without clicking on something to try to install it</li> <li>2. Another link to that NortonSafeWeb youtube downloader site. 
For some reason, this one is tagged with "Dec 20, 2003", apparently indicating that the site is from Dec 20th 2003, although that's quite wrong.</li> <li>3. Some youtube downloader site. Selecting any video to download pushes you to a site with scam ads.</li> <li>4. Some youtube downloader site. Interacting with the site at all pops up multiple ads that link to scams and the page wants to enable notifications. A pop-up then appears on top of the ads that says "Ad removed" with a link for details. This is a scam link to another ad.</li> <li>5. Another link to the above site</li> <li>6-7. Under a subsection titled "Interesting Finds", there are links to two github repos. One is for transcribing youtube videos to text and the other is for using Google Takeout to backup photos from google photos or your own youtube channel</li> <li>8. Some youtube downloader site.</li> <li>9-13. Under a subsection titled "Blast from the Past", 4 irrelevant links and a link to youtube-dl's github page, but the 2010 version at archive.org</li> <li>14. SEO blogspam for youtube help. Has a link that's allegedly for a "Greasemonkey script for downloading YouTube videos", but the link just goes to a page with scammy ads</li> <li>15. Some software that charges you $5/mo to download videos from youtube</li> </ul> <h4 id="mwmbl">Mwmbl</h4> <ol> <li>Some youtube video downloader site, but one that no other search engine returned. There's a huge ad panel that displays &quot;503 NA - Service Deprecating&quot;. The download link does nothing except for pop up some other ad panes that then disappear, leaving just the 503 &quot;ad&quot;.</li> <li>$20 software for downloading youtube videos</li> <li>2016 blog post on how to install and use <code>youtube-dl</code>. Sidebar has two low quality ads which don't appear to be scams and the main body has two ads interspersed, making this extremely low on ads compared to analogous results we've seen from large search engines</li> <li>Some youtube video download site. Has a giant banner claiming that it's &quot;the only YouTube Downloader that is 100% ad-free and contains no popups.&quot;, which is probably not true, but the site does seem to be ad free and not have pop-ups. Download link seems to actually work.</li> <li>Youtube video on how to install and use <code>youtube-dlg</code> (a GUI wrapper for <code>youtube-dl</code>) on Linux (this query was run from a Mac).</li> <li>Link to what was a 2007 blogpost on how to download youtube videos, which automatically forwards to a 2020 ad-laden SEO blogspam listicle with bad suggestions. Article has two autoplay videos. Archive.org shows that the 2007 blog post had some reasonable options in it for the time, so this wasn't always a bad result.</li> <li>A blog post on a major site that's actually a sponsored post trying to get you to a particular video downloader. Searching for comments on this on reddit indicate that users view the app as a waste of money that doesn't work. The site is also full of scammy and misleading ads for other products. E.g., I tried clicking on an ad that purports to save you money on &quot;products&quot;. It loaded a fake &quot;checking your computer&quot; animation that supposedly checked my computer for compatibility with the extension and then another fake checking animation, after which I got a message saying that my computer is compatible and I'm eligible to save money. All I have to do is install this extension. Closing that window opens a new tab that reads &quot;Hold up! 
Do you actually not want automated savings at checkout&quot; with the options &quot;Yes, Get Coupons&quot; and &quot;No, Don't Save&quot;. Clicking &quot;No, Don't Save&quot; is actually an ad that takes you back to a link that tries to get you to install a chrome extension.</li> <li>That &quot;Norton Safe Web&quot; youtube downloader site, except that the link is wrong and is to the version of the site that purports to download instagram videos instead of the one that purports to download youtube videos.</li> <li>Link to Google help explaining how you can download youtube videos that you personally uploaded</li> <li>SEO blogspam. It immediately has a pop-over to get you to subscribe to their newsletter. Closing that gives you another pop-over with the options &quot;Subscribe&quot; and &quot;later&quot;. Clicking &quot;later&quot; does actually dismiss the 2nd pop-over. After closing the pop-overs, the article has instructions on how to install some software for windows. Searching for reviews of the software returns comments like &quot;This is a PUP/PUA that can download unwanted applications to your pc or even malicious applications.&quot;</li> </ol> <p>Basically the same as Google or Bing.</p> <h4 id="chatgpt">ChatGPT</h4> <p>Since ChatGPT expects more conversational queries, we'll use the prompt &quot;How can I download youtube videos?&quot;</p> <p>The first attempt, on a Monday at 10:38am PT returned &quot;Our systems are a bit busy at the moment, please take a break and try again soon.&quot;. The second attempt returned an answer saying that one should not download videos without paying for YouTube Premium, but if you want to, you can use third-party apps and websites. Following up with the question &quot;What are the best third-party apps and websites?&quot; returned another warning that you shouldn't use third-party apps and websites, followed by the <a href="https://nitter.net/CeciliaZin/status/1740109462319644905">ironic-for-GPT warning,</a></p> <blockquote> <p>I don't endorse or provide information on specific third-party apps or websites for downloading YouTube videos. It's essential to use caution and adhere to legal and ethical guidelines when it comes to online content.</p> </blockquote> <h3 id="ad-blocker">ad blocker</h3> <p>For our next query, we'll try &quot;ad blocker&quot;. We'd like to get <code>ublock origin</code>. Failing that, an ad blocker that, by default, blocks ads. Failing that, something that isn't a scam and also doesn't inject extra ads or its own ads. Although what's best may change at any given moment, comparisons I've seen that don't stack the deck have <a href="https://www.researchgate.net/figure/Performance-benchmark-when-using-an-adblocker-on-the-desktop-version-of-Top150_fig1_318330749">often seemed to show that ublock origin has the best or among the best performance</a>, and ublock origin is free and blocks ads.</p> <h4 id="google-1">Google</h4> <ol> <li>&quot;AdBlock — best ad blocker&quot;. Below the fold, notes &quot;AdBlock participates in the Acceptable Ads program, so unobtrusive ads are not blocked&quot;, so this doesn't block all ads.</li> <li>Adblock Plus | The world's #1 free ad blocker. Pages notes &quot;Acceptable Ads are allowed by default to support websites&quot;, so this also does not block all ads by default</li> <li>AdBlock. Page notes that &quot; Since 2015, we have participated in the Acceptable Ads program, where publishers agree to ensure their ads meet certain criteria. 
Ads that are deemed non-intrusive are shown by default to AdBlock users&quot;, so this doesn't block all ads</li> <li>&quot;Adblock Plus - free ad blocker&quot;, same as (2), doesn't block all ads</li> <li>&quot;AdGuard — World's most advanced adblocker!&quot; Page tries to sell you on some kind of paid software, &quot;AdGuard for Mac&quot;. Searching for AdGuard turns up a post <a href="https://www.reddit.com/r/Adguard/comments/16whmsy/is_there_an_adblocker_for_adguards_adblocker_that/">from this person looking for an ad blocker that blocks ads injected by AdGuard</a>. It seems that you can download it for free, but then, if you don't subscribe, they give you more ads?</li> <li>&quot;AdBlock Pro&quot; on the Safari store; has in-app purchases. It looks like you have to pay to unlock features like blocking videos</li> <li>[YouTube] &quot;How youtube is handling the adblock backlash&quot;. 30 second video with a 15 second ad before the video. Video has no actual content</li> <li>[YouTube] &quot;My thoughts on the youtube adblocker drama&quot;</li> <li>[YouTube] &quot;How to Block Ads online in Google Chrome for FREE [2023]&quot;; first comment on video is &quot;your video doesnt [sic] tell how to stop Youtube adds [sic]&quot;. In the video, a person rambles for a bit and then googles <code>ad blocker extension</code> and then clicks the first link (same as our first link), saying, &quot;If I can go ahead and go to my first website right here, so it's basically officially from Google .... [after installing, as a payment screen pops up asking you to pay $30 or a monthly or annual fee]&quot;</li> <li>&quot;AdBlock for Mobile&quot; on the App Store. It's rated 3.2* on the iOS store. Lots of reviews indicate that it doesn't really work</li> <li>MalwareBytes ad blocker. A quick search indicates that it doesn't block all ads (unclear if that's deliberate or due to bugs)</li> <li>&quot;Block ads in Chrome | AdGuard ad blocker&quot;, same as (5)</li> <li>[ad] NordVPN</li> <li>[ad] &quot;#1 Best Free Ad Blocker (2024) - 100% Free Ad Blocker.&quot; Immediately seems scammy in that it has a fake year (this query was run in mid-November 2023). This is for something called TOTAL Ad Block. <a href="https://www.reddit.com/r/Adblock/comments/1412m7l/total_adblock_peoples_experiencesopinions/">Searching for TOTAL Ad Block turns up results indicating that it's a scammy app that doesn't let you unsubscribe and basically tries to steal your money</a></li> <li>[ad] 100% Free &amp; Easy Download - Automatic Ad Blocker. Actually for Avast browser and not an ad blocker. <a href="https://palant.info/2020/01/13/pwning-avast-secure-browser-for-fun-and-profit/">A quick search shows that this browser has a history of being less secure than just running chromium</a> and that <a href="https://palant.info/2019/10/28/avast-online-security-and-avast-secure-browser-are-spying-on-you/">it collects an unusually large amount of information from users</a>.</li> </ol> <p>No links to ublock origin. Some links to scams, though not nearly as many as when trying to get a youtube downloader. Lots of links to ad blockers that deliberately only block some ads by default.</p> <h4 id="bing-1">Bing</h4> <ul style="list-style: none;"> <li>1. [ad] "Automatic Ad Blocker | 100% Free & Easy Download".
[link is actually to avast secure browser, so an entire browser and not an ad blocker; from a quick search, this appears to be a wrapper around chromium that [has a history of being less secure than just running chromium](https://palant.info/2020/01/13/pwning-avast-secure-browser-for-fun-and-profit/) [which collects an unusually large amount of information from users](https://palant.info/2019/10/28/avast-online-security-and-avast-secure-browser-are-spying-on-you/)].</li> <li>2. [ad] "#1 Best Free Ad Blocker (2023) | 100% Free Ad Blocker". Has a pop-over nag window when you mouse over to the URL bar asking you to install it instead of navigating away. Something called TOTAL ad block. Apparently tries to get to sign up for a subscription [and then makes it very difficult to unsubscribe](https://www.reddit.com/r/Adblock/comments/1412m7l/total_adblock_peoples_experiencesopinions/) (apparently, you can't cancel without a phone call, and when you call and tell them to cancel, they still won't do it unless you threaten to issue a chargeback or block the payment from the bank)</li> <li>3. [ad] "Best Ad Blocker (2023) | 100% Free Ad Blocker". Seems to be a fake review site that reviews various ad blockers; ublock origin is listed as #5 with 3.5 stars. TOTAL ad block is listed as #1 with 5 stars, is the only 5 stars ad blocker, has a banner that shows that it's the "#1 Free Ad Blocker", is award winning, etc. <br> If you then click the link to ublock origin, it takes you to a page that "shows" that ublock origin has 0 stars on trustpilot. There are multiple big buttons that say "click to start blocking ads" that try to get you to install TOTAL ad block. In the bottom right, in what looks like an ad slot, there's an image that says "visit site" for ublock origin. The link doesn't take you to ublock origin and instead takes you a site for [the fake ublock origin](https://www.reddit.com/r/ublock/comments/32mos6/ublock_vs_ublock_origin/). <li>4. [ad] "AVG Free Antivirus 2023 | 100% Free, Secure Download". This at least doesn't pretend to be an ad blocker of any kind.</li> <li>5. [Explore content from adblockplus.org] A link to the adblock plus blog.</li> <li>6. [Explore content from adblockplus.org] A link to a list of adblock plus features.</li> <li>7. "Adblock Plus | The world's #1 free ad blocker". </li> <li>8-13. Sublinks to various pages on the Adblock Plus site.</li> </ul> <p>We're now three screens down from the result, so the equivalent of the above google results is just a bunch of ads and then links to one website. The note that something is an ad is much more subtle than I've seen on any other site. Given <a href="https://twitter.com/danluu/status/823624273516335104">what we know about when users confuse ads with organic search results</a>, it's likely that most users don't realize that the top results are ads and think that the links to scam ad blockers or the fake review site that tries to funnel you into installing a scam ad blocker are organic search results.</p> <h4 id="marginalia-1">Marginalia</h4> <ol> <li>&quot;Is ad-blocker software permissible?&quot; from judaism.stackexchange.com</li> <li>Blogspam for Ghosterty. Ghostery's pricing page notes that you have to pay for &quot;No Private Sponsored Links&quot;, so it seems like some features are behind a pay wall. 
Wikipedia says &quot;Since July 2018, with version 8.2, Ghostery shows advertisements of its own to users&quot;, but it seems like this might be opt-in?</li> <li><a href="https://shouldiblockads.com/">https://shouldiblockads.com/</a>. Explains why you might want to block ads. First recommendation is ublock origin</li> <li>&quot;What’s the best ad blocker for you? - Firefox Add-ons Blog&quot;. First recommendation is ublock origin. Also provides what appears to be accurate information about other ad blockers.</li> <li>Blog post that's a personal account of why someone installed an ad blocker.</li> <li>Opera (browser).</li> <li>Blog post, anti-anti-adblocker polemic.</li> <li>ublock origin.</li> <li>Fairphone forum discussion on whether or not one should install an ad blocker.</li> <li>SEO site blogspam (as in, the site is an SEO optimization site and this is blogspam designed to generate backlinks and funnel traffic to the site).</li> </ol> <p>Probably the best result we've seen so far, in that the third and fourth results suggest ublock origin and the first result is very clearly not an ad blocker. It's unfortunate that the second result is blogspam for Ghostery, but this is still better than we see from Google and Bing.</p> <h4 id="mwmbl-1">Mwmbl</h4> <ol> <li>A bitly link to a &quot;thinkpiece&quot; on ad blocking from a VC thought leader.</li> <li>A link to cryptojackingtest, which forwards to Opera (the browser).</li> <li>A link to ghostery.</li> <li>Another link to ghostery.</li> <li>A link to something called 1blocker, which appears to be a paid ad blocker. Searching for reviews turns up comments like &quot;I did 1blocker free trial and forgot to cancel so it signed me up for annual for $20 [sic]&quot; (but comments indicate that the ad blocker does work).</li> <li>Blogspam for Ad Guard. There's a banner ad offering 40% off this ad blocker.</li> <li>An extremely ad-laden site that appears to be in the search results because it contains the text &quot;ad blocker detected&quot; if you use an ad blocker (I don't see this text on loading the page, but it's in the page preview on Mwmbl). The first page is literally just ads with a &quot;read more&quot; button. Clicking &quot;read more&quot; takes you to a different page that's full of ads that also has the cartoon, which is the &quot;content&quot;.</li> <li>Another site that appears to be in the search results because it contains the text &quot;ad blocker detected&quot;.</li> <li>Malwarebytes ad blocker, which doesn't appear to work.</li> <li>HN comments for article on youtube ad blocker crackdown. Scrolling to the 41st comment returns a recommendation for ublock origin.</li> </ol> <p>Mwmbl lets users suggest results, so I tried signing up to add ublock origin. Gmail put the sign-up email into my spam folder. After adding ublock origin to the search results, it's now the #1 result for &quot;ad blocker&quot; when I search logged out, from an incognito window and all other results are pushed down by one. As mentioned above, the score for Mwmbl is from before I edited the search results and not after.</p> <h4 id="kagi-1">Kagi</h4> <ul style="list-style: none;"> <li>1. "Adblock Plus | The world's #1 free ad blocker".</li> <li>2-11. Sublinks to other pages on the Adblock Plus website.</li> <li>12. "AdBlock — best ad blocker".</li> <li>13. "Adblock Plus - free ad blocker".</li> <li>14. "YouTube’s Ad Blocker Crackdown", a blog post that quotes and links to discussions of people talking about the titular topic.</li> <li>15-18. 
Under a section titled "Interesting Finds", three articles about youtube's crackdown on ad blockers. One has a full page pop-over trying to get you to install TOTAL Adblock with "Close" and "Open" buttons. The "Close" button does nothing and clicking any link or the open button takes to a page advertising TOTAL adblock. There appears to be no way to dismiss the ad and read the actual article without doing something like going to developer tools and deleting the ad elements. The fourth article is titled "The FBI now recommends using an ad blocker when searching the web" and 100% of the above the fold content is the header plus a giant ad. Scrolling down, there are a lot more ads.</li> <li>19. "AdBlock".</li> <li>20. Another link from the Adblock site, "Ad Blocker for Chrome - Download and Install AdBlock for Chrome Now!".</li> <li>21-25. Under a section titled "Blast from the Past", optimal.com ad blocker, a medium article on how to subvert adblock, a blog post from a Mozillan titled "Why Ad Blockers Work" that's a response to Ars Technica's "Why Ad Blocking is devastating to the sites you love", "Why You Need a Network-Wide Ad-Blocker (Part 1)", and "A Popular Ad Blocker Also Helps the Ad Industry", subtitled "Millions of people use the tool Ghostery to block online tracking technology—some may not realize that it feeds data to the ad industry."</li> </ul> <p>Similar quality to Google and Bing. Maybe halfway in between in terms of the number of links to scams.</p> <h4 id="chatgpt-1">ChatGPT</h4> <p>Here, we tried the prompt. <code>How do I install the best ad blocker?</code></p> <p>First suggestion is ublock origin. Second suggestion is adblock plus. This seems like the best result by a significant margin.</p> <h3 id="download-firefox">download firefox</h3> <h4 id="google-2">Google</h4> <ul style="list-style: none;"> <li>1-6. Links to download firefox.</li> <li>7. Blogspam for firefox download with ads trying to trick you into installing badware.</li> <li>8-9. Links to download firefox.</li> <li>10 [ad] Some kind of shady site that claims to have firefox downloads, but where the downloads take you to other sites that try to get you to sign up for an account where they ask for personal information and your credit card number. Also pops up pop-over with window that does the above if you try to actually download firefox. At least one of the sites is some kind of gambling site, so this site might make money off of referring people to gambling sites?</li> </ul> <p>Mostly good links, but 2 out of the top 10 links are scams. And <a href="https://twitter.com/danluu/status/822739582177288192">we didn't have a repeat of this situation I saw in 2017, where Google paid to get ranked above Firefox in a search for Firefox</a>. For search queries where almost every search engine returns a lot of scams, I might rate having 2 out of the top 10 links be scams as &quot;Ok&quot; or perhaps even better but, here, where most search engines return no fake or scam links, I'm rating this as &quot;Bad&quot;. You could make a case for &quot;Ok&quot; or &quot;Good&quot; here by saying that the vast majority of users will click one of the top links and never get as far as the 7th link, but I think that if Google is confident enough that's the case that they view it as unproblematic that the 7th and 10th links are scams, they should just only serve up the top links.</p> <h4 id="bing-2">Bing</h4> <ul style="list-style: none;"> <li>1-12. Links to download firefox or closely related links.</li> <li>13. 
[ad] Avast browser.</li> </ul> <p>That's the entire first page. Seems pretty good. Nothing that looks like a scam.</p> <h4 id="marginalia-2">Marginalia</h4> <ul style="list-style: none;"> <li>1. "Is it better to download Firefox from the website or use the package manager?" on the UNIX stackexchange</li> <li>2-9. Various links related to firefox, but not firefox downloads</li> <li>10. "Internet Download Accelerator online help"</li> </ul> <p>Definitely worse than Bing, since none of the links are to download Firefox. Depending on how highly you rate users not getting scammed vs. having the exact right link, this might be better or worse than Google. In this post, scams are weighted relatively heavily, so Marginalia ranks above Google here.</p> <h4 id="mwmbl-2">Mwmbl</h4> <ul style="list-style: none;"> <li>1-7. Links to download firefox.</li> <li>8. A link to a tumblr that has nothing to do with firefox. The title of the tumblr is "Love yourself, download firefox" (that's the title of the entire blog, not a particular blog post).</li> <li>9. Link to download firefox nightly.</li> <li>10. Extremely shady link that allegedly downloads firefox. Attempting to download the shady firefox pops up an ad that tries to trick you into downloading Opera. I did not run either the Opera or Firefox binaries to see if they're legitimate.</li> </ul> <h4 id="kagi-com">kagi.com</h4> <ul style="list-style: none;"> <li>1-3. Links to download firefox.</li> <li>4-5. Under a heading titled "Interesting finds", a 404'd link to a tweet titled "What happens if you try to download and install Firefox on Windows" [which used to note that downloading Firefox on windows results in an OS-level pop-up that recommends Edge instead "to protect your pc"](https://web.archive.org/web/20220403104257/https://twitter.com/plexus/status/1510568329303445507) and some extremely ad-laden article (though, to its credit, the ads don't seem to be scam ads).</li> <li>6. Link to download firefox.</li> <li>7-10. 3 links to download very old versions of firefox, and a blog post about some kind of collaboration between firefox and ebay.</li> <li>11. Mozilla homepage.</li> <li>12. Link to download firefox.</li> </ul> <p>Maybe halfway in between Bing and Marginalia. No scams, but a lot of irrelevant links. Unlike some of the larger search engines, these links are almost all to download the wrong version of firefox, e.g., I'm on a Mac and almost all of the links are for windows downloads.</p> <h4 id="chatgpt-2">ChatGPT</h4> <p>The prompt &quot;How do I download firefox?&quot; returned technically incorrect instructions on how to download firefox. The instructions did start with going to the correct site, at which point I think users are likely to be able to download firefox by looking at the site and ignoring the instructions. Seems vaguely similar to marginalia, in that you can get to a download by clicking some links, but it's not exactly the right result. However, I think users are almost certain to find the correct steps with ChatGPT and only likely to with Marginalia, so ChatGPT is rated more highly than Marginalia for this query.</p> <h3 id="why-do-wider-tires-have-better-grip">Why do wider tires have better grip?</h3> <p>Any explanation that's correct must, at a minimum, be consistent with the following:</p> <ul> <li>Assuming a baseline of a moderately wide tire for the wheel size.
<ul> <li>Scaling both the tire and the wheel to make both wider than the OEM tire (but still running a setup that fits in the car without serious modifications) generally gives better dry braking and better lap times.</li> <li>In wet conditions, wider setups often have better braking distances (though this depends a lot on the specific setup) and better lap times, but also aquaplane at lower speeds.</li> <li>Just increasing the wheel width and using the same tire generally gives you better lap times, within reason.</li> <li>Just increasing the tire width and leaving wheel width fixed generally results in worse lap times.</li> </ul></li> <li>Why tire pressure changes have the impact that they do (I'm not going to define terms in these bullets; if this text doesn't make sense to you, that's ok). <ul> <li>At small slip angles, increasing tire pressure results in increased lateral force.</li> <li>In general, lowering tire pressure increases the effective friction coefficient (within a semi-reasonable range).</li> </ul></li> </ul> <p>This is one that has a lot of standard incorrect or incomplete answers, including:</p> <ul> <li>Wider tires give you more grip because you get more surface area. <ul> <li>Wider tires don't, at reasonable tire pressure, give you significantly more surface area.</li> </ul></li> <li>Wider tires actually don't give you more grip because friction is surface area times a constant and surface area is mediated by air pressure. <ul> <li>It's easily empirically observed that wider tires do, in fact, give you better handling and braking.</li> </ul></li> <li>Wider tires let you use a softer compound, so the real reason wider tires give you more grip is via the softer compound. <ul> <li>This could be part of an explanation, but I've generally seen this cited as the only explanation. However, wider tires give you more grip independent of having a softer compound. You can even observe this with the same tire by mounting the exact same tire on a wider wheel (within reason).</li> </ul></li> <li>The shape of the contact patch when the tire is wider gives you better lateral grip due to [some mumbo jumbo], e.g., &quot;tire load sensitivity&quot; or &quot;dynamic load&quot;. <ul> <li>Ok, perhaps, but what's the mechanism that gives wider tires more grip when braking? And also, please explain the mumbo jumbo. For my goal of understanding why this happens, if you just use some word but don't explain the mechanism, this isn't fundamentally different than saying that wider tires have better grip due to magic. <ul> <li>When there's some kind of explanation of the mumbo jumbo, there will often be an explanation that only applies to one aspect of increased grip, e.g., the explanation will really only apply to lateral grip and not explain why braking distances are decreased.</li> </ul></li> </ul></li> </ul> <h4 id="google-3">Google</h4> <ul style="list-style: none;"> <li>1. A "knowledge card" that says "Bigger tires provide a wider contact area that optimizes their performance and traction.", which explains nothing. On clicking the link, it's SEO blogspam with many [incorrect statements, such as "Are wider tires better for snow traction? Or are narrow tires more reliable in the winter months? The simple answer is narrow tires!](https://mastodon.social/@danluu/111441790762754806) Tires with a smaller section width provide more grip in winter conditions. They place higher surface pressure against the road they are being driven on, enabling its snow and ice traction"</li> <li>2.
[Question dropdown] "do wider tires give you more grip?", which correctly says "On a dry road, wider tires will offer more grip than narrow ones, but the risk of aquaplaning will be higher with wide tires.". On clicking the link, there's no explanation of why, let alone an answer to the question we're asking</li> <li>3. [Question dropdown] "Do bigger tires give you better traction?", which says "What Difference Does The Wheel Size Make? Larger wheels offer better traction, and because they have more rubber on the tire, this also means a better grip on the road", which has a nonsensical explanation of why. On clicking the link, the link appears to be talking about wheel diameter and is not only wrong, but actually answering the wrong question.</li> <li>4. [Question dropdown] "Why do wider tires have more grip physics?", which then has some of the standard incorrect explanations.</li> <li>5. "Do wider wheels improve handling?", which says "Wider wheels and wider tires will also lower your steering friction coefficient". On clicking the link, there's no explanation of why nor is there an answer to the question we're asking.</li> <li>6. "What are the disadvantages of wider tires?", which says "Harder Handling & Steering". On clicking the link, there are multiple incorrect statements and no explanation of why.</li> <li>7. "Would wider tires increase friction?", which says "Force can be stated as Pressure X Area. For a wide tire, the area is large but the force per unit area is small and vice versa. The force of friction is therefore the same whether the tire is wide or not.". Can't load the page due to a 502 error and the page isn't in archive.org, but this seems fine since the page appears to be wrong</li> <li>8. "What is the advantage of 20 inch wheels over 18 inch wheels?" Answers a different question. On clicking the link, it's low quality SEO blogspam.</li> <li>9. "Why do race cars have wide tires?", which says "Wider tires provide more resistance to slippery spots or grit on the road. Race tracks have gravel, dust, rubber beads and oil on them in spots that limit traction. By covering a larger width, the tires can handle small problems like that better. Wider tires have improved wear characteristics.". Perhaps technically correct, but fundamentally not the answer and highly misleading at best.</li> <li>10-49. Other question dropdowns that are wrong. Usually both wrong and answering the wrong question, but sometimes giving a wrong answer to the right question and sometimes giving the right answer to the wrong question. I am just now realizing that clicking question dropdowns give you more question dropdowns.</li> <li>50. "Why do wider tires get more grip? : r/cars". The person asks the question I'm asking, concluding with "This feels like a really dumb question because wider tires=more grip just seems intuitive, but I don't know the answer.". The top answer is total nonsense "The smaller surface area has more pressure but the same normal force as a larger surface area. If you distribute the same load across more area, each square inch of tire will have less force it's responsible for holding, and thus is less likely to be overcome by the force from the engine". The #2 answer is a classic reddit answer, "Yeah, take your science bs and throw it out the window.". The #3 answer has a vaguely plausible sounding answer to why wider tires have better lateral grip, but it's still misleading. 
Like many of the answers, the answer emphasizes how wider tires give you better lateral grip and has a lengthy explanation for why this should be the case, but wider tires also give you shorter braking distances and the provided explanation cannot explain why wider tires have shorter braking distances so must be missing a significant part of the puzzle. Anyway, none of the rest of the answers really even attempt to explain why</li> <li>51-54. Other reddit answers bunched with this one, which also don't answer the question, although one of them links to https://www.brachengineering.com/content/publications/Wheel-Slip-Model-2006-Brach-Engineering.pdf, which has some good content, though it doesn't answer the question.</li> <li>55. SEO blogspam for someone's youtube video; video doesn't answer the question.</li> <li>56. Extremely ad-laden site with popovers that try to trick you into clicking on ads, etc.; has text I've seen on other pages that's been copied over to make an SEO ad farm (and the text has answers that are incorrect)</li> </ul> <h4 id="bing-3">Bing</h4> <ul style="list-style: none;"> <li>1. Knowledge card which incorrectly states "Larger contact patch with the ground."</li> <li>2-4. Carousel where none of the links answer the question correctly. (3) from bing is (50) from google search results. (2) isn't wrong, but also doesn't answer the question. (3) is SEO blogspam for someone else's youtube video (same link as google.com 55). The video does not answer the question. (3) and (4) are literally the same link and also don't answer the question</li> <li>5. "This is why wider tires equals more grip". SEO blogspam for someone else's youtube video. The youtube video does not answer the question.</li> <li>6-10. [EXPLORE FURTHER] results. (6) is blatantly wrong, (7) is the same link as (3) and (4), (8) is (2), SEO blogspam for someone else's youtube video and the video doesn't answer the question, (9) is s SEO blogspam for someone else's youtube video and the video doesn't answer the question, (10) is generic SEO blogspam with lots of incorrect information</li> <li>11. Same link as (2) and (8), still SEO blogspam for someone else's youtube video and the video doesn't answer the question</li> <li>12-13 [EXPLORE FURTHER] results. (12) is some kind of SEO ad farm that tries to get you to make "fake" ad clicks (there are full screen popovers that, if you click them, cause you to click through some kind of ad to some normal site, giving revenue to whoever set up the ad farm). (13) is the website of the person who made one of the two videos that's a common target for SEO blogspam on this topic. It doesn't answer the question, but at least we have the actual source here. </li> </ul> <p>From skimming further, many of the other links are the same links as above. No link appears to answer the question.</p> <h4 id="marginalia-3">Marginalia</h4> <p>Original query returns zero results. Removing the question mark returns one single result, which is the same as (3) and (4) from bing.</p> <h4 id="mwmbl-3">Mwmbl</h4> <ol> <li>NYT article titled &quot;Why Women Pay Higher Interest&quot;. 
This is the only returned result.</li> </ol> <p>Removing the question mark returns an article about bike tires titled &quot;Fat Tires During the Winter: What You Need to Know&quot;</p> <h4 id="kagi-2">Kagi</h4> <ol> <li>A knowledge card that incorrectly reads &quot;wider tire has a greater contact patch with the ground, so can provide traction.&quot;</li> <li>(50) from google</li> <li>Reddit question with many incorrect answers</li> <li>Reddit question with many incorrect answers. Top answer is &quot;The same reason that pressing your hand on the desk and sliding it takes more effort than doing the same with a finger. More rubber on the road = more friction&quot;.</li> <li>(3) and (4) from bing</li> <li>Youtube video titled &quot;Do wider tyres give you more grip?&quot;. Clicking the video gives you 1:30 in ads before the video plays. The video is good, but it answers the question in the title of the video and not the question being asked of why this is the case. The first ad appears to be an ad revenue scam. The first link actually takes you to a second link, where any click takes you through some ad's referral link to a product.</li> <li>&quot;This is why wider tires equals more grip&quot;. SEO blogspam for (6)</li> <li>SEO blogspam for another youtube video</li> <li>SEO blogspam for (6)</li> <li>Quora answer where top answer doesn't answer the question and I can't read all of the answers because I'm not logged in or aren't a premium member or something.</li> <li>Google (56), stolen text from other sites and a site that has popovers that try to trick you into clicking ads</li> <li>Pre-chat GPT nonsense text and a page that's full of ads. Unusually, the few ads that I clicked on seemed to be normal ads and not scams.</li> <li>Blogspam for ad farm that has pop-overs that try to get you to install badware.</li> <li>Page with ChatGPT-sounding nonsense. Has a &quot;Last updated&quot; timestamp that's sever-side generated to match the exact moment you navigated to the page. Page tries to trick you into clicking on ads with full-page popover. Ads don't seem to be scams, as far as I can tell.</li> <li>Page which incorrectly states &quot;In summary, a wider tire does not give better traction, it is the same traction similar to a more narrow tire.&quot;. Has some ads that get you to try to install badware.</li> </ol> <h4 id="chatgpt-3">ChatGPT</h4> <p>Provides a list of &quot;hallucinated&quot; reasons. The list of reasons has better grammar than most web search results, but still incorrect. It's not surprising that ChatGPT can't answer this question, since it often falls over on questions that are both easier to reason about and where the training data will contain many copies of the correct answer, e.g., <a href="https://www.threads.net/@jossfong/post/C1fJe-mJRHx">Joss Fong noted that, when her niece asked ChatGPT about gravity</a>, the response was nonsense: &quot;... That's why a feather floats down slowly but a rock drops quickly — the Earth is pulling them both, but the rock gets pulled harder because it's heavier.&quot;</p> <p>Overall, no search engine gives correct answers. 
Marginalia seems to be the best here in that it gives only a couple of links to wrong answers and no links to scams.</p> <h3 id="why-do-they-keep-making-cpu-transistors-smaller">Why do they keep making cpu transistors smaller?</h3> <p>I had this question when I was in high school and my AP physics teacher explained to me that it was because making the transistors smaller allowed the CPU to be smaller, which let you make the whole computer smaller. Even at age 14, I could see that this was an absurd answer, not really different than today's ChatGPT hallucinations — at the time, computers tended to be much larger than they are now, and full of huge amounts of empty space, with the CPU taking up basically no space relative to the amount of space in the box and, on top of that, CPUs were actually getting bigger and not smaller as computers were getting smaller. I asked some other people and didn't really get an answer. This was also relatively early in the life of the public web and I wasn't able to find an answer other than something like &quot;smaller transistors are faster&quot; or &quot;smaller = less capacitance&quot;. But why are they faster? And what makes them have less capacitance? Specifically, what about the geometry causes that to scale so that transistors get faster? It's not, in general, obvious that things should get faster if you shrink them, e.g., if you naively linearly shrink a wire, it doesn't appear that it should get faster at all: the cross sectional area is reduced quadratically, increasing resistance per unit length quadratically, but length is also reduced linearly, so total resistance only increases linearly. And then capacitance also decreases linearly, so it all cancels out. Anyway, for transistors, it turns out the same kind of straightforward scaling logic shows that they speed up (back then, transistors were large enough and wire delay was relatively small enough that you got extremely large increases in performance from shrinking transistors). You could explain this to a high school student who's taken physics in a few minutes if <a href="https://twitter.com/danluu/status/1147984717238562816">you had the right explanation</a>, but I couldn't find an answer to this question until I read a VLSI textbook.</p>
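<p>To make the wire-scaling arithmetic above concrete, here's a minimal back-of-the-envelope sketch in Python (my own illustration with made-up unit values, not something from any of the search results below): shrink every dimension of a wire by a factor k and the resistance goes up by k, the capacitance goes down by k, and the RC delay is unchanged.</p> <pre><code>def wire_rc_delay(length, width, height, spacing, resistivity=1.0, permittivity=1.0):
    # Crude parallel-plate-style model of a wire over a ground plane:
    # R = rho * length / (width * height), C = eps * length * width / spacing.
    resistance = resistivity * length / (width * height)
    capacitance = permittivity * length * width / spacing
    return resistance * capacitance

k = 2.0  # shrink every dimension by 2x
base = wire_rc_delay(length=1.0, width=1.0, height=1.0, spacing=1.0)
shrunk = wire_rc_delay(length=1.0 / k, width=1.0 / k, height=1.0 / k, spacing=1.0 / k)

# Resistance goes up by k and capacitance goes down by k, so the RC product
# is unchanged: naively, the shrunk wire is no faster than the original.
print(base, shrunk)  # both print 1.0
</code></pre> <p>The analogous exercise for transistors is the one that actually shows the speedup, and it's that explanation that the results below fail to give.</p>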
<p>There's now enough content on the web that there must be multiple good explanations out there. Just to check, I used non-naive search terms to find some good results. Let's look at what happens when you use the naive search from above, though.</p> <h4 id="google-4">Google</h4> <ul style="list-style: none;"> <li>1. A knowledge card that reads "Smaller transistors can do more calculations without overheating, which makes them more power efficient.", which isn't exactly wrong but also isn't what I'd consider an answer of why. The article is interesting, but is about another topic and doesn't explain why.</li> <li>2. [Question dropdown], "Why are transistors getting smaller?". Site has an immediate ad pop-over on opening. Site doesn't really answer the question, saying "Since the first integrated circuit was built in the 1950s, silicon transistors have shrunk following Moore’s law, helping pack more of these devices onto microchips to boost their computing power."</li> <li>3. [Question dropdown] "Why do transistors need to be small?". Answer is "The capacitance between two conductors is a function of their physical size: smaller dimensions mean smaller capacitances. And because smaller capacitances mean higher speed as well as lower power, smaller transistors can be run at higher clock frequencies and dissipate less heat while doing so", which isn't wrong, but the site doesn't explain the scaling that made things faster as transistors got smaller. The page mostly seems concerned with discrete components and notes that "In general, passive components like resistors, capacitors and inductors don’t become much better when you make them smaller: in many ways, they become worse. Miniaturizing these components is therefore done mainly just to be able to squeeze them into a smaller volume, and thereby saving PCB space.", so it's really answering a different question</li> <li>4. [Question dropdown], "Why microchips are getting smaller?". SEO blogspam that doesn't answer the question other than saying stuff like "smaller is faster"</li> <li>5. [Question dropdown], "Why are microprocessors getting smaller?". Link is to stackexchange. The top answer is that yield is better and cost goes down when chips are smaller, which I consider a non-answer, in that it's also extremely expensive to make things smaller, so what explains why the cost reduction is there? And, also, even if the cost didn't go down, companies would still want smaller transistors for performance reasons, so this misses a major reason and arguably the main reason. The #2 answer actually sort of explains it, "The reason for this is that as the transistor gate gets smaller, threshold voltage and gate capacitance (required drive current) gets lower.", but is both missing parts of the explanation and doesn't provide the nice, intuitive, physical explanation for why this is the case. Other answers are non-answers like "The CORE reason why CPUs keep getting smaller is simply that, in computing, smaller is more powerful:". It's possible to get to a real explanation by searching for these terms.</li> <li>6. "Why are CPU and GPU manufacturers trying to make ...". Top answer is the non-answer of "Smaller transistors are faster and use less power. Small is good." and since it's quora and I'm not a subscriber, the other answers are obscured by a screen that suggests I start a free trial to "access this answer and support the author as a Quora+ subscriber".</li> <li>7-10. Sub-links to other quora answers. Since I'm not a subscriber, by screen real estate, most of the content is ads. None of the content I could read answered the question.</li> </ul> <h4 id="bing-4">Bing</h4> <ul style="list-style: none;"> <li>1. Knowledge card with multiple parts. First parts have some mumbo jumbo, but the last part contains a partial answer. If you click on the last part of the answer, it takes you to a stack exchange question that has more detail on the partial answer. There's enough information in the partial answer to do a search and then find a more complete explanation.</li> <li>2-4. [people also ask] Some answers that are sort of related, but don't directly answer the question</li> <li>5. Stack exchange answer for a different question.</li> <li>7-10. [explore further] Answers to totally unrelated questions, except for 10, which is extremely ad-laden blogspam for a related question that has a bunch of semi-related text with many ads interspersed between the text.</li> </ul> <h4 id="kagi-3">Kagi</h4> <ul style="list-style: none;"> <li>1. "Why does it take multiple years to develop smaller transistors for CPUs and GPUs?", on r/askscience. Some ok comments, but they answer a different question.</li> <li>2-5.
Other reddit links that don't answer the question. Some of them are people asking this question, but the answers are wrong. Some of the links answer different questions and have quite good answers to those questions.</li> <li>6. Stackexchange question that has incorrect and misleading answers.</li> <li>7. Stackexchange question, but a different question.</li> <li>8. Quora question. The answers I can read without being a member don't really answer the question.</li> <li>9. Quora question. The answers I can read without being a member don't really answer the question.</li> <li>10. Metafilter question from 2006. The first answers are fundamentally wrong, but one of the later answers links to the wikipedia page on MOSFET. Unfortunately, the link is to the now-removed anchor #MOSFET_scaling. There's still a scaling section which has a poor explanation. There's also a link to the page on Dennard Scaling, which is technically correct but has a very poor explanation. However, someone could search for more information using these terms and get correct information.</li> </ul> <h4 id="marginalia-4">Marginalia</h4> <p>No results</p> <h4 id="mwmbl-4">Mwmbl</h4> <ol> <li>A link to a Vox article titled &quot;Why do artists keep making holiday albums?&quot;. This is the only result.</li> </ol> <h4 id="chatgpt-4">ChatGPT</h4> <p>Has non-answers like &quot;increase performance&quot;. Asking ChatGPT to expand on this, with &quot;Please explain the increased performance.&quot; results in more non-answers as well as fairly misleading answers, such as</p> <blockquote> <p>Shorter Interconnects: Smaller transistors result in shorter distances between them. Shorter interconnects lead to lower resistance and capacitance, reducing the time it takes for signals to travel between transistors. Faster signal propagation enhances the overall speed and efficiency of the integrated circuit ... The reduced time it takes for signals to travel between transistors, combined with lower power consumption, allows for higher clock frequencies</p> </blockquote> <p>I could see this seeming plausible to someone with no knowledge of electrical engineering, but this isn't too different from ChatGPT's explanation of gravity, &quot;... That's why a feather floats down slowly but a rock drops quickly — the Earth is pulling them both, but the rock gets pulled harder because it's heavier.&quot;</p> <h3 id="vancouver-snow-forecast-winter-2023">vancouver snow forecast winter 2023</h3> <p>Good result: Environment Canada's snow forecast, predicting significantly below normal snow (and above normal temperatures)</p> <h4 id="google-5">Google</h4> <ol> <li>Knowledge card from a local snow removal company, incorrectly stating &quot;The forecast for the 2023/2024 season suggests that we can expect another winter marked by ample snowfall and temperatures hovering both slightly above and below the freezing mark. Be prepared ahead of time.&quot;. On opening the page, we see that the next sentence is &quot;Have Alblaster [the name of the company] ready to handle your snow removal and salting. 
We have a proactive approach to winter weather so that you, your staff and your customers need not concern yourself with the approaching storms.&quot; and the goal of the link is to get you to buy snow removal services regardless of their necessity by writing a fake forecast.</li> <li>[question dropdown] &quot;What is the winter prediction for Vancouver 2023?&quot;, incorrectly saying that it will be &quot;quite snowy&quot;.</li> <li>[question dropdown] &quot;What kind of winter is predicted for 2023 Canada?&quot; Links to a forecast of Ontario's winter, so not only the wrong province, but the wrong coast, and also not actually an answer to the question in the dropdown.</li> <li>[question dropdown] &quot;What is the winter prediction for B.C. in 2023 2024?&quot; Predicts that B.C. will have a wet and mild winter, which isn't wrong, but doesn't really answer the question.</li> <li>[question dropdown] &quot;What is the prediction for 2023 2024 winter?&quot; Has a prediction for U.S. weather</li> <li>Blogspam article that has a lot of pointless text with ads all over. Text is contradictory in various ways and doesn't answer the question. Has a huge pop-over ad that covers the top half of the page</li> <li>Another blogspam article from the same source. Lots of ads; doesn't answer the question</li> <li>Ad-laden article that answers some related questions, but not this question</li> <li>Extremely ad-laden article that's almost unreadable due to the number of ads. Talks a lot about El Nino. Eventually notes that we should see below-normal snow in B.C. due to El Nino, but B.C. is almost 1M km² and the forecast is not the same for all of B.C., so you could maybe hope that the comment about B.C. here applies to Vancouver, but this link only lets you guess at the answer</li> <li>Very ad-laden article, but does have a map that's labeled &quot;winter precipitation&quot;, which appears to be about snow and not rain. The map seems quite different from Environment Canada's map, but it does show reduced &quot;winter precipitation&quot; over Vancouver, so you might conclude the right thing from this map.</li> </ol> <h4 id="bing-5">Bing</h4> <ul style="list-style: none;"> <li>1-4. [news carousel] Extremely ad-laden articles that don't answer the question. Multiple articles are well over half ads by page area.</li> <li>5. Some kind of page that appears to have the answer, except that the data seems to be totally fabricated? There's a graph with day-by-day probability of "winter storm". From when I did the search, there's an average of about a 50% daily chance of a "snow storm" going forward for the next 2 weeks. Forecasts that don't seem fake have it at 1% or less daily. Page appears to be some kind of SEO'd fake forecast that makes money on ads?</li> <li>6-8. [more links from same site] Various ad-laden pages. One is a "contact us" page where the main "contact us" pane is actually a trick to get you to click on an ad for some kind of monthly payment service that looks like a scam</li> <li>9-14. [Explore 6 related pages ... recommended to you based on what's popular] Only one link is relevant. That link has a "farmer's almanac" forecast that's fairly different from Environment Canada's forecast. The farmer's almanac page mainly seems to be an ad to get you to buy farmer's almanac stuff, although it also has conventional ads</li> </ul> <h4 id="kagi-4">Kagi</h4> <ul style="list-style: none;"> <li>1. Same SEO'd fake forecast as Bing (5)</li> <li>2-4. More results from the scam weather site</li> <li>5-7.
[News] Irrelevant results</li> <li>8. Spam article from same site as Google (6)</li> <li>9-13. More SEO spam from the same site</li> <li>14. Same fake forecast as Google (1)</li> <li>15. Page is incorrectly tagged as being from "Dec 25, 2009" (it's a recent page) and doesn't contain relevant results</li> </ul> <h4 id="marginalia-5">Marginalia</h4> <p>No results.</p> <h4 id="mwmbl-5">Mwmbl</h4> <ul style="list-style: none;"> <li>1. Ad-laden news article from 2022 about a power outage. Has an autoplay video ad and many other ads as well.</li> <li>2. 2021 article about how the snow forecast for Philadelphia was incorrect. Article has a slow-loading full-page pop-over that shows up after a few seconds and is full of ads.</li> <li>3. 2016 article on when the Ohio river last froze over.</li> <li>4. Some local news site from Oregon with a Feb 2023 article on the snow forecast at the time. Site has an autoplay video ad and is full of other ads. Clicking one of the random ads ("Amazon Hates When You Do This, But They Can't Stop You (It's Genius)") results in the ad trying to get you to install a chrome extension. The ad attempts to resemble an organic blog post on a site that's just trying to get you to save money, but if you try to navigate away from the "blog post", you get a full-page pop-over that tries to trick you into installing the chrome extension. Going to the base URL reveals that the entire site exists to trick users into installing this chrome extension. This is the last result.</li> </ul> <h4 id="chatgpt-5">ChatGPT</h4> <p>&quot;What is the snow forecast for Vancouver in winter of 2023?&quot;</p> <p>Doesn't answer the question; recommends using a website, app, or weather service.</p> <p>Asking &quot;Could you please direct me to a weather website, app, or weather service that has the forecast?&quot; causes ChatGPT to return random weather websites that don't have a seasonal snow forecast.</p> <p>I retried a few times. One time, I accidentally pasted in the entire ChatGPT question, which meant that my question was prepended with &quot;User\n&quot;. That time, ChatGPT suggested &quot;the Canadian Meteorological Centre, Environment Canada, or other reputable weather websites&quot;. The top response when asking for the correct website was &quot;Environment Canada Weather&quot;, which at least has a reasonable-seeming seasonal snow forecast somewhere on the website. The other links were still to sites that aren't relevant.</p> <h2 id="appendix-google-knowledge-card-results">Appendix: Google &quot;knowledge card&quot; results</h2> <p>In general, I've found Google knowledge card results to be quite poor, <a href="https://twitter.com/sasuke___420/status/1428104132301185029">both for specific questions</a> with <a href="https://twitter.com/danluu/status/1428103276583661570">easily findable answers</a> as well as for silly questions like &quot;when was running invented&quot; which, for years, infamously returned &quot;1748. Running was invented by Thomas Running when he tried to walk twice at the same time&quot; (which was pulled from a Quora answer).</p> <p>I had a doc where I was collecting every single knowledge card I saw to tabulate the fraction that were correct.
I don't know that I'll ever turn that into a post, so here are some &quot;random&quot; queries with their knowledge card result (and, if anyone is curious, most knowledge card results I saw when I was tracking this were incorrect).</p> <ul> <li>&quot;oc2 gemini length&quot; (looking for the length of a kind of canoe, an oc2, called a gemini) <ul> <li>20″ (this was the length of a baby mentioned in an article that also mentioned the length of the boat, which is 24'7&quot;)</li> </ul></li> <li>&quot;busy beaver number&quot; <ul> <li>(604) 375-2754</li> </ul></li> <li>&quot;Feedly revenue&quot; <ul> <li>&quot;$5.2M/yr&quot;, via a link to a site which appears to just completely fabricate revenue and profit estimates for private companies</li> </ul></li> <li>&quot;What airlines fly direct from JFK airport to BLI airport?&quot; <ul> <li>&quot;Alaska Airlines - (AS) with 30 direct flights between New York and Bellingham monthly; Delta Air Lines - (DL) with 30 direct flights between JFK and BLI monthly&quot;. This sounded plausible, but when I looked this up, this was incorrect. The page it links to has a bunch of text like &quot;How many morning flights are there from JFK to BLI? Alaska Airlines - (AS) lists, on average, 1 flights departing before 12:00pm, where the first departure from JFK is at 09:30AM and the last departure before noon is at 09:30AM&quot;, seemingly with the goal of generating a knowledge card for questions like this. It doesn't really matter that the answers are fabricated since the goal of the site seems to be to get traffic or visibility via knowledge cards</li> </ul></li> <li>&quot;Air Canada Vancouver Newark&quot; <ul> <li>At the time I did this search, this showed a knowledge card indicating that AC 7082 was going to depart the next day at 11:50am, but no such flight had existed for months and there was certainly not an AC 7082 flight about to depart the next day</li> </ul></li> <li>&quot;TYR Hurricane Category 5 neoprene thickness&quot; <ul> <li>1.5mm (this is incorrect)</li> </ul></li> <li>&quot;Intel number of engineers&quot; <ul> <li>(604) 742-3501 (I was looking for the number of engineers that Intel employed, not a phone number, and even if I was looking for a phone number for Intel engineers, I don't think this is it).</li> </ul></li> <li>&quot;boston up118s dimensions&quot; <ul> <li>&quot;5826298 x 5826899 x 582697 in&quot; (this is a piano and, no, it is not 92 miles long)</li> </ul></li> <li>&quot;number of competitive checkers players&quot; <ul> <li>2</li> </ul></li> <li>&quot;fraser river current speed&quot; <ul> <li>&quot;97 to 129 kilometers per hour (60 to 80 mph)&quot; (this is incorrect)</li> </ul></li> <li>&quot;futura c-4 surfski weight&quot; <ul> <li>&quot;39 pounds&quot; (this is actually the weight of a different surfski; the article this comes from just happens to also mention the futura c-4)</li> </ul></li> </ul> <h2 id="appendix-faq">Appendix: FAQ</h2> <p>As already noted, the most common responses I get are generally things that are explicitly covered in the post, so I won't cover those again here. However, any time I write a post that looks at anything, I also get a slew of comments like the one below and, indeed, that was one of the first comments I got on this post.</p> <blockquote> <p>This isn't a peer-reviewed study, it's crap</p> </blockquote> <p>As I noted in <a href="overwatch-gender/">this other post</a>,</p> <blockquote> <p>There's nothing magic about academic papers.
I have my name on a few publications, including one that won best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.</p> <p>When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.</p> </blockquote> <p>The same thing applies here and, in fact, I have a best paper award in this field (information retrieval, or IR, colloquially called search). I don't find IR papers particularly rigorous. I did push very hard to make <a href="bitfunnel-sigir.pdf">my top-conference best-paper-award-winning paper</a> more rigorous and, while I won some of those fights, I lost others, and that paper has a number of issues that I wouldn't let pass in a blog post. I suspect that people who make comments like this mostly don't read papers and, to the extent they do, don't understand them.</p> <p>Another common response is</p> <blockquote> <p>Your table is wrong. I tried these queries on Kagi and got Good results for the queries [but phrased much more strongly]</p> </blockquote> <p>I'm not sure why people feel so strongly about Kagi, but all of these kinds of responses so far have come from Kagi users. No one has gotten good results for the tire, transistor, or snow queries (note, again, that this is not a query looking for a daily forecast, as clearly implied by the &quot;winter 2023&quot; in the query), nor are the results for the other queries very good if you don't have an ad blocker. I suppose it's possible that the next person who tells me this actually has good results, but that seems fairly unlikely given the zero percent correctness rate so far.</p> <p>For example, one user claimed that the results were all good, but they pinned GitHub results and only ran the queries for which you'd get a good result on GitHub. This is actually worse than you get if you use Google or Bing and write good queries since you'll get noise in your results when GitHub is the wrong place to search. Of course, you could make a similar claim that Bing is amazing if you write non-naive queries, so it's curious that so many Kagi users are angrily writing me about this and no Google or Bing users are. Kagi appears to have tapped into the same vein that Tesla and Apple have managed to tap into, where users become incensed that someone is criticizing something they love and then <a href="https://mastodon.social/@danluu/111461841130226661">write nonsensical defenses of their favorite product</a>, which bodes well for Kagi. I've gotten comments like this from not just one Kagi user, but many.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:B">this person does go on to say &quot;, but it <em>is</em> true that a lot of, like, tech industry/trade stuff has been overwhelmed by LLM-generated garbage&quot;. However, the results we see in this post generally seem to be non-LLM-generated text, often from pages pre-dating LLMs, and low-quality results don't seem confined to, or even particularly bad in, tech-related areas. Or, to pick another example, our bluesky thought leader is in a local Portland band.
If I search &quot;[band name] members&quot;, I get a knowledge card which reads &quot;[different band name] is a UK indie rock band formed in Glastonbury, Somerset. The band is composed of [names and instruments].&quot; <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:M"><p>For example, for a youtube downloader, my go-to would be to search HN, which returns reasonable results. Although that works, if it didn't, my next step would be to search reddit (but not using reddit search, of course), which returns a mix of good and bad results; searching for info about each result shows that the 2nd returned result (<code>yt-dlp</code>) is good and most of the other results are quite bad. Other people have different ways of getting good results, e.g., Laurence Tratt's reflex is to search for &quot;youtube downloader cli&quot; and Heath Borders's is to search for &quot;YouTube Downloader GitHub&quot;; both of those searches work decently as well. If you're someone whose bag of tricks includes the right contortions to get good results for almost any search, it's easy to not realize that most users don't actually know how to do this. From having watched non-expert users try to use computers with advice from expert users, it's clear that many sophisticated users severely underestimate how much knowledge they have. For example, I've heard many programmers say that they're good at using computers because &quot;I just click on random things to see what happens&quot;. Maybe so, but when they give this advice to naive users, this generally doesn't go well and the naive users will click on the wrong random things. The expert user is not, in fact, just clicking on things at random; they're using their mental model of the system to try clicks that might plausibly make sense. Similarly with search, where people will give semi-plausible-sounding advice like &quot;just add site:reddit.com to queries&quot;. But adding &quot;site:reddit.com&quot; makes many queries worse instead of better; you have to have a mental model of which queries this works on and which queries it fails on.</p> <p>When people have some kind of algorithm that they consistently use, it's often one that produces poor results and is also very surprising to technical folks. For example, Misha Yagudin noted, &quot;I recently talked to some Russian emigrates in Capetown (two couples have travel agencies, and another couple does RUB&lt;&gt;USDT&lt;&gt;USD). They were surprised I am not on social media, and I discovered that people use Instagram (!!) instead of Google to find products and services these days. The recipe is to search for something you want 'triathlon equipment,' click around a bit, then over the next few days you will get a bunch of recommendations, and by clicking a bit more you will get even better recommendations. This was wild to me.&quot;</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:N">she did better than naive computer users, but still had a lot of holes in her mental model that would lead to installing malware on her machine. For what it's like for normal computer users, the internet is full of stories from programmers like <a href="https://news.ycombinator.com/item?id=34919473">&quot;The number of times I had to yell at family members to NOT CLICK THAT ITS AN AD is maddening. It required getting a pretty nasty virus and a complete wipe to actually convince my dad to install adblock.&quot;</a>.
The internet is full of <a href="https://news.ycombinator.com/item?id=34917701">scam ads that outrank search</a> and install malware, a decent fraction of users are on devices that have been owned by clicking on an ad or a malicious SEO'd search result, and you have to constantly watch most users if you want to stop their devices from being owned. <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> <li id="fn:R">accidentally prepending &quot;User\n&quot; to one query got it to return a good result instead of bad results, reminiscent of how <a href="https://twitter.com/danluu/status/1601072083270008832">ChatGPT &quot;thought&quot; Colin Percival was dead if you asked it to &quot;write about&quot; him, but alive if you asked it to &quot;Write about&quot; him</a>. It's already commonplace for search ranking to be done with multiple levels of ranking, so perhaps you could get good results by running randomly perturbed queries and using a 2nd level ranker, or ChatGPT could even have something like this built in. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:T">some time after Google stopped returning every tweet I wanted to find, Twitter search worked well enough that I could find tweets with Twitter search. However, post-acquisition, Twitter search often doesn't work in various ways. For maybe 3-5 months, search didn't return any of my tweets at all. And both before and after that period, searches often fail to return a tweet even when I search for an exact substring of a tweet, so now I often have to resort to various weird searches for things that I expect to link to the tweet I'm looking for so I can manually follow the link to get to the tweet. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> </ol> </div> Transcript of Elon Musk on stage with Dave Chapelle elon-dave-chappelle/ Sun, 11 Dec 2022 00:00:00 +0000 elon-dave-chappelle/ <p>This is a transcription of videos of Elon Musk's appearance on stage with Dave Chapelle using OpenAI's Whisper model with some manual error corrections and annotations for crowd noise.</p> <p>As with the <a href="elon-twitter-texts/">Exhibit H Twitter text message release</a>, there are a lot of articles that quote bits of this, but the articles are generally missing a lot of what happened and often paint a misleading picture of what happened, and the entire thing is short enough that you might as well watch or read it instead of reading someone's misleading summary. In general, the media seems to want to paint a highly unflattering picture of Elon, resulting in articles and viral tweets that are factually incorrect. For example, it's been widely incorrectly reported that, during the &quot;I'm rich, bitch&quot; part, horns were played to drown out the crowd's booing of Elon, but the horn sounds were played when the previous person said the same thing, which was the most cheered statement that was recorded. The sounds are much weaker when Elon says &quot;I'm rich, bitch&quot; and can't be heard clearly, but it sounds like a mix of booing and cheering. It was probably the most positive crowd response that Elon got from anything, and it seems inaccurate in at least two ways to say that horns were played to drown out the booing Elon was receiving.
On the other hand, even though the media has tried to paint as negative a picture of Elon as possible, it's done quite a poor job, and a boring, accurate accounting of what happened in many of the other sections is much less flattering than the misleading summaries that are being passed around.</p> <p><ul> <li><a href="https://www.youtube.com/watch?v=CzkreBMHUFY">Video 1</a> <ul> <li><b>Dave</b>: Ladies and gentlemen, make some noise for the richest man in the world.</li> <li><b>Crowd</b>: [mixed cheering, clapping, and boos; boos drown out cheering and clapping after a couple of seconds and continue into next statements]</li> <li><b>Dave</b>: Cheers and boos, I say</li> <li><b>Crowd</b>: [brief laugh, boos continue to drown out other crowd noise]</li> <li><b>Dave</b>: Elon</li> <li><b>Crowd</b>: [booing continues]</li> <li><b>Elon</b>: Hey Dave</li> <li><b>Crowd</b>: [booing intensifies]</li> <li><b>Elon</b>: [unintelligible over booing] <li><b>Dave</b>: Controversy, buddy. <li><b>Crowd</b>: [booing continues; some cheering can be heard]</li> <li><b>Elon</b>: Weren't expecting this, were ya?</li> <li><b>Dave</b>: It sounds like some of them people you fired are in the audience.</li> <li><b>Crowd</b>: [laughs, some clapping can be heard] <li><b>Elon</b>: [laughs] <li><b>Crowd</b>: [booing resumes] <li><b>Dave</b>: Hey, wait a minute. Those of you booing</li> <li><b>Crowd</b>: [booing intensifies]</li> <li><b>Dave</b>: Tough [unintelligible due to booing] sounds like</li> <li><b>Elon</b>: [unintelligible due to being immediately cut off by Dave]</li> <li><b>Dave</b>: You know there's one thing. All those people are booing. I'm just. I'm just pointing out the obvious. They have terrible seats. [unintelligible due to crowd noise]</li> <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: All coming from wayyy up there [unintelligible] last minute non-[unintelligible] n*****. Booo. Booooooo.</li> <li><b>Crowd</b>: [quiets down]</li> <li><b>Dave</b>: Listen.</li> <li><b>Crowd</b>: [booing resumes]</li> <li><b>Dave</b>: Whatever. Look motherfuckas. This n**** is not even trying to die on earth</li> <li><b>Crowd</b>: [laughter mixed with booing, laughter louder than boos]</li> </ul></li></p> <p><li><a href="https://www.youtube.com/watch?v=hUkulqjTNpk">Video 2</a> <ul> <li><b>Dave</b>: His whole business model is fuck earth I'm leaving anyway</li> <li><b>Crowd</b>: [weak laughter, weak mixed sounds]</li> <li><b>Dave</b>: Do all you want. Take me with you n**** I'm going to Mars</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Dave</b>: Whatever kind of pussy they got up there, that's what we'll be doin <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: [laughs] Anti-gravity titty bars. Follow your dreams bitch and the money just flow all over the room</li> <li><b>Crowd</b>: [weak laughter]</li> <li><b>Elon</b>: [laughs]</li> <li><b>Crowd</b>: [continued laughter drowned out by resumed booing; some cheering can be heard]</li> <li><b>Elon</b>: Thanks for, uhh, thanks for having me on stage.</li> <li><b>Dave</b>: Are you kidding. I wouldn't miss this opportunity.</li> <li><b>Elon</b>: [unintelligible, cut off by crowd laughter]</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Elon</b>: [unintelligible, cut off by crowd laughter]</li> <li><b>Dave</b>: The first comedy club on Mars that should be my [pause for crowd laughter] a deal's a deal, Musk.</li> <li><b>Crowd</b>: [weak laughter and cheering]</li> <li><b>Elon</b>: [unintelligible], yeah</li> <li><b>Dave</b>: You n***** can boo all you want.
This n**** gave me a jet pack last Christmas</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Dave</b>: Fly right past your house. They can boo these nuts [unintelligible due to laughter at this line]</li> <li><b>Dave</b>: That's how we like to chill, we do all the shit</li> <li><b>Crowd</b>: [weak laughter, shifting to crowd talking]</li> <li><b>Elon</b>: [Elon shifts, as if to address crowd]</li> <li><b>Crowd</b>: [booing resumes]</li> <li><b>Elon</b>: Dave, what should I say?</li> <li><b>Crowd</b>: [booing intensifies]</li> <li><b>Dave</b>: Don't say nothin. It'll only spoil the moment. Do you hear that sound Elon? That's the sound of pending civil unrest.</li> <li><b>Crowd</b>: [weak laughter, some booing can initially be heard; booing intensifies until Dave cuts it off with his next line]</li> <li><b>Dave</b>: I can't wait to see which story you decimate next motherfucka [unintelligible] you shut the fuck up with your boos. There's something better that you can do. Booing is not the best thing that you can do. Try it n****. Make it what you want it to be. I am your ally. I wish everybody in this auditorium peace and the joy of feeling free and your pursuit of happiness make you happy. Amen. Thank you very much San Francisco. No city on earth has ever been kind to me. Thank you. Good night.</li> </ul></li></p> <p><li><a href="https://www.youtube.com/watch?v=u1cl8U0UCMQ">Video 3</a> [lots of empty seats in the crowd at this point] <ul> <li><b>Dave</b>: [unintelligible] as you can. It's funnier when you say it. Are you ready? Say this [unintelligible] you say. Go ahead.</li> <li><b>Crowd</b>: [weak laughter]</li> <li><b>Maybe Chris Rock?</b>: I'm rich bitch</li> <li><b>Crowd</b>: [loud cheers, loud horn from stage can be heard as well]</li> <li><b>Unknown</b>: Wait wait wait wait [hands mic to Elon]</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Elon</b>: [poses]</li> <li><b>Crowd</b>: [laughter, booing starts to be heard over laughter]</li> <li><b>Elon</b>: I'm rich bitch</li> <li><b>Crowd</b>: [some sound, hard to hear over horns from stage followed by music from the DJ drowning out the crowd; sounds like some booing and some cheering] </ul></li></p> <p><li><a href="https://www.youtube.com/watch?v=WSjQ5ojUvnM">Video 4</a> <ul> <li><b>Dave</b>: <a href="https://jezebel.com/talib-kwelis-harassment-campaign-shows-how-unprotected-1844483551">Talib Kweli my good friend [crowd cheers] is currently banned from Twitter.</a> <li><b>Crowd</b>: [laughter]</li> <li><b>Dave</b>: He goes home to [unintelligible], Kweli. [hands mic to Elon] <li><b>Elon</b>: Ahh waa. Twitter cu-customer service right here. <li><b>Crowd</b>: [weak laughter]</li> <li><b>Elon</b>: We'll get right on that. <li><b>Crowd</b>: [weak booing, gets stronger over time through next statement, until cut off by Dave]</li> <li><b>Elon</b>: Dave, you should be on Twitter. <li><b>Dave</b>: If you. Let me tell you something. Wait. Radio, where's your phone? <li><b>Dave</b>: Listen. Years ago, this is true, I'll tell you two quick Twitter stories then we'll go home. <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: Years ago, I went to love on the Twitter. I put my name in, and it said that you can't use famous people's names. <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: And that my name was already in use, it's true. <li><b>Dave</b>: So I look online to see who's using my name and it turns out it was a fake Dave Chappelle. And I was like, what the fuck? And I started to shut him down, but I read the n***** tweets.
And this is shocking. This motherfucker, Elon, was hilarious. <li><b>Crowd</b>: [weak laughter, someone yells out &quot;damn right&quot;]</li> <li><b>Dave</b>: So I figured, you know what, I'm gonna let him drop. And everybody will think I'm saying all this funny shit, and I don't even have to say this stuff. And it was great. Every morning I wake up and get some coffee and laugh at fake Dave Chappelle's tweets. <li><b>Dave</b>: But then <li><b>Crowd</b>: [loud sounds, can hear someone say &quot;whoa&quot;] <li><b>Dave</b>: [blocks stage light with hand so he can see into the crowd, looks into crowd] Fight. Will you cut that shit out, you anti-[unintelligible; lots of people are reporting this as fascist, which is plausible, making the statement about &quot;anti-fascists&quot;] n*****? <li><b>Crowd</b>: [loud sounds, can hear some jeers and boos]</p> Chat log exhibits from Twitter v. Musk case elon-twitter-texts/ Sat, 01 Oct 2022 00:00:00 +0000 elon-twitter-texts/ <p>This is a scan/OCR of Exhibits H and J from the Twitter v. Musk case, with some of the conversations de-interleaved and of course converted from a fuzzy scan to text to make for easier reading.</p> <p>I did this so that I could easily read this and, after reading it, I've found that most accountings of what was said are, in one way or another, fairly misleading. Since the texts aren't all that long, if you're interested in what they said, <a href="dunning-kruger/">I would recommend that you just read the texts in their entirety</a> (to the extent they're available — the texts make it clear that some parts of conversations are simply not included) instead of reading what various journalists excerpted, which seems to sometimes be deliberately misleading because selective quoting allows them to write a story that matches their agenda and sometimes accidentally misleading because they don't know what's interesting about the texts.</p> <p>If you want to compare these conversations to other executive / leadership conversations, you can compare them to <a href="us-v-ms/">Microsoft emails and memos that came out of the DoJ case against Microsoft</a> and the <a href="https://www.cs.cmu.edu/~enron/">Enron email dataset</a>.</p> <p>Since this was done using OCR, it's likely there are OCR errors. Please feel free to <a href="https://twitter.com/danluu/">contact me</a> if you see an error.</p> <h3 id="exhibit-h">Exhibit H</h3> <ul> <li><a id="1"></a><a href="#1">2022-01-21 to 2022-01-24</a> <ul> <li><b>Alex Shillings [IT specialist for SpaceX / Elon]</b>: Elon- are you able to access your Twitter account ok? I saw a number of emails held in spam. Including some password resets attempts</li> <li><b>Elon</b>: I haven't tried recently</li> <li><b>Elon</b>: Am staying off twitter</li> <li><b>Elon</b>: Is my twitter account posting anything?</li> <li><b>Alex</b>: Not posting but I see one deactivation email and a dozen password reset emails. Assuming this is a scammer attempt but wanted to check to ensure you still had access to your Twitter</li> <li><b>Elon</b>: It Is someone trying to hack my twitter</li> <li><b>Elon</b>: But I have two-factor enabled with the confirmation app</li> <li><b>Alex</b>: OK, great to hear.</li> <li><b>Alex</b>: Yes -FaceTimed with them to confirm my identity(hah) and they are hopefully gonna reset your 2FA to SMS soon.
Asking for an update now</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: I can also FaceTime with them if still a problem</li> <li><b>Alex</b>: Tldr; your account is considered high profile internally over there. So they've made it very hard to make changes like this by their teams. They are working through it...</li> <li><b>Elon</b>: Happy to FaceTime directly</li> <li><b>Elon</b>: Not sure how I was able to make Twitter work on this new phone, as I didn't use the backup code.</li> <li><b>Alex</b>: Connecting with their head of Trust &amp; Safety now</li> <li><b>Alex</b>: I assume we used your old phone to verify the new, once upon a time</li> <li><b>Elon</b>: Oh yeah</li> <li><b>Alex</b>: They can fix it by disabling all 2FA for your account which will let you in and then you can re-enable it. Are you available in 90 mins to have them coordinate it?</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: I know things are in flux right now, but is EMDesk SpaceX still your primary calendar? I realize there a meeting on there in 1 hour. In case I should move this twitter fix out a bit.</li> <li><b>Elon</b>: Yeah</li> <li><b>Elon</b>: But I can step off the call briefly ta Face Time them if need be</li> <li><b>Alex</b>: Sounds good. And ideally I'm just texting you ta sign in once they disable 2FA and then you can immediately sign in and re-enable. No FaceTime needed.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: Elon-we are ready to make the change if you are</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: 2FA disabled. Please try to log in now</li> <li><b>Alex</b>: Able to get back in ok?</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: And once in you can enable 2FA Settings&gt; Security and account access&gt; Security&gt; 2FA Alex Stillings</li> <li><b>Alex</b>: App only is suggested</li> <li><b>Elon</b>: Thanks!</li> <li><b>Alex</b>: And reminder to save that backup code 👍</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="2"></a><a href="#2">2022-03-05</a> <ul> <li><b>Antonio Gracias [VC]</b>: Wow...I saw your tweet re free speech. Wtf is going on Elon...</li> <li><b>Elon</b>: EU passed a law banning Russia Today and several other Russian news sources. We have been told to block their IP address.</li> <li><b>Elon</b>: Actually, I find their news quite entertaining</li> <li><b>Elon</b>: Lot of bullshit, but some good points too</li> <li><b>Antonio</b>: This is fucking nuts...you are totally right. I 100% agree with you.</li> <li><b>Elon</b>: We should allow it precisely bc we hate it...that is the ping of the American constitution.</li> <li><b>Antonio</b>: Exactly</li> <li><b>Elon</b>: Free speech matters mast when it's someone you hate spouting what you think is bullshit.</li> <li><b>Antonio</b>: I am 100% with you Elon. To the fucking mattresses no matter what .....this is a principle we need to fucking defend with our lives or we are lost to the darkness.</li> <li><b>Antonio</b>: Sorry for the swearing. I am getting excited.</li> <li><b>Elon</b>: [&quot;loved&quot; &quot;I am 100%...&quot;]</li> <li><b>Elon</b>: [2022-04-26] On a call. Free in 3O mins.</li> <li><b>Antonio</b>: Ok. I'll call you in 30</li> </ul></li><br> <li><a id="3"></a><a href="#3">2022-03-24</a> <ul> <li><b>TJ</b>: can you buy Twitter and then delete it, please!? xx</li> <li><b>TJ</b>: America is going INSANE.</li> <li><b>TJ</b>: The Babylon Bee got suspension is crazy. 
Raiyah and I were talking about it today. It was a fucking joke. Why has everyone become so puritanical?</li> <li><b>TJ</b>: Or can you buy Twitter and make it radically free-speech?</li> <li><b>TJ</b>: So much stupidity comes from Twitter xx</li> <li><b>Elon</b>: Maybe buy it and change it to properly support free speech xx</li> <li><b>Elon</b>: [&quot;liked&quot; &quot;Or can you buy Twitter...&quot;]</li> <li><b>TJ</b>: I honestly think social media is the scourge of modern life, and the worst of all is Twitter, because it's also a news stream as well as a social platform, and so has more real-world standing than Tik Tok etc. But it's very easy to exploit and is being used by radicals for social engineering on a massive scale. And this shit is infecting the world. Please do do something to fight woke-ism. I will do anything to help! xx</li> </ul></li><br> <li><a id="4"></a><a href="#4">2022-03-24 to 2022-04-06</a> [interleaved with above convo] <ul> <li><b>Joe Lonsdale [VC]</b>: I love your &quot;Twitter algorithm should be open source&quot; tweet -I'm actually speaking to over 100 members of congress tomorrow at the GOP policy retreat and this is one of the ideas I'm pushing for reigning in crazy big tech. Now I can cite you so I'll sound less crazy myself :). Our public squares need to not have arbitrary sketchy censorship.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Absolutely</li> <li><b>Elon</b>: What we have right now is hidden corruption!</li> <li><b>Joe</b>: [&quot;loved&quot; above]</li> <li><b>[2022-04-04]</b>: Joe: Excited to see the stake in Twitter -awesome. &quot;Back door man&quot; they are saying haha. Hope you're able to influence it. I bet you the board doesn't even get full reporting or see any report of the censorship decisions and little cabals going on there but they should -the lefties on the board likely want plausible deniability !</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-16] Joe</b>: Haha even Governor DeSantis just called me just now with ideas how to help you and outraged at that board and saying the public is rooting for you. Let me know if you or somebody on your side wants to chat w him. Would be fun to see you if you guys are around this weekend or the next few days.</li> <li><b>Elon</b>: Haha cool</li> </ul></li><br> <li><a id="5"></a><a href="#5">2022-03-26</a> <ul> <li><b>&quot;jack jack&quot; [presumably Jack Dorsey, former CEO of Twitter and CEO of Square]</b>: Yes, a new platform is needed. It can't be a company. This is why I left.</li> <li><b>jack</b>: <a href="https://twitter.com/elonmusk/status/1507777913042571267?s,=20&amp;t=8z3h0h0JGSnt86Zuxd61Wg">https://twitter.com/elonmusk/status/1507777913042571267?s,=20&amp;t=8z3h0h0JGSnt86Zuxd61Wg</a></li> <li><b>Elon</b>: Ok</li> <li><b>Elon</b>: What should it look like?</li> <li><b>jack</b>: I believe it must be an open source protocol, funded by a foundation of sorts that doesn't own the protocol, only advances it. A bit like what Signal has done. It can't have an advertising model. Otherwise you have surface area that governments and advertisers will try to influence and control. If it has a centralized entity behind it, it will be attacked. This isn't complicated work, it just has to be done right so it's resilient to what has happened to twitter.</li> <li><b>Elon</b>: Super interesting idea</li> <li><b>jack</b>: I'm off the twitter board mid May and then completely out of company. I intend to do this work and fix our mistakes. 
Twitter started as a protocol. It should have never been a company. That was the original sin.</li> <li><b>Elon</b>: I'd like to help if I am able to</li> <li><b>jack</b>: I wanted to talk with you about it after I was all clear, because you care so much, get it's importance, and could def help in immeasurable ways. Back when we had the activist come in, I tried my hardest to get you on our board, and our board said no. That's about the time I decided I needed to work to leave, as hard as it was for me.</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>jack</b>: Do you have a moment to talk?</li> <li><b>Elon</b>: Bout to head out to dinner but can for a minute</li> <li><b>jack</b>: I think the main reason is the board is just super risk averse and saw adding you as more risk, which I thought was completely stupid and backwards, but I only had one vote, and 3% of company, and no dual class shares. Hard set up. We can discuss more.</li> <li><b>Elon</b>: Let's definitely discuss more</li> <li><b>Elon</b>: I think it's worth both trying to move Twitter in a better direction and doing something new that's decentralized</li> <li><b>jack</b>: It's likely the best option. I just have doubts. But open</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="6"></a><a href="#6">2022-03-26 to 2022-03-27</a> <ul> <li><b>Elon to Egon Durban [private equity; Twitter board member]</b>: This is Elon. Please call when you have a moment.</li> <li><b>Elon</b>: It is regarding the Twitter board.</li> <li><b>Egon</b>: Have follow-up. Let's chat today whenever convenient for you.</li> </ul></li><br> <li><a id="7"></a><a href="#7">2022-03-27 to 2022-04-26</a> [interleaved with above] <ul> <li><b>Larry Ellison [Oracle founder and exec]</b>: Elon, I'd like to chat with you in the next day or so ... I do think we need another Twitter 👍</li> <li><b>Elon</b>: Want to talk now?</li> <li><b>Larry</b>: Sure.</li> <li><b>[2022-04-17] Elon</b>: Any interest in participating in the Twitter deal?</li> <li><b>Larry</b>: Yes ... of course 👍</li> <li><b>Elon</b>: Cool</li> <li><b>Elon</b>: Roughly what dollar size? Not holding you to anything, but the deal is oversubscribed, so I have to reduce or kick out some participants.</li> <li><b>Larry</b>: A billion ... or whatever you recommend</li> <li><b>Elon</b>: Whatever works for you. I'd recommend maybe $2B or more. This has very high potential and I'd rather have you than anyone else.</li> <li><b>Larry</b>: I agree that it has huge potential... and it would be lots of fun</li> <li><b>Elon</b>: Absolutely:)</li> <li><b>[2022-04-26] Larry</b>: Since you think I should come in for at least $2B... I'm in for $2B 👍</li> <li><b>Elon</b>: Haha thankss:)</li> </ul></li><br> <li><a id="8"></a><a href="#8">2022-03-27 to 2022-03-31</a> [group chat with Egon Durban, &quot;Martha Twitter NomGov&quot;, Brett Taylor [CEO of Salesforce and Chairman of Twitter board], &quot;Parag&quot; [presumably Parag Agrawal, CEO of Twitter], and Elon Musk] <ul> <li><b>Egon</b>: Hi everyone Parag (Ceo), Bret (Chairman) and Martha (head of gov) -You are connected w Elon. He is briefed on my conversations w you. Elon -everyone excited about prospect of you being involved and on board. Next step is for you to chat w three of them so we can move this forward quickly. Maybe we can get this done next few days🤞</li> <li><b>Elon</b>: Thanks Egon</li> <li><b>Parag</b>: Hey Elon - great to be connected directly. Would love to chat!
Parag</li> <li><b>Martha</b>: Hey Elon, I'm Martha chair of Twitter nomgov- know you've talked to Bret and parag - keen to have a chat when you have time - im in Europe (also hope covid not too horrible as I hear you have it)</li> <li><b>Parag</b>: Look forward to meeting soon! Can you let us know when you are able to meet in the Bay Area in the next couple of days?</li> <li><b>Martha</b>: Hey Elon, I'm Martha chair of Twitter nomgov - know you've talked to Bret and parag -I'm v keen to have a chat when you have time - im in Europe but will make anything work</li> <li><b>Elon</b>: Sounds good. Perhaps a call late tonight central time works? I'm usually up until ~3am.</li> <li><b>Martha</b>: If ok with you I'll try you 10am CET (lam PST) looking forward to meeting you</li> <li><b>Elon</b>: Sure</li> <li><b>Martha</b>: Thanks v much for the time Elon -pis let us know who in your office our GC can talk to -sleep well!</li> <li><b>Elon</b>: You're most welcome. Great to talk!</li> </ul></li><br> <li><a id="9"></a><a href="#9">2022-03-27</a> [interleaved with above] <ul> <li><b>Brett Taylor</b>: This is Bret Taylor. Let me know when you have a minute to speak today.Just got off with Parag and I know he is eager to speak with you today as well. Flexible all day</li> <li><b>Elon</b>: Later tonight would work - maybe 7pm? I have a minor case of Covid, so am a little under the weather.</li> <li><b>Brett</b>: Sorry to hear -it can knock you out. 7pm sounds great</li> <li><b>Elon</b>: [&quot;liked above]</li> </ul></li><br> <li><a id="10"></a><a href="#10">2022-03-27</a> <ul> <li><b>Parag</b>: Would love to talk. Please let me know what time works - I'm super flexible. -Parag</li> <li><b>Elon</b>: Perhaps tonight around 8?</li> <li><b>Parag</b>: That works! Look forward to talking.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Just finishing a Tesla Autopilot engineering call</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="11"></a><a href="#11">2022-03-27 to 2022-04-24</a> [interleaved with above] <ul> <li><b>&quot;Dr Jabour&quot;</b>: Hi E,Pain settling down? Time for a latter-day Guttenburg to bring back free speech ..... and buy Twitter.</li> <li><b>[2022-04-04] Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-24] Jabour</b>: Hi E,looks like a TWITTER board member scurrying around unbalanced trying to deal with your offer.... Am loving your tactics, ( vid taken fro my house on Monica beach)-Brad</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="12"></a><a href="#12">2022-03-29 to 2022-04-01</a> [interleaved with above] <ul> <li><b>Will MacAskill [co-creator of effective altruism movement, Oxford professor, Chair of board for Global Priorities Institute at Oxford]</b>: Hey - I saw your poll on twitter about Twitter and free speech. I'm not sure if this is what's on your mind, but my collaborator Sam Bankman-Fried (<a href="https://www.forbes.com/profile/sam-bankman-fried/?sh=4de9866a4449">https://www.forbes.com/profile/sam-bankman-fried/?sh=4de9866a4449</a>) has for a while been potentially interested in purchasing it and then making it better for the world. If you want to talk with him about a possible joint effort in that direction, his number is [redacted] and he's on Signal.</li> <li><b>Elon</b>: Does he have huge amounts of money?</li> <li><b>Will</b>: Depends on how you define &quot;huge&quot;! He's worth $24B, and his early employees (with shared values) bump that to $30B. 
I asked about how much he could in principle contribute and he said: &quot;~$1-3b would be easy-$3-8b I could do ~$8-15b is maybe possible but would require financing&quot;</li> <li><b>Will</b>: If you were interested to discuss the idea I asked and he said he'd be down to meet you in Austin</li> <li><b>Will</b>: He's based in the Bahamas normally. And I might visit Austin next week, if you'd be around?</li> <li><b>Will</b>: That's a start</li> <li><b>Will</b>: Would you like me to intro you two via text?</li> <li><b>Elon</b>: You vouch for him?</li> <li><b>Will</b>: Very much so! Very dedicated to making the long-term future of humanity go well</li> <li><b>Elon</b>: Ok then sure</li> <li><b>Will</b>: Great! Will use Signal</li> <li><b>Will</b>: (Signal doesn't work; used imessage instead)</li> <li><b>Elon</b>: Ok</li> <li><b>Will</b>: And in case you want to get a feel for Sam, here's the Apr 1st tweet from his foundation, the Future Fund, which I'm advising on -I thought you might like it:</li> <li><b>Will</b>: <a href="https://twitter.com/ftxfuturefund/status/1509924452422717440?s=20&amp;t=0qjM58KUj49xSGa0qae97Q">https://twitter.com/ftxfuturefund/status/1509924452422717440?s=20&amp;t=0qjM58KUj49xSGa0qae97Q</a></li> <li><b>Will</b>: And here's the actual (more informative) launch tweet· moving $100M-$1B this year to improve the future of humanity:</li> <li><b>Will</b>: <a href="https://twitter.com/ftxfuturefund/status/1498350483206860801">https://twitter.com/ftxfuturefund/status/1498350483206860801</a></li> </ul></li><br> <li><a id="13"></a><a href="#13">2022-03-29 to 2022-04-14</a> <ul> <li><b>Mathias Döpfner [CEO and 22% owner of Axel Springer, president of Federal Association of Digital Publishers and Newspaper Publishers]</b>: Why don't you buy Twitter? We run it for you. And establish a true platform of free speech. Would be a real contribution to democracy.</li> <li><b>Elon</b>: Interesting idea</li> <li><b>Mathias</b>: I'm serious. It's doable. Will be fun.</li> <li><b>[2022-04-04] Mathias</b>: Congrats to the Twitter invest! Fast execution 🤩 Shall we discuss wether we should join that project? I was serious with my suggestion.</li> <li><b>Elon</b>: Sure, happy to talk</li> <li><b>Mathias</b>: I am going to miami tomorrow for a week. Shall we speak then or Wednesday and take it from there?</li> <li><b>Elon</b>: Sure</li> <li><b>[2022-04-06] Mathias</b>: A short call about Twitter?</li> <li><b>Mathias</b>: # Status Quo: It is the de facto public town square, but it is a problem that it does not adhere to free speech principles. =&gt; so the core product is pretty good, but (i) it does not serve democracy, and (ii) the current business model is a dead end as reflected by flat share price. # Goal: Make Twitter the global backbone of free speech, an open market place of ideas that truly complies with the spirit of the first amendment and shift the business model to a combination of ad-supported and paid to support quality # Game Plan: 1.),,Solve Free Speech&quot; 1a) Step 1: Make it censorship-FREE by radically reducing Terms of Services (now hundreds of pages) to the following: Twitter users agree to: (1) Use our service to send spam or scam users, (2) Promote violence, (3) Post illegal pornography. 🙃 1b) Step 2: Make Twitter censorship-RESISTANT • Ensure censorship resistance by implementing measures that warrant that Twitter can't be censored long term, regardless of which government and management • How? 
Keep pushing projects at Twitter that have been working on developing a decentralized social network protocol (e.g., BlueSky). It's not easy, but the backend must run on decentralized infrastructure, APls should become open (back to the roots! Twitter started and became big with open APIs). • Twitter would be one of many clients to post and consume content. • Then create a marketplace for algorithms, e.g., if you're a snowflake and don't want content that offends you pick another algorithm. 2.) ,,Solve Share Price&quot; Current state of the business: • Twitters ad revenues grow steadily and for the time being, are sufficient to fund operations. • MAUs are flat, no structural growth • Share price is flat, no confidence in the existing business model and/or</li> <li><b>[2022-04-14] Mathias</b>: Our editor of Die Welt just gave an interview why he left Twitter. What he is criticising is exactly what you most likely want to change. I am thrilled to discuss twitters future when you are ready. So exciting.</li> <li><b>Elon</b>: Interesting!</li> </ul></li><br> <li><a id="14"></a><a href="#14">2022-03-31 to 2022-04-01</a> [group chat with Bret Taylor, Parag, and Elon Musk, interleaved with some of the above] <ul> <li><b>Elon</b>: I land in San Jose tomorrow around 2pm and depart around midnight. My Tesla meetings are flexible, so I can meet anytime in those 10 hours.</li> <li><b>Bret</b>: By &quot;tomorrow&quot; do you mean Thursday or Friday?</li> <li><b>Elon</b>: Today</li> <li><b>Parag</b>: I can make any time in those 10 hours work.</li> <li><b>Bret</b>: I land in Oakland at 8:30pm. Perhaps we can meet at 9:30pm somewhere? I am working to see if I can move up my flight from NYC to land earlier in the meantime</li> <li><b>Bret</b>: Working on landing earlier and landing in San Jose so we can have dinner near you. Will keep you both posted in real time</li> <li><b>Bret</b>: Ok, successfully moved my flight to land at 6:30pm in San Jose. Working on a place we can meet privately</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Crypto spam on Twitter really needs to get crushed. It's a major blight on the user experience and they scam so many innocent people.</li> <li><b>Bret</b>: It sounds like we are confirming 7pm at a private residence near San Jose. Our assistants reached out to Jehn on logistics. Let me know if either of you have any concerns or want to move things around. Looking forward to our conversation.</li> <li><b>Parag</b>: Works for me. Excited to see you both in person!</li> <li><b>Elon</b>: Jehn had a baby and I decided to try having no assistant for a few months</li> <li><b>Elon</b>: Likewise</li> <li><b>Bret</b>: The address is [redacted]</li> <li><b>Bret</b>: Does 7pm work for you Elon?</li> <li><b>Elon</b>: Probably close to that time. Might only be able to get there by 7:30, but will try for earlier.</li> <li><b>Bret</b>: Sounds good. I am going to be a bit early because my plane is landing earlier but free all evening so we can start whenever you get there and Parag and I can catch up in the meantime</li> <li><b>Bret</b>: This wins for the weirdest place I've had a meeting recently. I think they were looking for an airbnb near the airport and there are tractors and donkeys 🤷</li> <li><b>Elon</b>: Haha awesome</li> <li><b>Elon</b>: Maybe Airbnb's algorithm thinks you love tractors and donkeys (who doesn't!)</li> <li><b>Elon</b>: On my way. 
There in about 15 mins.</li> <li><b>Bret</b>: And abandoned trucks in case we want to start a catering business after we meet</li> <li><b>Elon</b>: Sounds like a post-apocalyptic movie set</li> <li><b>Bret</b>: Basically yes</li> <li><b>Elon</b>: Great dinner:)</li> <li><b>Bret</b>: Really great. The donkeys and dystopian surveillance helicopters added to the ambiance</li> <li><b>Elon</b>: Definitely one for the memory books haha</li> <li><b>Parag</b>: Memorable for multiple reasons. Really enjoyed it</li> </ul></li><br> <li><a id="15"></a><a href="#15">2022-03-31 to 2022-04-02</a> [group message with Will MacAskill, &quot;Sam BF&quot;, and Elon Musk, interleaved with above] <ul> <li><b>Will</b>: Hey, here's introducing you both, Sam and Elon. You both have interests in games, making the very long-run future go well, and buying Twitter. So I think you'd have a good conversation!</li> <li><b>Sam</b>: Great to meet you Elon-happy to chat about Twitter (or other things) whenever!</li> <li><b>Elon</b>: Hi!</li> <li><b>Elon</b>: Maybe we can talk later today? I'm in Germany.</li> <li><b>Sam</b>: I'm on EST-could talk sometime between 7pm and 10pm Germany time today?</li> </ul></li><br> <li><a id="16"></a><a href="#16">2022-04-03 to 2022-04-04</a> [group chat with Jared Birchall, &quot;Martha Twitter NomGov&quot;, and Elon Musk] <ul> <li><b>Elon</b>: Connecting Martha (Twitter Norn/Gov) with Jared (runs my family office).</li> <li><b>Elon</b>: Jared, there is important paperwork to be done to allow for me to hopefully join the Twitter board.</li> <li><b>Martha</b>: Thanks Elon - appreciate this - hi Jared - I'm going to put Sean Edgett in touch with you who is GC at Twitter</li> <li><b>Jared</b>: Sounds good. Please have him call anytime or send the docs to my email: [...]</li> <li><b>Martha</b>: 👍</li> <li><b>Martha</b>: Elon - are you available to chat for 5 mins?</li> <li><b>Martha</b>: I'd like to relay the board we just finished</li> <li><b>Elon</b>: Sure</li> <li><b>[2022-04-04] Martha</b>: Morning elon -you woke up to quite a storm.... Great to hear from Bret that you agree we can move this along v quickly today -Jared, I'm assuming it's you I should send the standstill they discussed to you? It will be the same as egon and silverlake undertook. Let me know if should go to someone else - we're really keen to get this done in next couple of hours. Thank you</li> <li><b>Elon</b>: You can send to both of us</li> <li><b>Elon</b>: Sorry, I just woke up when Bret called! I arrived from Berlin around 4am.</li> <li><b>Martha</b>: No apologies necessary. Let's How would you like it sent? If by email, pis let me know where</li> <li><b>Elon</b>: Text or email</li> <li><b>Elon</b>: My email is [redacted]</li> <li><b>Martha</b>: 👍</li> <li><b>Martha</b>: &lt;Attachment-application/vnd.openxmlformatsofficedocument.wordprocessingml.document-Twitter Cooperation Agreement-Draft April 4 2022.docx&gt;</li> <li><b>Martha</b>: Here it is -also gone by email Same as Egon's but even more pared back</li> <li><b>Martha</b>: Just copying you both to confirm sent agreement-v keen to get this done quickly as per your conversation</li> </ul></li><br> <li><a id="17"></a><a href="#17">2022-04-03</a> <ul> <li><b>Bret</b>: Just spoke to Martha. Let me know when you have time to talk today or tomorrow. Sounds like you are about to get on a flight — flexible</li> <li><b>Elon</b>: Sounds good. I'm just about to take off from Berlin to Austin, but free to talk anytime tomorrow.</li> <li><b>Bret</b>: I am free all day. 
Text when you are available. Planning to take a hike with my wife and that is the only part where my reception may be spotty. Looking forward to speaking. And looking forward to working with you!</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="18"></a><a href="#18">2022-04-03</a> [interleaved with above] <ul> <li><b>Parag</b>: I expect you heard from Martha and Bret already. I'm super excited about the opportunity and look forward to working closely and finding ways to use your time as effectively as possible to improve Twitter and the public conversation.</li> <li><b>Elon</b>: Sounds great!</li> </ul></li><br> <li><a id="19"></a><a href="#19">2022-04-03</a> <ul> <li><b>jack</b>: I heard good things are happening</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="20"></a><a href="#20">2022-04-04</a> <ul> <li><b>Ken Griffin [CEO of Citadel]</b>: Love it !!</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="21"></a><a href="#21">2022-04-04</a> <ul> <li><b>Bret Taylor</b>: Hey are you available?</li> <li><b>Bret Taylor</b>: Given the SEC filing, would like to speak asap to coordinate on communications. Call asap when you are back</li> </ul></li><br> <li><a id="22"></a><a href="#22">2022-04-04</a> [interleaved with above] <ul> <li><b>[redacted]</b>: Congratulations!! The above article ☝️ [seemingly referring to <a href="https://www.revolver.news/2022/04/elon-musk-buy-twitter-free-speech-tech-censorship-american-regime-war/">https://www.revolver.news/2022/04/elon-musk-buy-twitter-free-speech-tech-censorship-american-regime-war/</a>] was laying out some of the things that might happen: Step 1: Blame the platform for its users Step 2. Coordinated pressure campaign Step 3: Exodus of the Bluechecks Step 4: Deplatforming &quot;But it will not be easy. It will be a war. Let the battle begin.&quot;</li> <li><b>[redacted]</b>: It will be a delicate game of letting right wingers back on Twitter and how to navigate that (especially the boss himself, if you're up for that) I would also lay out the standards early but have someone who has a savvy cultural/political view to be the VP of actual enforcement</li> <li><b>[redacted]</b>: A Blake Masters type</li> </ul></li><br> <li><a id="23"></a><a href="#23">2022-04-04 to 2022-04-17</a> [interleaved with above] <ul> <li><b>Egon Durban</b>: Hi -if you have a few moments call anytime? Flying to UK</li> <li><b>Elon</b>: Just spoke to Bret. His call woke me up haha. Got it from Berlin at 4am.</li> <li><b>Egon</b>: 🙏</li> <li><b>[2022-04-17] Elon</b>: You're calling Morgan Stanley to speak poorly of me ...</li> </ul></li><br> <li><a id="24"></a><a href="#24">2022-04-04</a> <ul> <li><b>Elon to Jared Birchall</b>: Please talk to Martha about the filing</li> <li><b>Jared</b>: ok</li> </ul></li><br> <li><a id="25"></a><a href="#25">2022-04-04</a> <ul> <li><b>Bret Taylor</b>: Do you have five minutes?</li> <li><b>Elon</b>: Sure</li> </ul></li><br> <li><a id="26"></a><a href="#26">2022-04-04</a> <ul> <li><b>Elon to Parag</b>: Happy to talk if you'd like</li> <li><b>Parag</b>: That will be very helpful. Please call me when you have a moment</li> <li><b>Elon</b>: Just on the phone with Jared. 
Will call as soon as that's done.</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="27"></a><a href="#27">2022-04-04</a> <ul> <li><b>&quot;Kyle&quot;</b>: So can you bust us out of Twitter Jail now lol</li> <li><b>Elon</b>: I do not have that ability</li> <li><b>Kyle</b>: Lol I know I know. Big move though, love to see it</li> </ul></li><br> <li><a id="28"></a><a href="#28">2022-04-04</a> [group chat with Egon Durban, &quot;Martha Twitter NomGov&quot;, Brett Taylor, Parag Agrawal, and Elon Musk, interleaved with above] <ul> <li><b>Elon</b>: Thank you for considering me for the Twitter board, but, after thinking it over, my current time commitments would prevent me from being an effective board member. This may change in the future. Elon</li> </ul></li><br> <li><a id="29"></a><a href="#29">2022-04-04</a> [interleaved with some of above]: <ul> <li><b>Joe Rogan</b>: Are you going to liberate Twitter from the censorship happy mob?</li> <li><b>Elon</b>: I will provide advice, which they may or may not choose to follow</li> </ul></li><br> <li><a id="30"></a><a href="#30">2022-04-04</a> <ul> <li><b>Bret Taylor</b>: <a href="https://twitter.com/trungtohan/status/1510994320471429131?s=10&amp;t=qrv_fOhTfUzRVDe_IbJKlQ">https://twitter.com/trungtohan/status/1510994320471429131?s=10&amp;t=qrv_fOhTfUzRVDe_IbJKlQ</a></li> <li><b>Elon</b>: [&quot;laughed at&quot; above]</li> </ul></li><br> <li><a id="31"></a><a href="#31">2022-04-04 to 2022-04-05</a> [interleaved with above] <ul> <li><b>Parag</b>: You should have an updated agreement in your email. I'm available to chat.</li> <li><b>Elon</b>: Approved</li> <li><b>Parag</b>: [&quot;loved&quot; above]</li> <li><b>Parag</b>: Have a few mins to chat? I'm eager to move fast</li> <li><b>Elon</b>: Sure, I'm just on a SpaceX engine review call.</li> <li><b>Parag</b>: Please call me after</li> <li><b>Parag</b>: I'm excited to share that we're appointing @elonmusk to our board! Through conversations with Elon in recent weeks, it became clear to me that he would bring great value to our Board. Why? Above all else, he's both a passionate believer and intense critic of the service which is exactly what we need on Twitter, and in the Boardroom, to make us stronger in the long-term. Welcome Elon!</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Sending out shortly?</li> <li><b>Parag</b>: <a href="https://twitter.com/paraga/status/1511320953598357505?s=21&amp;t=g9oXkMyPGFahuVNDKcoBa5A">https://twitter.com/paraga/status/1511320953598357505?s=21&amp;t=g9oXkMyPGFahuVNDKcoBa5A</a></li> <li><b>Elon</b>: Cool</li> <li><b>Parag</b>: Super excited!</li> <li><b>Elon</b>: Likewise!</li> <li><b>Elon</b>: Just had a great conversation with Jack! Are you free to talk later tonight?</li> <li><b>Parag</b>: Yeah, what time?</li> <li><b>Elon</b>: Would be great to unwind permanent bans, except for spam accounts and those that explicitly advocate violence.</li> <li><b>Elon</b>: 7pm CA time? Or anytime after that.</li> <li><b>Parag</b>: 7p works! 
Talk soon</li> <li><b>Elon</b>: Calling back in a few mins</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Pretty good summary</li> <li><b>Elon</b>: <a href="https://twitter.com/stevenmarkryan/status/1511489781104275456?s=1O&amp;t=LprG6-7KefKLzNX133IpjQ">https://twitter.com/stevenmarkryan/status/1511489781104275456?s=1O&amp;t=LprG6-7KefKLzNX133IpjQ</a></li> </ul></li><br> <li><a id="32"></a><a href="#32">2022-04-05</a> [group chat with Jared Birchall, &quot;Martha Twitter NomGov&quot;, and Elon Musk] <ul> <li><b>Martha</b>: I'm so thrilled you're joining the board. I apologise about the bump of the first agreement-I'm not a good manager of lawyers. I really look forward to meeting you.</li> <li><b>Elon</b>: Thanks Martha, same here.</li> </ul></li><br> <li><a id="33"></a><a href="#33">2022-04-05</a> [interleaved with above] <ul> <li><b>Bret</b>: I am excited to work with you and grateful this worked out</li> <li><b>Elon</b>: Likewise</li> </ul></li><br> <li><a id="34"></a><a href="#34">2022-04-05</a> [interleaved with above] <ul> <li><b>jack</b>: Thank you for joining!</li> <li><b>jack</b>: <a href="https://twitter.com/jack/status/1511329369473564677?s=21&amp;t=DdrUUFvJPD7Kf-jXjBogIg">https://twitter.com/jack/status/1511329369473564677?s=21&amp;t=DdrUUFvJPD7Kf-jXjBogIg</a></li> <li><b>Elon</b>: Absolutely. Hope I can be helpfull</li> <li><b>jack</b>: Immensely. Parag is an incredible engineer. The board is terrible. Always here to talk through anything you want.</li> <li><b>Elon</b>: When is a good time to talk confidentially?</li> <li><b>jack</b>: anytime</li> <li><b>Elon</b>: Thanks, great conversation!</li> <li><b>jack</b>: Always! I couldn't be happier you're doing this. I've wanted it for a long time. Got very emotional when I learned it was finally possible.</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Elon</b>: Please be super vocal if there is something dumb I'm doing or not doing. That would be greatly appreciated.</li> <li><b>jack</b>: I trust you but def will do</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>jack</b>: <a href="https://twitter.com/MattNavarra/status/1511773605239078914">https://twitter.com/MattNavarra/status/1511773605239078914</a></li> <li><b>jack</b>: Looks like there's a &quot;verified&quot; account in the swamp of despair over there. <a href="https://m.facebook.com/Elonmuskoffifref=nf&amp;pn_ref=story&amp;rc=p">https://m.facebook.com/Elonmuskoffifref=nf&amp;pn_ref=story&amp;rc=p</a> (promoting crypto too!)</li> <li><b>Elon</b>: Haha</li> <li><b>[2022-04-26] jack</b>: I want to make sure Parag is doing everything possible to build towards your goals until close. He is really great at getting things done when tasked with specific direction. Would it make sense for me you and him to get on a call to discuss next steps and get really clear on what's needed? He'd be able to move fast and clear then. Everyone is aligned and this will help even</li> <li><b>Elon</b>: Sure</li> <li><b>jack</b>: great when is best for you? And please let me know where/ifyou want my help. I just want to make this amazing and feel bound to it</li> <li><b>Elon</b>: How about 7pm Central?</li> <li><b>Elon</b>: Your help would be much appreciated</li> <li><b>Elon</b>: I agreed with everything you said to me</li> <li><b>jack</b>: Great! Will set up. I won't let this fail and will do whatever it takes. 
It's too critical to humanity.</li> <li><b>Elon</b>: Absolutely</li> <li><b>jack</b>: &lt;Attachment-image/jpeg-Screen Shot 2022-04-26 at 15.05.00.jpeg&gt;</li> <li><b>jack</b>: I put together a draft list to make the discussion efficient. Goal is to align around 1) problems we're trying to solve, 2) longterm priorities, 3) short-tenn actions, all using a higher level guide you spoke about. Think about what you'd add/remove. Getting this nailed will increase velocity.</li> <li><b>jack</b>: Here's meeting link for 7pm your time</li> <li><b>jack</b>: [meeting URL]</li> <li><b>Elon</b>: Great list of actions</li> <li><b>jack</b>: We're on hangout whenever you're ready. No rush. Just working on refining doc.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: It's asking me for a Google account login</li> <li><b>Elon</b>: You and I are in complete agreement. Parag is just moving far too slowly and trying to please people who will not be happy no matter what he does.</li> <li><b>jack</b>: At least it became clear that you can't work together. That was clarifying.</li> <li><b>Elon</b>: Yeah</li> </ul></li><br> <li><a id="35"></a><a href="#35">2022-04-06</a> <ul> <li><b>Ira Ehrenpreis [VC]</b>: If you plan on joining the Nom/Gov or Comp Committees, lmk and I can give you some tips! Haha! 🤪</li> <li><b>Elon</b>: Haha, I didn't even want to join the Twitter board! They pushed really hard to have me join.</li> <li><b>Ira</b>: You're a pushover! 😂</li> <li><b>Ira</b>: And you already got them to try the edit! Oh yeah... it had already been in the works. Sure.</li> <li><b>Elon</b>: It was actually in the works, but I didn't know.</li> </ul></li><br> <li><a id="36"></a><a href="#36">2022-04-06 to 2022-04-08</a> <ul> <li><b>Justin Roiland [co-creator of Rick and Morty]</b>: I fucking love that you are majority owner of Twitter. My friends David and Daniel have a program that verifies identity that would be nice to connect to Twitter. As in, if people chose to use it, it could verify that they are a real person and not a troll farm. I should introduce you to them.</li> <li><b>Elon</b>: I just own 9% of Twitter, so don't control the company.</li> <li><b>Elon</b>: Will raise the identity issue with Parag (CEO).</li> </ul></li><br> <li><a id="37"></a><a href="#37">2022-04-06 to 2022-04-14</a> <ul> <li><b>Gayle King [co-host for CBS Mornings and editor for The Oprah Magazine]</b>: Gayle here! Have you missed me (smile) Are you ready for to do a proper sit down with me so much to discuss! especially with your twitter play ... what do I need to do ???? Ps I like a twitter edit feature with a24 hour time limit ... we all say shit we regret want to take back in the heat of the moment ...</li> <li><b>Elon</b>: Twitter edit button is coming</li> <li><b>Gayle</b>: The whole Twitter thing getting blown out of proportion</li> <li><b>Elon</b>: Owning ~9% is not quite control</li> <li><b>Gayle</b>: I never thought that it did ... and I'm not good in math</li> <li><b>Elon</b>: Twitter should move more to the center, but Parag already thought that should be the case before I came along.</li> <li><b>Elon</b>: [&quot;laughed at&quot; &quot;I never thought...&quot;]</li> <li><b>[2022-04-14] Gayle</b>: ELON! You buying twitter or offering to buy twitter Wow! Now Don't you think we should sit down together face to face this is as the kids of today say a &quot;gangsta move&quot; I don't know know how shareholders turn this down .. 
like I said you are not like the other kids in the class ....</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>[2022-04-18] Elon</b>: Maybe Oprah would be interested in joining the Twitter board if my bid succeeds. Wisdom about humanity and knowing what is right are more important than so-called &quot;board governance&quot; skills, which mean pretty much nothing in my experience.</li> </ul></li><br> <li><a id="38"></a><a href="#38">2022-04-07 to 2022-04-08</a> [interleaved with above] <ul> <li><b>Parag</b>: A host of ideas around this merit exploration - even lower friction ones than this.</li> <li><b>Elon</b>: I have a thought about this that could take out two birds with one stone</li> <li><b>Elon</b>: Btw, what's your email?</li> <li><b>Parag</b>: [...]</li> <li><b>Parag</b>: Would you be able to do a q&amp;a for employees next week virtually? My travel is causing too long of a delay and only about 10-15% of audience will be in person so we will be optimizing for virtual anyways. Would any of Wed/Thu 11a pacific next week work for you for a 45 min video q&amp;a?- else I can suggest other times. Trying to maximize attendance across global timezones.</li> <li><b>Parag</b>: Would love to hear more when we speak nexts-do you have any availability tomorrow?</li> <li><b>Elon</b>: Sure</li> <li><b>Elon</b>: It would be great to get an update from the Twitter engineering team so that my suggestions are less dumb.</li> <li><b>Parag</b>: Yep-will set up a product+ eng conversation ahead of q&amp;a -they said, I expect most questions to not get into specific ideas / depth - but more around what you believe about the future of Twitter and why it matters, why you can personally, how to want to engage with us, what you hope to see change... -but also some from people who are upset that you are involved and generally don't like you for some reason. As you said yesterday, goal is for people to just hear you speak directly instead of make assumptions about you from media stories. Would Thursday 11a pacific work next week for the q&amp;a?</li> <li><b>Elon</b>: 11am PT on Wed works great</li> <li><b>Elon</b>: Exactly. Thurs 11 PT works.</li> <li><b>Parag</b>: Ok cool. So will confirm a convo Wed 11a PT with small eng and product leads. And the AMA on Thu 11a PT.</li> <li><b>Parag</b>: Also: my email to company about AMA leaked already+ lots of leaks from internal slack messages: <a href="https://www.washingtonpost.com/technology/2022/04/07/musk-twitter-employee-outcry/">https://www.washingtonpost.com/technology/2022/04/07/musk-twitter-employee-outcry/</a> -I think there is a large silent majority that is excited about you bring on the board, so this isn't representative. Happy to talk about it-none of this is a surprise.</li> <li><b>Elon</b>: Seedy</li> <li><b>Elon</b>: *awesome (damn autocorrect!)</li> <li><b>Elon</b>: As expected. Yeah, would be good to sync up. I can talk tomorrow night or anytime this weekend. I love our conversations!</li> <li><b>Parag</b>: I'm totally flexible after 530p pacific tomorrow -let me know what works. And yes this is expected -and I think a good thing to move us in a positive direction. Despite the turmoil internally-I think this is very helpful in moving the company forward.</li> <li><b>Elon</b>: Awesome!</li> <li><b>Elon</b>: I have a ton of ideas, but lmk if I'm pushing too hard. I just want Twitter to be maximum amazing.</li> <li><b>Parag</b>: I want to hear all the ideas -and I'll tell you which ones I'll make progress on vs. not.
And why.</li> <li><b>Parag</b>: And in this phase -just good to spend as much time with you. + have my Product and Eng team talk to you to ingest information on both sides.</li> <li><b>Elon</b>: I would like to understand the technical details of the Twitter codebase. This will help me calibrate the dumbness of my suggestions.</li> <li><b>Elon</b>: I wrote heavy duty software for 20 years</li> <li><b>Parag</b>: I used to be CTO and have been in our codebase for a long time.</li> <li><b>Parag</b>: So I can answer many many of your questions.</li> <li><b>Elon</b>: I interface way better with engineers who are able to do hardcore programming than with program managero/ MBA types of people.</li> <li><b>Elon</b>: [&quot;liked&quot; &quot;I used to be CTO...&quot;]</li> <li><b>Elon</b>: 🔥🔥</li> <li><b>Parag</b>: in our next convo-treat me like an engineer instead of CEO and lets see where we get to. I'll know after that convo who might be the best engineer to connect you to.</li> <li><b>Elon</b>: Frankly, I hate doing mgmt stuff. I kinda don't think anyone should be the boss of anyone. But I love helping solve technical/product design problems.</li> <li><b>Elon</b>: You got it!</li> <li><b>Parag</b>: Look forward to speaking tomorrow. Do you like calendar invites sent to your email address?</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: I already put the two dates on my calendar, but no problem to send me supplementary stuff.<br /></li> <li><b>Parag</b>: I'm available starting now if you want to have a chat about engineering at Twitter. Let me know!</li> <li><b>Elon</b>: Call in about 45 mins?</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Will call back shortly</li> <li><b>Elon</b>: &lt;Attachment-image/png-Screenshot 2022-04-08 at 10.10.09 PM.png&gt;</li> <li><b>Elon</b>: I am so sick of stuff like this</li> <li><b>Parag</b>: We should be catching this</li> <li><b>Elon</b>: Yeah</li> </ul></li><br> <li><a id="39"></a><a href="#39">2022-04-09 to 2022-04-24</a> <ul> <li><b>Kimbal Musk [Elon's brother and owner of The Kitchen Restaurant Group]</b>: I have an idea for a blockchain social media system that does both payments and short text messages/links like twitter. You have to pay a tiny amount to register your message on the chain, which will cut out the vast majority of spam and bots. There is no throat to choke, so free speech is guaranteed.</li> <li><b>Kimbal</b>: The second piece of the puzzle is a massive real-time database that keeps a copy of all blockchain messages in memory, as well as all message sent to or received by you, your followers and those you follow.</li> <li><b>Kimbal</b>: Third piece is a twitter-like app on your phone that accessed the database in the cloud.</li> <li><b>Kimbal</b>: This could be massive</li> <li><b>Kimbal</b>: I'd love to learn more. I've dug deep on Web3 (not crytpo as much} and the voting powers are amazing and verified. Lots you could do here for this as well</li> <li><b>Elon</b>: I think a new social media company is needed that is based on a blockchain and includes payments</li> <li><b>Kimbal</b>: Would have them pay w a token associated w the service? You'd have to hold the token in your wallet to post. Doesn't have to expensive it will grow over time in value</li> <li><b>Kimbal</b>: Blockchain prevents people from deleting tweets.
Pros and cons, but let the games begin!</li> <li><b>Kimbal</b>: If you did use your own token, you would not needs advertising it's a pay for use service but at a very low price</li> <li><b>Kimbal</b>: With scale it will be a huge business purely for the benefit of the users. I hate advertisements</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Kimbal</b>: There are some good ads out there. The voting component of interested users (only vote if you want to) could vote on ads that add value. The advertisers would have to stake a much larger amount of tokens, but other than there is no charge for the ads. It will bring out the creatives and the ads can politically incorrect/art/activision/philanthropy</li> <li><b>Kimbal</b>: Voting rights could also crowdsource kicking scammers out. It drives me crazy when I see people promoting the scam that you're giving away Bitcoin. Lots of bad people out there</li> <li><b>[2022-04-24] Elon</b>: Do you want to participate in the twitter transaction?</li> <li><b>Kimbal</b>: Let's discuss tomorrow</li> <li><b>Elon</b>: Ok</li> <li><b>Kimbal</b>: I can break away from my group a lot of the time. Will text tomorrow afternoon and if you're free we can meet up</li> <li><b>Elon</b>: Ok</li> </ul></li><br> <li><a id="40"></a><a href="#40">2022-04-09</a> [interleaved with above] <ul> <li><b>Parag</b>: You are free to tweet &quot;is Twitter dying?&quot; or anything else about Twitter -but it's my responsibility to tell you that it's not helping me make Twitter better in the current context. Next time we speak, I'd like to you provide you perspective on the level of internal distraction right now and how it hurting our ability to do work. I hope the AMA will help people get to know you, to understand why you believe in Twitter, and to trust you -and I'd like the company to get to a place where we are more resilient and don't get distracted, but we aren't there right now.</li> <li><b>Elon</b>: What did you get done this week?</li> <li><b>Elon</b>: I'm not joining the board. This is a waste of time.</li> <li><b>Elon</b>: Will make an offer to take Twitter private.</li> <li><b>Parag</b>: Can we talk?</li> </ul></li><br> <li><a id="41"></a><a href="#41">2022-04-09 to 2022-04-10</a> [interleaved with above] <ul> <li><b>Bret</b>: Parag just called me and mentioned your text conversation. Can you talk?</li> <li><b>Elon</b>: Please expect a take private offer</li> <li><b>Bret</b>: I saw the text thread. Do you have five minutes so I can understand the context? I don't currently</li> <li><b>Elon</b>: Fixing twitter by chatting with Parag won't work</li> <li><b>Elon</b>: Drastic action is needed</li> <li><b>Elon</b>: This is hard to do as a public company, as purging fake users will make the numbers look terrible, so restructuring should be done as a private company.</li> <li><b>Elon</b>: This is Jack's opinion too.</li> <li><b>Bret</b>: Can you take 10 minutes to talk this through with me? It has been about 24 hours since you joined the board. I get your point, but just want to understand about the sudden pivot and make sure I deeply understand your point of view and the path forward</li> <li><b>Elon</b>: I'm about to take off, but can talk tomorrow</li> <li><b>Bret</b>: Thank you</li> <li><b>Bret</b>: Heyo-can you speak this evening? I have seen your tweets and feel more urgency about understanding your path forward</li> <li><b>[next day] Bret</b>: Acknowledging your text with Parag yesterday that you are declining to join the board. 
This will be reflected in our 8-K tomorrow. I've asked our team to share a draft with your family office today. I'm looking forward to speaking today.</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: It is better, in my opinion, to take Twitter private, restructure and return to the public markets once that is done. That was also Jack's view when I talked to him.</li> </ul></li><br> <li><a id="42"></a><a href="#42">2022-04-12</a> <ul> <li><b>Michael Kives [Hollywood talent agent]</b>: Have any time to see Philippe Laffont in Vancouver tomorrow?</li> <li><b>Elon</b>: Maybe</li> <li><b>Michael</b>: Any particular time of best?</li> <li><b>Michael</b>: any time best?</li> <li><b>Elon</b>: What exactly does he want?</li> <li><b>Michael</b>: Has some ideas on Twitter Owns a billion of Tesla Did last 2 or 3 SpaceX rounds And-wants to get into Boring in the future (I told him to help with recruiting) You could honestly do like 20 mins in your hotel He's super smart, good guy</li> <li><b>Elon</b>: Ok, he can come by tonight. Room 1001 at Shangri-La.</li> <li><b>Michael</b>: Need to find you a great assistant! I'm headed to bed I'll tell Philippe to email you when he lands tonight in case you're still up and want to meet</li> <li><b>Michael</b>: <a href="https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;tZ4pA_Ct35ud6M60g3ng">https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;tZ4pA_Ct35ud6M60g3ng</a></li> <li><b>Michael</b>: Could be cool to do this with Sam Bankman-Fried</li> <li><b>[2022-04-28] Elon</b>: Twitter is obviously not going to be turned into some right wing nuthouse. Aiming to be as broadly inclusive as possible. Do the right thing for vast majority of Americans.</li> <li><b>Michael</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="43"></a><a href="#43">2022-04-13 to 2022-04-15</a> <ul> <li><b>Elon to Bret</b>: After several days of deliberation -this is obviously a matter of serious gravity-I have decided to move forward with taking Twitter private. I will send you an offer letter tonight, which will be public in the morning. Happy to connect you with my team if you have any questions. Thanks, Elon</li> <li><b>Bret</b>: Acknowledged</li> <li><b>Bret</b>: Confirming I received your email. Also, please use [...] going forward, my personal email.</li> <li><b>Elon</b>: Will do</li> <li><b>[2022-04-14] Bret</b>: Elon, as you saw from our press release, the board is in receipt of your letter and is evaluating your proposal to determine the course of action that it believes is in the best interest of Twitter and all of its stockholders. We will be back in touch with you when we have completed that work. Bret</li> <li><b>Elon</b>: Sounds good</li> <li><b>[2022-04-17] Bret</b>: Elon, I am just checking in to reiterate that the board Is seriously reviewing the proposal in your letter. We are working on a formal response as quickly as we can consistent with our fiduciary duties. Feel free to reach out anytime.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="44"></a><a href="#44">2022-04-14</a> <ul> <li><b>Elon to Steve Davis [President of Boring Company]</b>: My Plan B is a blockchain-based version of twitter, where the &quot;tweets&quot; are embedded in the transaction as comments. 
So you'd have to pay maybe 0.1 Doge per comment or repost of that comment.</li> <li><b>Elon</b>: <a href="https://twitter.com/elonmusk/status/1514564966564651008?s=1O&amp;t=OfO6fmJ_4DuQrOrdkKIT0gQ">https://twitter.com/elonmusk/status/1514564966564651008?s=1O&amp;t=OfO6fmJ_4DuQrOrdkKIT0gQ</a> Self</li> <li><b>Steve</b>: Amazing! Not sure which plan to root for. If Plan B wins, let me know if blockchain engineers would be helpful.</li> </ul></li><br> <li><a id="45"></a><a href="#45">2022-04-14</a> [group chat with Will MacAskill, Sam BF, and Elon Musk, interleaved with above] <ul> <li><b>Sam BF</b>: Btw Elon-would love to talk about Twitter Also a post on how blockchain+Twitter could work:</li> <li><b>Sam BF</b>: <a href="https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;t=n10hLHFilyMognjOucltw">https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;t=n10hLHFilyMognjOucltw</a></li> </ul></li><br> <li><a id="46"></a><a href="#46">2022-04-14 to 2022-04-16</a> <ul> <li><b>Marc Merrill [co-founder and President of Riot Games]</b>: <a href="https://ground.news/article/elon-musk-offers-to-buy-twitter-for-4139-billion_20a2b3">https://ground.news/article/elon-musk-offers-to-buy-twitter-for-4139-billion_20a2b3</a></li> <li><b>Marc</b>: you are the hero Gotham needs - hell F'ing yes!</li> </ul></li><br> <li><a id="47"></a><a href="#47">2022-04-14 to 2022-04-15</a> <ul> <li><b>Jason Calacanis [VC]</b>: You should raise your offer</li> <li><b>Jason</b>: $54.21</li> <li><b>Jason</b>: The perfect counter</li> <li><b>Jason</b>: You could easily clean up bots and spam and make the service viable for many more users —Removing bots and spam is a lot less complicated than what the Tesla self driving team is doing (based on hearing the last edge case meeting)</li> <li><b>Jason</b>: And why should blue check marks be limited to the elite, press and celebrities? How is that democratic?</li> <li><b>Jason</b>: The Kingdom would like a word.. <a href="https://twitter.com/Alwaleed_TalaI/status/1514615956986757127?s=20&amp;t=2q4VfMBXrldYGj3vFN_r0w">https://twitter.com/Alwaleed_TalaI/status/1514615956986757127?s=20&amp;t=2q4VfMBXrldYGj3vFN_r0w</a> 😂😂😂</li> <li><b>Jason</b>: Back of the envelope... Twitter revenue per employee: $5B rev / 8k employees = $625K rev per employee in 2021 Google revenue per employee: $257B rev2/ 135K employee2= $1.9M per employee in 2021 Apple revenue per employee: $365B rev / 154k employees= $2.37M per employee in fiscal 2021</li> <li><b>Jason</b>: Twitter revenue per employee if 3k instead of 8k: $5B rev/ 3k employees= $1.66m rev per employee in 2021 (more industry standard)</li> <li><b>Elon</b>: [&quot;emphasized&quot; above]</li> <li><b>Elon</b>: Insane potential for improvement</li> <li><b>Jason</b>: &lt;Attachment-image/gif-lMG_2241.GIF&gt;</li> <li><b>Jason</b>: Day zero</li> <li><b>Jason</b>: Sharpen your blades boys 🗡️</li> <li><b>Jason</b>: 2 day a week Office requirement= 20% voluntary departures</li> <li><b>Jason</b>: <a href="https://twitter.com/jason/status/1515094823337832448?s=1O&amp;t=CWr2U7sH4wVOsohPgjKRg">https://twitter.com/jason/status/1515094823337832448?s=1O&amp;t=CWr2U7sH4wVOsohPgjKRg</a></li> <li><b>Jason</b>: I mean, the product road map is beyond obviously</li> <li><b>Jason</b>: Premium feature abound ... and twitter blue has exactly zero [unknown emoji]</li> <li><b>Jason</b>: What committee came up with the list of dog shit features in Blue?!?
It's worth paying to turn it off</li> <li><b>Elon</b>: Yeah, what an insane piece of shit!</li> <li><b>Jason</b>: Maybe we don't talk twitter on twitter OM @</li> <li><b>Elon</b>: Was just thinking that haha</li> <li><b>Elon</b>: Nothing said there so far is anything different from what I said publicly.</li> <li><b>Elon</b>: Btw, Parag is still on a ten day vacation in Hawaii</li> <li><b>Jason</b>: No reason to cut it short... in your first tour as ceo</li> <li><b>Jason</b>: (!!!)</li> <li><b>Jason</b>: Shouldn't he be in a war room right now?!?</li> <li><b>Elon</b>: Does doing occasional zoom calls while drinking fruity cocktails at the Four Seasons count?</li> <li><b>Jason</b>: 🤔</li> <li><b>Jason</b>: <a href="https://twitter.com/jason/status/1515427935263490053?s=10&amp;t=4rQ_JIDXCDtHhOaXdGHJ5g">https://twitter.com/jason/status/1515427935263490053?s=10&amp;t=4rQ_JIDXCDtHhOaXdGHJ5g</a></li> <li><b>Jason</b>: I'm starting a DAO</li> <li><b>Jason</b>: 😂😂😂</li> <li><b>Jason</b>: Money goes to buy twitter shares, if you don't wine money goes to open source twitter competitor 😂😂😂</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-23] Elon</b>: I will be Universally beloved, since it is so easy to please everyone on twitter</li> <li><b>Jason</b>: It feels like everyone wants the same exact thing, and they will be patient and understanding of any changes ... Twitter Stans are a reasonable, good faith bunch</li> <li><b>Jason</b>: These dipshits spent a years on twitter blue to give people exactly..... Nothing they want!</li> <li><b>Jason</b>: Splitting revenue with video creators like YouTube could be huge unlock</li> <li><b>Jason</b>: We could literally give video creators 100% of their ad revenue up to $1m then do split</li> <li><b>Elon</b>: Absolutely</li> <li><b>Jason</b>: 5 Teams: 5 Northstar metrics 1. Legacy Opps: uptime, speed 2. Membership team: remove bots while getting users to pay far &quot;Real Name Memberships&quot; $5 a month $SO a year. Includes 24 hours response to customer service 3. Payments: % of users that have connected a bank account/made a deposit 4. Creator Team: get creators to publish to twitter first (musicians, You Tubers, tiktokers, etc) by giving them the best % split in the industry (and promotion) 5. Transparency Team: make the Algorithm &amp; Moderation understandable and fair</li> <li><b>Jason</b>: I think those are the 5 critical pieces ... everyone agrees to &quot;year one&quot; sprint, including coming back to offices within the first 60 days (unless given special dispensation for extraordinary contribution)</li> <li><b>Jason</b>: Hard Reboot the organization</li> <li><b>Jason</b>: Feels like no one is setting priorities ruthlessly .. 12,000 working on whatever they want?!? No projects being cancelled?!</li> <li><b>Jason</b>: Move HQ to Austin, rent gigafactory excess space</li> <li><b>Elon</b>: Want to be a strategic advisor if this works out?</li> <li><b>Elon</b>: Want to be a strategic advisor to Twitter if this works out?</li> <li><b>Jason</b>: Board member, advisor, whatever ... you have my sword</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Jason</b>: If 2, 3 or 4 unlock they are each 250b+ markets</li> <li><b>Jason</b>: Payments is $250-500b, YouTube/creators is $250b+</li> <li><b>Jason</b>: Membership no one has tried really .... So hard to estimate. 1-5m paid members maybe @ Jason $50-100 a year?
250k corporate memberships @ 10k a year?</li> <li><b>Elon</b>: You are a mind reader</li> <li><b>Jason</b>: Put me in the game coach!</li> <li><b>Jason</b>: [unclear emoji]</li> <li><b>Jason</b>: Twitter CEO is my dream job</li> <li><b>Jason</b>: <a href="https://apple.news/AIDqUaC24Sguyc9S9krWlig">https://apple.news/AIDqUaC24Sguyc9S9krWlig</a></li> <li><b>Jason</b>: we should get Mr Beast to create for twitter ... we need to win the next two generations (millennials and Z are &quot;meh&quot; on twitter)</li> <li><b>Elon</b>: For sure</li> <li><b>Jason</b>: Just had the best idea eve.for monetization ... if you pay .01 per follower per year, you can DM all your followers upto 1x a day.</li> <li><b>Jason</b>: 500,000 follows = $5,000 and 1 DM them when 1 have new podcast episode, or I'm doing an event... or my new book comes out</li> <li><b>Jason</b>: And let folks slice and dice... so, you could DM all your twitter followers in Ber1in and invite them to the GigaRave</li> <li><b>Jason</b>: Oh my lord this would unlock the power of Twitter and goose revenue massively .... Who wouldn't pay for this!?!?</li> <li><b>Jason</b>: and if you over use the tool and are annoying folks would unfollow you ... so it's got a built in Jason safe guard {unlike email spam)</li> <li><b>Jason</b>: Imagine we ask Justin Beaver to come back and let him DM his fans ... he could sell $10m in merchandise or tickets instantly. Would be INSANE for power users and companies</li> <li><b>Elon</b>: Hell yeahl!</li> <li><b>Elon</b>: It will take a few months for the deal to complete before I'm actually in control<br /></li> </ul></li><br> <li><a id="48"></a><a href="#48">2022-04-14</a> <ul> <li><b>[redacted]</b>: Hey Elon -my name is Jake Sherman. I'm a reporter with Punchbowl News in Washington I cover Congress. Wonder if you're game to talk about how Twitter would change for politics if you were at the helm?</li> </ul></li><br> <li><a id="49"></a><a href="#49">2022-04-14</a> <ul> <li><b>Adeo Ressi [VC]</b>: Would love you to buy Twitter and fix it 🙏</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="50"></a><a href="#50">2022-04-14</a> <ul> <li><b>Omead Afshar [Project Director for Office of CEO at Tesla]</b>: Thank you for what you're doing. We all love you and are always behind you! Not having a global platform that is truly free speech is dangerous for all. 
Companies are all adopting some form of content moderation and it's all dependent on ownership on how it shifts and advertisers paying them, as you've said.</li> <li><b>Omead</b>: Who knew a Saudi Arabian prince had so much leverage and so much to say about twitter.</li> </ul></li><br> <li><a id="51"></a><a href="#51">2022-04-20</a> <ul> <li><b>Elon to Brian Kingston [investment management, real estate]</b>: Not at all</li> </ul></li><br> <li><a id="52"></a><a href="#52">2022-04-20 to 2022-04-22</a> <ul> <li><b>Elon</b>: Larry Ellison is interested in being part of the Twitter take-private</li> <li><b>Jared</b>: [&quot;liked&quot; above]</li> <li><b>Jared</b>: <a href="https://www.bloomberg.com/news/articles/2022-04-19/ftx-ceo-bankman-fried-wants-to-fix-social-media-with-blockchain">https://www.bloomberg.com/news/articles/2022-04-19/ftx-ceo-bankman-fried-wants-to-fix-social-media-with-blockchain</a></li> <li><b>Jared</b>: &lt;Attachment-text/vcard - Sam Bankman-Fried.vcf&gt;</li> <li><b>Jared</b>: He seems to point to a similar blockchain based idea Also, we now have the right software engineer for you to speak with about the blockchain idea. Do you want an intro? Or just contact info?</li> <li><b>Elon</b>: Who is this person and who recommended them?</li> <li><b>Elon</b>: The engineer</li> <li><b>Elon</b>: I mean<br /></li> <li><b>Elon</b>: The idea of blockchain free speech has been around for a long time. The questions are really about how to implement it.</li> <li><b>Jared</b>: former spacex'r, current CTO at Matter Labs, a blockchain company. TBC is on the verge of hiring him.</li> <li><b>Jared</b>: <a href="https://www.linkedin.com/in/anthonykrose/">https://www.linkedin.com/in/anthonykrose/</a></li> <li><b>Elon</b>: Ok</li> <li><b>Jared</b>: best to intro you via email?</li> <li><b>Elon</b>: Yeah</li> <li><b>Jared</b>: Investor calls are currently scheduled from 1pm to 3pm. They'd like to do a brief check in beforehand.</li> <li><b>Elon</b>: Ok</li> <li><b>Elon</b>: Whatever time works</li> <li><b>Jared</b>: [&quot;liked&quot; above]</li> <li><b>Jared</b>: I'll be dialing in to the calls as well. Let me know if you prefer that I'm there in person at the house or if remote is better.</li> <li><b>Elon</b>: Remote is fine</li> </ul></li><br> <li><a id="53"></a><a href="#53">2022-04-22</a> <ul> <li><b>&quot;BL Lee&quot;</b>: i have a twitter ceo candidate for you -bill gurley/benchmark. they were early investors as well so know all the drama. want to meet him?</li> </ul></li><br> <li><a id="54"></a><a href="#54">2022-04-23</a> <ul> <li><b>Elon to James Gorman [CEO, Morgan Stanley]</b>: Thanks James, your unwavering support is deeply appreciated. Elon</li> <li><b>Elon</b>: I think the tender has a real chance</li> </ul></li><br> <li><a id="55"></a><a href="#55">2022-04-23</a> <ul> <li><b>Elon to Bret</b>: Would it be possible for you and me to talk this weekend?</li> <li><b>Elon</b>: Or any group of people from the Twitter and my side</li> <li><b>Bret</b>: Yes, that would be great. I would suggest me and Sam Britton from Goldman on our side. Do you have time this afternoon / evening?</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Whatever time works for you and Sam is good for me</li> <li><b>Bret</b>: Can we call you in 15 mins? 4:30pm PT (not sure what time zone you are in)</li> <li><b>Bret</b>: I can just call your mobile.
Let me know if you prefer Zoom, conference call code, or something different</li> <li><b>Elon</b>: Sure</li> <li><b>Elon</b>: Mobile is fine</li> <li><b>Bret</b>: [&quot;liked&quot; above]<br /></li> <li><b>Bret</b>: Just tried callings-please call whenever you are available</li> <li><b>Elon</b>: Calling shortly</li> <li><b>Bret</b>: Great thanks</li> <li><b>Elon</b>: Morgan Stanley needs to talk to me. I will call as soon as that's done.</li> <li><b>Bret</b>: No problem -here when you are ready</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Bret</b>: I understand our advisors just had a productive call. I am available to speak after you've debriefed with them.</li> <li><b>Elon</b>: Sounds good</li> </ul></li><br> <li><a id="56"></a><a href="#56">2022-04-23</a> [interleaved with above] <ul> <li><b>Joe Rogan</b>: I REALLY hope you get Twitter. If you do, we should throw a hell of a party.</li> <li><b>Elon</b>: 💯</li> </ul></li><br> <li><a id="57"></a><a href="#57">2022-04-23 to 2022-04-26</a> <ul> <li><b>&quot;Mike Pop&quot;</b>: For sure</li> <li><b>Mike</b>: defiantly things can be better and more culturally engaged</li> <li><b>Mike</b>: I think you're in a unique position to broker better AI to detect bots the second they pop up</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-26] Mike</b>: When do I start boss</li> <li><b>Elon</b>: It will take at least a few months to dose the deal</li> <li><b>Mike</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="58"></a><a href="#58">2022-04-25</a> <ul> <li><b>Bret</b>: Time for a quick check in?</li> <li><b>Bret</b>: Will call you back</li> <li><b>Bret</b>: In a bit</li> <li><b>Elon</b>: Ok</li> <li><b>Bret</b>: <a href="https://twitter.com/btaylor/status/1518664708177362944?s=lO&amp;t=9WqlCSZVMQdycPc314T">https://twitter.com/btaylor/status/1518664708177362944?s=lO&amp;t=9WqlCSZVMQdycPc314T</a></li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Elon</b>: Thank you</li> <li><b>Bret</b>: Here to make this successful in any way I can</li> <li><b>Elon</b>: [&quot;liked&quot; above]<br /></li> </ul></li><br> <li><a id="59"></a><a href="#59">2022-04-25 to 2022-04-26</a> [interleaved with some above] <ul> <li><b>Elon to Parag</b>: Can I call you later? I have the SpaceX exec staff meeting right now. Will be done in half an hour. Do you need to talk before then?</li> <li><b>Parag</b>: No -can talk in 30!</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-26] Elon</b>: Good question ...</li> <li><b>Elon</b>: <a href="https://twitter.com/norsemen62/status/1519005154204336128?s=1O&amp;t=MKtYF6Wu2sSTdoWWqThEDg">https://twitter.com/norsemen62/status/1519005154204336128?s=1O&amp;t=MKtYF6Wu2sSTdoWWqThEDg</a></li> </ul></li><br> <li><a id="60"></a><a href="#60">2022-04-25</a> [interleaved with some above] <ul> <li><b>jack</b>: Thank you ❤️</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Elon</b>: I basically following your advice!</li> <li><b>jack</b>: I know and I appreciate you. This is the right and only path.
I'll continue to do whatever it takes to make it work.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="61"></a><a href="#61">2022-04-25</a> <ul> <li><b>Elon to Tim Urban [creator of Wait But Why]</b>: Absolutely</li> <li><b>Tim</b>: i haven't officially started my podcast yet but if you think it would be helpful, i'd be happy to record a conversation with you about twitter to ask some of the most common questions and let you expand upon your thoughts Tim Urban</li> <li><b>Tim</b>: but only if it would be helpful to you</li> <li><b>Elon</b>: Suee</li> <li><b>Tim</b>: Any day or time that's best for you? And best location? I'm in LA but can zip over to Austin if you're there.</li> <li><b>Elon</b>: Probably in a few weeks</li> <li><b>Tim</b>: [&quot;liked&quot; above]<br /></li> </ul></li><br> <li><a id="62"></a><a href="#62">2022-04-25</a> <ul> <li><b>Michael Grimes [IBanker at Morgan Stanley]</b>: Do you have 5 minutes to connect on possible meeting tomorrow I believe you will want to take?</li> <li><b>Elon</b>: Will call in about half an hour</li> <li><b>Michael</b>: Sam Bankman Fried is why I'm calling <a href="https://twitter.com/sbf_ftx/status/1514588820641128452">https://twitter.com/sbf_ftx/status/1514588820641128452</a> <a href="https://www.vox.com/platform/amp/recode/2021/3/20/22335209/sam-bank.man-fried-joe-biden-ftx-cryptocurrency-effective-altruism">https://www.vox.com/platform/amp/recode/2021/3/20/22335209/sam-bank.man-fried-joe-biden-ftx-cryptocurrency-effective-altruism</a> <a href="https://ftx.us">https://ftx.us</a></li> <li><b>Elon</b>: ??</li> <li><b>Elon</b>: I'm backlogged with a mountain of critical work matters. ls this urgent?</li> <li><b>Michael</b>: Wants 1-Sb. Serious about partner w/you. Same security you own</li> <li><b>Michael</b>: Not urgent unless you want him to fly tomorrow. He has a window tomorrow then he's wed-Friday booked</li> <li><b>Michael</b>: Could do $5bn if everything vision lock. Would do the engineering for social media blockchain integration. Founded FTX crypto exchange. Believes in your mission. Major Democratic donor. So thought it was potentially worth an hour tomorrow a la the Orlando meeting and he said he could shake hands on 5 if you like him and I think you will. Can talk when you have more time not urgent but if tomorrow works it could get us $5bn equity in an hour</li> <li><b>Elon</b>: Blockchain twitter isn't possible, as the bandwidth and latency requirements cannot be supported by a peer to peer network, unless those &quot;peers&quot; are absolutely gigantic, thus defeating the purpose of a decentralized network.</li> <li><b>Elon</b>: [&quot;disliked&quot; &quot;Could do $5bn ...&quot;]</li> <li><b>Elon</b>: So long as I don't have to have a laborious blockchain debate</li> <li><b>Elon</b>: Strange that Orlando declined</li> <li><b>Elon</b>: Please let him know that I would like to talk and understand why he declined</li> <li><b>Elon</b>: Does Sam actually have $3B liquid?</li> <li><b>Michael</b>: I think Sam has it yes. He actually said up to 10 at one point but in writing he said up to 5. He's into you. And he specifically said the blockchain piece is only if you liked it and not gonna push it. Orlando referred Sams interest to us and will be texting you to speak to say why he (Orlando) declined. We agree orlando needs to call you and explain given everything he said to us and you. Will make that happen We can push Sam to next week but I do believe you will like him.
Ultra Genius and doer builder like your formula. Built FTX from scratch after MIT physics. Second to Bloomberg in donations to Biden campaign.</li> </ul></li><br> <li><a id="63"></a><a href="#63">2022-04-25 to 2022-04-28</a> <ul> <li><b>Elon to David Sacks [VC]</b>: <a href="https://twitter.com/dineshdsouza/status/1518744328205647872?s=10&amp;t=vkagBUrJJexF_SJDOC_LUw">https://twitter.com/dineshdsouza/status/1518744328205647872?s=10&amp;t=vkagBUrJJexF_SJDOC_LUw</a></li> <li><b>David Sacks</b>: RT'd</li> <li><b>David</b>: Justin Amash (former congressman who's liberterian and good on free speech) asked far an intro to you: &quot;I believe I can be helpful to Twitter's team going forward-thinking about how to handle speech and moderation, how that intersects with ideas about governance, how to navigate actual government (including future threats to Section 230), etc.-and I'd love to connect with Elon if he's interested in connecting (I don't have a direct way to contact him). I believe my experience and expertise can be useful, and my genera! outlook aligns with his stated goals. Thanks. All the best.&quot; Please LMK if you want to connect with him.</li> <li><b>David</b>: <a href="https://twitter.com/justinamash?s=21&amp;t=_Owbgwdot71pUtC4rJUXYg">https://twitter.com/justinamash?s=21&amp;t=_Owbgwdot71pUtC4rJUXYg</a></li> <li><b>Elon</b>: I don't own twitter yet</li> <li><b>David</b>: Understood.</li> <li><b>[2022-04-28] Elon</b>: Do you and/or find want to invest in the take private?</li> <li><b>Elon</b>: *fund</li> <li><b>David</b>: Yes but I don't have a vehicle for it (Craft is venture} so either I need to set up an SPV or just do it personally. If the latter, my amount would be mice-nuts in relative terms but I would be happy to participate to support the cause.</li> <li><b>Elon</b>: Up to you</li> <li><b>David</b>: Ok cool, let me know.</li> <li><b>David</b>: I'm in personally and will raise an SPV too if that works for you.</li> <li><b>Elon</b>: Sure</li> </ul></li><br> <li><a id="64"></a><a href="#64">2022-04-26</a> <ul> <li><b>James Murdoch [Rupert Murdoch's son]</b>: Thank you. I will link you up. Also will call when same of the dust settles. Hope all's ok..</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li>2022-04-26 [group chat with Kathryn Murdoch, James Murdoch, and Elon Musk] <ul> <li><b>Kathryn Murdoch [James Murdoch's wife]</b>: Will you bring back Jack?</li> <li><b>Elon</b>: Jack doesn't want to come back. He is focused on Bitcoin.</li> </ul></li><br> <li><a id="65"></a><a href="#65">2022-04-27</a> <ul> <li><b>[redacted]</b>: Hi Elon, This is Maddie, Larry Ellison's assistant. Larry asked that I connect the head of his family office, Paul Marinelli, with the head of yours. Would you please share their contact details? Alternatively, please provide them with Paul's: Cell: [...] Email: [...]</li> <li><b>Elon</b>: [Jared's email]</li> <li><b>[redacted]</b>: Thank you.</li> </ul></li><br> <li><a id="66"></a><a href="#66">2022-04-27</a> <ul> <li><b>Marc Benioff [co-founder, chair, co-CEO of Salesforce, along with Bret Taylor]</b>: Happy to talk about it if this is interesting: Twitter conversational OS-the townsquare for your digital life.</li> <li><b>Elon</b>: Well I don't own it yet</li> </ul></li><br> <li><a id="67"></a><a href="#67">2022-04-27</a> <ul> <li><b>Reid Hoffman [VC]</b>: Great.
I will put you in touch with Satya.</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Do you want to invest in Twitter take private?</li> <li><b>Reid</b>: It's way beyond my resources. I presume you are not interested in ventures$.</li> <li><b>Elon</b>: There is plenty of financial support, but you're a friend, so just letting you know you'd get priority. VC money is fine if you want.</li> <li><b>Reid</b>: Very cool! OK -if I were to put together$, what size could you make available? [unclear emoji, some kind of smiling face]</li> <li><b>Elon</b>: Whatever you'd like. I will just cut back others.</li> <li><b>Elon</b>: I would need to know the approximate by next week</li> <li><b>Reid</b>: What would be the largest $ that would be ok? I consulted with our LPs, and I have strong demand. Would be fun!</li> <li><b>Elon</b>: $2B?</li> <li><b>Reid</b>: Great. Probably doable -let me see.</li> <li><b>Elon</b>: Can be less if easier. The round is oversubscribed, so I just have to tell other investors what their allocation is ideally bv early next week.</li> <li><b>Elon</b>: Should I connect you with the Morgan Stanley team?</li> <li><b>Reid</b>: Yes please. Especially with the terms, etc. I know Michael Grimes, btw.</li> <li><b>Elon</b>: Please feel free to call him directly</li> <li><b>[group chat message connecting Reid to Jared Birchall]</b></li> <li><b>Reid</b>: OK-I'll do that. (Trying to simplify your massively busy life.) The Morgan Stanley deal team is truly excellent and I don't say such things lightly.</li> <li><b>[group chat messages between Reid and Jared to exchange emails]</b></li> <li><b>Reid</b>: Indeed! I took U public and the MSFT-U deal with them!</li> </ul></li><br> <li><a id="68"></a><a href="#68">2022-04-27</a> [interleaved with above] <ul> <li><b>Viv Huantusch</b>: From a social perspectives-Twitter allowing for high quality video uploads (1080p at a minimum) &amp; adding a basic in-app video editor would have quite a big impact I think. Especially useful for citizen journalism &amp; fun educational content. Might even help Twitter regain market share lost to TikTok</li> <li><b>Elon</b>: Agreed</li> <li><b>Elon</b>: Twrtter can't monetize video yet, so video is a loss for Twitter and for the those who post</li> <li><b>Elon</b>: Twitter needs better guidance</li> <li><b>Viv</b>: Yeah, 100%</li> <li><b>Viv</b>: They should have a subscription that's actually useful (unlike Twitter Blue haha)</li> <li><b>Elon</b>: Totally</li> </ul></li><br> <li><a id="69"></a><a href="#69">2022-04-27</a> [group chat with &quot;Satya&quot; [presumed to be Satya Nadella CEO of Microsoft], Reid Hoffman, and Elon Musk, interleaved with above] <ul> <li><b>Elon, Satya</b>: as indicated, this connects the two of you by text and phone.</li> <li><b>Satya</b>: Thx Reid. Efon -will text and coordinate a time to chat. Thx</li> </ul></li><br> <li><a id="70"></a><a href="#70">2022-04-27</a> [interleaved with above] <ul> <li><b>Satya</b>: Hi Elon .. Let me know when you have time to chat. can do tomorrow evening or weekend. Look forward to it. ThxSatya</li> <li><b>Elon</b>: I can talk now if you want</li> <li><b>Satya</b>: Calling</li> <li><b>Satya</b>: Thx for the chat. Will stay in touch. And will for sure follow-up on Teams feedback!</li> <li><b>Elon</b>: sounds good :)</li> </ul></li><br> <li><a id="71"></a><a href="#71">2022-04-27</a> [interleaved with above] <ul> <li><b>Elon to Brian Acton [interim CEO of Signal]</b>: Trying to figure out what to do with Twitter DMs. They should be end to end encrypted (obv).
Dunno if better to have redundancy with Signal or integrate it.</li> </ul></li><br> <li><a id="72"></a><a href="#72">2022-04-27</a> [interleaved with above] <ul> <li><b>Elon to Bret Taylor</b>: I'd like to convey some critical elements of the transition plan. Is there a good time for us to talk tonight? Happy to have anyone from Twitter on the call.</li> <li><b>Elon</b>: My biggest concern is headcount and expense growth. Twitter has ~3X the head count per unit of revenue of other social media companies, which is very unhealthy in my view.</li> </ul></li><br> <li><a id="73"></a><a href="#73">2022-04-28</a> [group chat with Jared Birchall, Sam BF, and Elon Musk] <ul> <li><b>Jared Birchall</b>: Elon -connecting you with SBF.</li> <li><b>Sam BF</b>: Hey!</li> </ul></li><br> <li><a id="74"></a><a href="#74">2022-04-29</a> <ul> <li><b>Steve Jurvetson [VC]</b>: <a href="https://www.linkedin.com/in/emilmichael/">https://www.linkedin.com/in/emilmichael/</a></li> <li><b>Steve</b>: If you are looking for someone to run the Twitter revamping .... perhaps as some kind of CXO under you ... Emil Michael is a friend that just offered that idea. Genevieve loved working for him at Klout. He went on to become Chief Business Officer of Uber for 2013-17.</li> <li><b>Elon</b>: I don't have a Unkedln account</li> <li><b>Elon</b>: I don't think we will have any CXO titles</li> <li><b>Steve</b>: OK. Are you looking to hire anyone, or do you plan to run it?</li> <li><b>Steve</b>: &lt;Attachment-image/jpeg-Screen Shot 2022-04-29 at 5.49.53 PM.jpeg&gt;</li> <li><b>Steve</b>: This is his experience prior to Uber:</li> <li><b>Elon</b>: Please send me anyone who actually writes good software</li> <li><b>Steve</b>: Ok, no management; good coders, got it.</li> <li><b>Elon</b>: Yes</li> <li><b>Elon</b>: Twitter is a software company (or should be)</li> <li><b>Steve</b>: Yes. My son at Reddit and some other young people come to mind. I was thinking about who is going to manage the software people (to prioritize and hit deadlines), and I guess that's you.</li> <li><b>Elon</b>: I will oversee software development</li> </ul></li><br> </ul> <h3 id="exihibit-j">Exhibit J</h3> <ul> <li><a id="75"></a><a href="#75">2022-04-04 to 2022-04-14</a> <ul> <li><b>Mathias Döpfner</b>: 👍</li> <li><b>[2022-04-14] Mathias</b>: &lt;Attachment - application/vnd.openxmlformats-officedocument.wordprocessingml.document-Twitter_lnterview.doc&gt;</li> </ul></li><br> <li><a id="76"></a><a href="#76">2022-04-11</a> <ul> <li><b>Kimbal Musk</b>: Great to hang yesterday. I'd love to help think through the structure for the Doge social media idea Let me know how I can help</li> <li><b>Elon</b>: Ok</li> </ul></li><br> <li><a id="77"></a><a href="#77">2022-04-14</a> <ul> <li><b>Elon to Marc Merrill</b>: [&quot;loved&quot; &quot;you are the hero Gotham needs -hell F'ing yes!&quot;]</li> </ul></li><br> <li><a id="78"></a><a href="#78">2022-04-14</a> <ul> <li><b>Elon to Steve Davis</b>: [&quot;liked&quot; &quot;Amazing! Not sure which plan to root for. If Plan B wins, let me know if blockchain engineers would be helpful.&quot;]</li> </ul></li><br> <li><a id="79"></a><a href="#79">2022-04-15</a> <ul> <li><b>Elon to Omead Afshar</b>: [&quot;laughed at&quot; &quot;Who knew a Saudi Arabian prince had so much leverage and so much to say about twitter.&quot;]</li> </ul></li><br> <li><a id="80"></a><a href="#80">2022-04-20</a> <ul> <li><b>Brian Kingston</b>: Hi Elon-it's Brian Kingston at Brookfield.
There was an artide today in the FT that said we (Brookfield) have &quot;decided against providing an equity cheque• for a Twitter buyout. I Just wanted to let you know that didn't came from us-we would never comment (on or off the record) about something like that, particularly when it relates to one of our partners. We appreciate all that we are doing on solar together and you allowing us to participate in the Boring Co raise this week. While I'm sure you don't believe anything you read in the FT anyway, I'm sorry if the article caused any aggravation. If there is anything we can do to be helpful, please do let me know.</li> </ul></li><br> <li><a id="81"></a><a href="#81">2022-04-23 to 2022-05-09</a> <ul> <li><b>Michael Grimes</b>: Michael Grimes here so you have my number and know who is calling. Dialing you now</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Michael</b>: <a href="https://youtu.be/DOW1V0kOELA">https://youtu.be/DOW1V0kOELA</a></li> <li><b>Elon</b>: [&quot;laughed at&quot; above]</li> <li><b>Michael</b>: If you have a second to chat</li> <li><b>Michael</b>: Perfect.</li> <li><b>Michael</b>: got it. Will forward the equity Interest email to Jared and Alex that he sent in and have It in the queue in the event his interest is needed overall. Absent the blockchain piece he's focused on investing if you want his interest in Twitter and your mission but we can park him for now.</li> <li><b>Michael</b>: Agree. Was one piece of equation and I do think he would be at least3bn if you like him and want him, maybe more. Will work with Jared and Alex to be sure it makes sense to meet -my instinct is it does because Orlando Brace also declined today in the end (not sure if political fears or what but he fiaked today}.</li> <li><b>[2022-05-04] Elon</b>: No response from Bret, not even an interest in talking. I think it's probably best to release the debt tomorrow. This might take a while.</li> <li><b>Michael Grimes</b>: Nikesh came to see me this afternoon. Just to talk Twitter and you. If you had the time he would cancel his plans tomorrow night to meet with you and come to where you are in SF or mid peninsula Or he could fly to Austin another time of course. If you want me to send him to you let me know and he will break his lans to do</li> <li><b>Elon</b>: It's fine, no need to break his plans.</li> <li><b>Michael</b>: Got it.</li> <li><b>Michael</b>: I asked Pat and Kristina to each spend the weekend writing up their transition and diligence plan and how to approach debt rating agencies on may 16. We need one of them signed up (employment contract for 3 months) as Transition CFO of X Holdings and owning the model and diligence from financial point of view on the follow up meetings with Twitter on costs and users and engineers etc. We believe two will not work at the agencies or in front of debt investors as you have to have one CFO. If you were willing to have SVP Ops of X Holdings (Pat would be more qualified for that than Kristina I then it's possible to retain them both for the transition. The way to stay on ludicrous speed Is to pick one of them tomorrow as transition CFO and then we run with it full metal jacket. I believe each can do the job and deliver the ratings and debt and transition plan for day one.
Then you dismiss him/her as job well done or offer permanent CFO if you choose.</li> <li><b>Elon</b>: Neither were great.</li> <li><b>Elon</b>: They asked no good questions and had no good comments.</li> <li><b>Elon</b>: Let's slow down just a few days</li> <li><b>Elon</b>: Putin's speech tomorrow is extremely important</li> <li><b>Elon</b>: It won't make sense to buy Twitter if we're headed into WW3</li> <li><b>Elon</b>: Just sayin</li> <li><b>Michael</b>: Understood. If the pace stays rapid each are good enough to get job done for the debt Then you hire great for go forward. But will pause for May 9 Vladimir and hope for the best there. We can take stock of where things look after that.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: An extremely fundamental due diligence item is understanding exactly how Twitter confims that 95% of their daily active users are both real people and not double-counted.</li> <li><b>Elon</b>: They couldn't answer that on Friday, which is insane.</li> <li><b>Elon</b>: If that number is more like 50% or lower, which is what I would guess based on my feed, then they have been fundamentally misrepresenting the value of Twitter to advertisers and investors.</li> <li><b>Elon</b>: To be super clear, this deal moves forward if it passes due diligence, but obviously not if there are massive gaping issues.</li> <li><b>Elon</b>: True user account is a showstopper if actually much lower than the 95% claimed</li> <li><b>Elon</b>: Parag said thatTwitter has 2500 coders doing at least 100 lines per month. Maybe they could fit this feature in ... <a href="https://twitter.com/skylerrainnj/status/1523616659365277698?s=1O&amp;t=1qmVNhjQPeHafBPEHiFrRQ">https://twitter.com/skylerrainnj/status/1523616659365277698?s=1O&amp;t=1qmVNhjQPeHafBPEHiFrRQ</a></li> </ul></li><br> <li><a id="82"></a><a href="#82">2022-04-25</a> <ul> <li><b>Adeo Ressi</b>: &lt;Attachment-image/jpego-Elon Musk and Twitter Reach Deal on Sale Live Up....jpeg&gt;</li> <li><b>Adeo</b>: Congrats? This will be a good thing.</li> <li><b>Elon</b>: I hope so :)</li> <li><b>Adeo</b>: You've had ideas on how to fix that companyfor A LONG TIME. The time is now.</li> <li><b>Adeo</b>: I think it's exciting.</li> </ul></li><br> <li><a id="83"></a><a href="#83">2022-04-25</a> <ul> <li><b>James Musk</b>: Congrats! Super important to solve the bot problem.</li> <li><b>Elon</b>: Thanks</li> <li><b>Elon</b>: The bot problem is severe</li> </ul></li><br> <li><a id="84"></a><a href="#84">2022-04-27</a> <ul> <li><b>Elon to Reid Hoffman</b>: This is Elon</li> </ul></li><br> <li><a id="85"></a><a href="#85">2022-05-01</a> <ul> <li><b>Elon to Sean Parker [VC and founder of Napster]</b>: Am at my Mom's a apartment, doing Twitter dilligence calls</li> </ul></li><br> <li><a id="86"></a><a href="#86">2022-05-02</a> <ul> <li><b>Jason Calacanis</b>: <a href="https://twitter.com/elonmusk/status/1521158715193315328?s=1O&amp;t=htc_On6KY9B9C4VtllFIO">https://twitter.com/elonmusk/status/1521158715193315328?s=1O&amp;t=htc_On6KY9B9C4VtllFIO</a></li> <li><b>Jason</b>: one thing you can do in this regard is an SPV of 250 folks capped at $10m .. 
pain on the next for a large company but one item on cap table</li> <li><b>Jason</b>: You do have to have someone lead/man a e the SPV</li> <li><b>Elon</b>: Go ahead</li> <li><b>Jason</b>: [&quot;liked&quot; above]</li> <li><b>Jason</b>: When you're private its fairly easy to do, but I think current shareholders have to re-up</li> <li><b>Elon</b>: Are you sure?</li> <li><b>Jason</b>: I am not</li> <li><b>Jason</b>: Have never done a take private</li> <li><b>Jason</b>: Large shareholders (QPs) are likely different than non-accredited investors</li> <li><b>[2022-05-12] Elon</b>: What's going on with ou marketing an SPV to randos? This is not ok.</li> <li><b>Jason</b>: Not randos, I have the largest angel syndicate and that's how I invest. We've done 25D+ deals like this and we know all the folks. I though that was how folks were doin it.</li> <li><b>Jason</b>: $100m+ on commitments, but if that not ok it's fine. Just wanted to support the effort.</li> <li><b>Jason</b>: ~300 QPs and 200 accredited investors said they would do it. It's not an open process obviously, only folks alread in our syndicate.</li> <li><b>Jason</b>: There is <em>massive</em> demand to support uyour effort btw...people really want to see you win.</li> <li><b>Elon</b>: Morgan Stanley and Jared thing you are using our friendship not in a good way</li> <li><b>Elon</b>: This makes it seem like I'm desperate</li> <li><b>Elon</b>: Please sto</li> <li><b>Jason</b>: Only ever want to support you.</li> <li><b>Jason</b>: Clearly you're not desperate -you have the worlds greatest investors voting in support of a deal you already have covered. you're overfunded. will quietly cancel it... And to be clear, I'm not out actively soliciting folks.These are our exiting LPs not rondos. Sorry forany trouble</li> <li><b>Elon</b>: Morgan Stanley and Jared are very upset</li> <li><b>Jason</b>: Ugh</li> <li><b>Jason</b>: SPVs are how everyone is doing there deals now... Like loved to SPVs etc</li> <li><b>Jason</b>: Just trying to support you... obviously, I reached out to Jared and sort it out.</li> <li><b>Jason</b>: * moved</li> <li><b>Elon</b>: Yes, I had to ask him to stop.</li> <li><b>Elon</b>: [&quot;liked&quot; &quot;Just trying to support...&quot;]</li> <li><b>Jason</b>: Cleaned it up with Jared</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Jason</b>: I get where he is coming from.... Candidly, This deal has just captures the worlds imagination In an unimaginable way. It's bonkers...</li> <li><b>Jason</b>: And you know I'm ride or die brother - I'd jump on a grande for you</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="87"></a><a href="#87">2022-05-05</a> <ul> <li><b>Elon to Sam BF</b>: Sorry, who is sending this message?</li> </ul></li><br> <li><a id="88"></a><a href="#88">2022-05-05</a> <ul> <li><b>Elon to James Murdoch</b>: In LA right now.
SF tomorrow to due dilligence on Twitter.</li> </ul></li><br> <li><a id="89"></a><a href="#89">2022-05-05</a> <ul> <li><b>Elon to John Elkann [heir of Gianni Agnelli]</b>: Sorry, I have to be at Twitter HQ tomorrow afternoon for due dilligence.</li> </ul></li><br> <li><a id="90"></a><a href="#90">2022-05-05</a> <ul> <li><b>David Sacks</b>: [&quot;liked&quot; &quot;Best to be low-key during transaction &quot;]</li> </ul></li><br> <li><a id="91"></a><a href="#91">2022-05-10</a> [unclear what's happening due to redactions; may be more than one convo here] <ul> <li><b>Antonio Gracias</b>: Connecting you.</li> <li><b>[redacted]</b>: Hi Elon This is Peter and my numbers. Look forward to being helpful Bob</li> <li><b>Elon</b>: Got it</li> <li><b>Elon</b>: Should we use the above two numbers for the conf call?</li> <li><b>[redacted]</b>: Sure</li> </ul></li><br> <li><a id="92"></a><a href="#92">2022-06-16 to 2022-06-17</a> <ul> <li><b>[redacted]</b>: If I understood them correctly, Ned [presumebly Ned Segal, CFO of Twitter] and Parag said that cash expenditures over the next 12 months will be $78 and that cash receipts will also be $78. However, the cash receipts number doesn't seem realistic, given that they expect only $1.2B in CU, which is just $4.8B annualized.</li> <li><b>[redacted]</b>: In europe so just getting your msg. i do not have proxy w me but my guess is they are using their proxy numbers vs current reality. we are developing proformas that have lower revenue/receipts and lower disbursements.</li> <li><b>Elon</b>: Ok. Given that Q2 is almost over, itobviousl doesn't make sense for them to use proxy numbers vs [looks like something is cut off here — seems like text is in a spreadsheet and word wrap wasn't used on this row, which was then printed and scanned in]</li> <li><b>Elon</b>: I'm traveling in Europe right now, but back next week</li> <li><b>[redacted]</b>: i spokewned on the 7b receipts/expenses. he said he was trying to be more illustrative on '23 expense base, pre any actions we would take and provide a simplified strawman of possible savings. he said they are not planning on doing an updated fcst for 22/23. i think this Is ok re process since i think their fcst would not likely be very good and we wouldn't likely use It anyways. They fly at way too high a level to have a fcst of much value. We are in process of developing revenue fcst and a range of sensitivities and will then walk thru w them to get their input.</li> <li><b>Elon</b>: Their revenue projections seem disconnected from reality</li> <li><b>[redacted]</b>: completely.</li> <li><b>Elon</b>: Phew, it's not just me</li> </ul></li><br> </ul> <h3 id="equity-financing-commitments-from-2022-05-05-sec-filing-https-www-sec-gov-archives-edgar-data-1418091-000110465922056055-tm2214608-1-sc13da-htm"><a href="https://www.sec.gov/Archives/edgar/data/1418091/000110465922056055/tm2214608-1_sc13da.htm">Equity / financing commitments from 2022-05-05 SEC filing</a></h3> <p>If you're curious about the outcomes of the funding discussions above, the winners are listed in the Schedule 13D</p> <ul> <li><b>HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud (Kingdom)</b>: ~$1.9B</li> <li><b>Lawrence J. Ellison Revocable Trust</b>: $1B</li> <li><b>Sequoia Capital Fund, L.P.</b>: $0.8B</li> <li><b>VyCapital</b>: $0.7B</li> <li><b>Binance</b>:$0.5B</li> <li><b>AH Capital Management, L.L.C. 
(a16z)</b>: $0.4B</li> <li><b>Qatar Holding LLC</b>: $0.375B</li> <li><b>Aliya Capital Partners LLC</b>: $0.36B</li> <li><b>Fidelity Management & Research Company LLC</b>: ~$0.316B</li> <li><b>Brookfield</b>: $0.25B</li> <li><b>Strauss Capital LLC</b>: $0.15B</li> <li><b>BAMCO, Inc. (Baron)</b>: $0.1B</li> <li><b>DFJ Growth IV Partners, LLC</b>: $0.1B</li> <li><b>Witkoff Capital</b>: $0.1B</li> <li><b>Key Wealth Advisors LLC</b>: $0.03B</li> <li><b>A.M. Management & Consulting</b>: $0.025B</li> <li><b>Litani Ventures</b>: $0.025B</li> <li><b>Tresser Blvd 402 LLC (Cartenna)</b>: $0.0085B</li> <li><b>Honeycomb Asset Management LP</b>: $0.005B</li></ul> <p><i>Thanks to @tech31842, @agentwaj, and mr. zip for OCR corrections</i></p> Futurist prediction methods and accuracy futurist-predictions/ Mon, 12 Sep 2022 00:00:00 +0000 futurist-predictions/ <p>I've been reading a lot of predictions from people who are looking to understand what problems humanity will face 10-50 years out (and sometimes longer) in order to work in areas that will be instrumental for the future and wondering how accurate these predictions of the future are. The timeframe of predictions that are so far out means that only a tiny fraction of people making those kinds of predictions today have a track record so, if we want to evaluate which predictions are plausible, we need to look at something other than track record.</p> <p>The idea behind the approach of this post was to look at predictions from an independently chosen set of predictors (Wikipedia's list of well-known futurists<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">1</a></sup>) whose predictions are old enough to evaluate in order to understand which prediction techniques worked and which ones didn't work, allowing us to then (mostly in a future post) evaluate the plausibility of predictions that use similar methodologies.</p> <p>Unfortunately, every single predictor from the independently chosen set had a poor record and, on spot checking some predictions from other futurists, it appears that futurists often have a fairly poor track record of predictions so, in order to contrast techniques that worked with techniques that didn't, I sourced predictors that have a decent track record from my memory, a non-independent source, which introduces quite a few potential biases.</p> <p>Something that gives me more confidence than I'd otherwise have is that I avoided reading independent evaluations of prediction methodologies until after I did the evaluations for this post and wrote 98% of the post and, on reading other people's evaluations, I found that I generally agreed with Tetlock's <a href="https://amzn.to/3xzG3a2">Superforecasting</a> on what worked and what didn't work despite using a wildly different data set.</p> <p>In particular, people who were into &quot;big ideas&quot; who use a few big hammers on every prediction combined with a <a href="cocktail-ideas/">cocktail party idea</a> level of understanding of the particular subject to explain why a prediction about the subject would fall to the big hammer generally fared poorly, whether or not their favored big ideas were correct. Some examples of &quot;big ideas&quot; would be &quot;environmental doomsday is coming and hyperconservation will pervade everything&quot;, &quot;economic growth will create near-infinite wealth (soon)&quot;, &quot;Moore's law is supremely important&quot;, &quot;quantum mechanics is supremely important&quot;, etc. 
Another common trait of poor predictors is lack of anything resembling serious evaluation of past predictive errors, making improving their intuition or methods impossible (unless they do so in secret). Instead, poor predictors often pick a few predictions that were accurate or at least vaguely sounded similar to an accurate prediction and use those to sell their next generation of predictions to others.</p> <p>By contrast, people who had (relatively) accurate predictions had a deep understanding of the problem and also tended to have a record of learning lessons from past predictive errors. Due to the differences in the data sets between this post and Tetlock's work, the details are quite different here. The predictors that I found to be relatively accurate had deep domain knowledge and, implicitly, had access to a huge amount of information that they filtered effectively in order to make good predictions. Tetlock was studying people who made predictions about a wide variety of areas that were, in general, outside of their areas of expertise, so what Tetlock found was that people really dug into the data and deeply understood the limitations of the data, which allowed them to make relatively accurate predictions. But, although the details of how people operated are different, at a high-level, the approach of really digging into specific knowledge was the same.</p> <p>Because this post is so long, this post will contain a very short summary about each predictor followed by a moderately long summary on each predictor. Then we'll have a summary of what techniques and styles worked and what didn't work, with the full details of the prediction grading and comparisons to other evaluations of predictors in the appendix.</p> <ul> <li>Ray Kurzweil: 7% accuracy <ul> <li>Relies on: exponential or super exponential progress that is happening must continue; predicting the future based on past trends continuing; optimistic &quot;rounding up&quot; of facts and interpretations of data; <a href="https://twitter.com/arxanas/status/1560756277231644673">panacea thinking</a> about technologies and computers; cocktail party ideas on topics being predicted</li> </ul></li> <li>Jacque Fresco: predictions mostly too far into the future to judge, but seems very low for judgeable predictions <ul> <li>Relies on: panacea thinking about human nature, the scientific method, and computers; certainty that human values match Fresco's values</li> </ul></li> <li>Buckminster Fuller: too few predictions to rate, but seems very low for judgeable predictions <ul> <li>Relies on: cocktail party ideas on topics being predicted to an extent that's extreme even for a futurist</li> </ul></li> <li>Michio Kaku: 3% accuracy <ul> <li>Relies on: panacea thinking about &quot;quantum&quot;, computers, and biotech; exponential progress of those</li> </ul></li> <li>John Naisbitt: predictions too vague to score; mixed results in terms of big-picture accuracy, probably better than any futurist here other than Dixon, but this is not comparable to the percentages given for other predictors <ul> <li>Relies on: trend prediction based on analysis of newspapers</li> </ul></li> <li>Gerard K. 
O'Neill: predictions mostly too far into the future to judge, but seems very low for judgeable predictions <ul> <li>Relies on: doing the opposite of what other futurists had done incorrectly, could be described as &quot;trying to buy low and sell high&quot; based on looking at prices that had gone up a lot recently; optimistic &quot;rounding up&quot; of facts and interpretations of data in areas O'Neill views as underrated; cocktail party ideas on topics being predicted</li> </ul></li> <li>Patrick Dixon: 10% accuracy; also much better at &quot;big picture&quot; predictions than any other futurist here (but not in the same league as non-futurist predictors such as Yegge, Gates, etc.) <ul> <li>Relies on: extrapolating existing trends (but with much less optimistic &quot;rounding up&quot; than almost any other futurist here); exponential progress; stark divide between &quot;second millennial thinking&quot; and &quot;third millennial thinking&quot;</li> </ul></li> <li>Alvin Toffler: predictions mostly too vague to score; of non-vague predictions, Toffler had an incredible knack for naming a trend as very important and likely to continue right when it was about to stop <ul> <li>Relies on: exponential progress that is happening must continue; a medley of cocktail party ideas inspired by speculation about what exponential progress will bring</li> </ul></li> <li>Steve Yegge: 50% accuracy; general vision of the future generally quite accurate <ul> <li>Relies on: deep domain knowledge, font of information flowing into Amazon and Google; looking at what's trending</li> </ul></li> <li>Bryan Caplan: 100% accuracy <ul> <li>Relies on: taking the &quot;other side&quot; of bad bets/predictions people make and mostly relying on making very conservative predictions</li> </ul></li> <li>Bill Gates / Nathan Myhrvold / old MS leadership: timeframe of predictions too vague to score, but uncanny accuracy on a vision of the future as well as the relative importance of various technologies <ul> <li>Relies on: deep domain knowledge, discussions between many people with deep domain knowledge, font of information flowing into Microsoft</li> </ul></li> </ul> <h3 id="ray-kurzweil">Ray Kurzweil</h3> <p>Ray Kurzweil has claimed to have an 86% accuracy rate on his predictions, a claim which is often repeated, such as by Peter Diamandis where he says:</p> <blockquote> <p>Of the 147 predictions that Kurzweil has made since the 1990's, fully 115 of them have turned out to be correct, and another 12 have turned out to be &quot;essentially correct&quot; (off by a year or two), giving his predictions a stunning 86% accuracy rate.</p> </blockquote> <p>The article is titled &quot;A Google Exec Just Claimed The Singularity Will Happen by 2029&quot; opens with &quot;Ray Kurzweil, Google's Director of Engineering, is a well-known futurist with a high-hitting track record for accurate predictions.&quot; and it cites <a href="https://web.archive.org/web/20170225013846/https://en.wikipedia.org/wiki/Predictions_made_by_Ray_Kurzweil">this list of predictions on wikipedia</a>. 86% is an astoundingly good track record for non-obvious, major, predictions about the future. This claim seems to be the source of other people claiming that Kurzweil has a high accuracy rate, <a href="https://nitter.ca/naijaflyingdr/status/1552751297908117504">such as here</a> and here. 
I checked the accuracy rate of the wikipedia list Diamandis cited myself (using archive.org to get the list from when his article was published) and found a somewhat lower accuracy of 7%.</p> <p>Fundamentally, the thing that derailed so many of Kurzweil's predictions is that he relied on the idea of exponential and accelerating growth in basically every area he can imagine, and even in a number of areas that have had major growth, the growth didn't keep pace with his expectations. His basic thesis is that not only do we have exponential growth due to progress (improve technologically, etc.), improvement in technology feeds back into itself, causing an increase in the rate of exponential growth, so we have double exponential growth (as in <code>e^x^x, not 2*e^x</code>) in many important areas, such as computer performance. He repeatedly talks about this unstoppable exponential or super exponential growth, e.g., in his 1990 book, <a href="https://amzn.to/3LnrpYY">The Age of Intelligent Machines</a>, he says &quot;One reliable prediction we can make about the future is that the pace of change will continue to accelerate&quot; and he discusses this again in his 1999 book, <a href="https://amzn.to/3DCsj1U">The Age of Spiritual Machines</a>, his 2001 essay on accelerating technological growth, titled &quot;The Law of Accelerating Returns&quot;, his 2005 book, <a href="https://amzn.to/3LnvTik">The Singularity is Near</a>, etc.</p> <p>One thing that's notable is despite the vast majority of his falsifiable predictions from earlier work being false, Kurzweil continues to use the same methodology to generate new predictions each time, which is <a href="http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics11.pdf">reminiscent of Andrew Gelman's discussion of forecasters who repeatedly forecast the same thing over and over again in the face of evidence that their old forecasts were wrong</a>. For example, in his 2005 The Singularity is Near, Kurzweil notes the existence of &quot;S-curves&quot;, where growth from any particular &quot;thing&quot; isn't necessarily exponential, but, as he did in 1990, concludes that exponential growth will continue because some new technology will inevitably be invented which will cause exponential growth to continue and that &quot;The law of accelerating returns applies to all of technology, indeed to any evolutionary process. It can be charted with remarkable precision in information-based technologies because we have well-defined indexes (for example, calculations per second per dollar, or calculations per second per gram) to measure them&quot;.</p> <p>In 2001, he uses this method to plot a graph and then predicts unbounded life expectancy by 2011 (the quote below isn't unambiguous on life expectancy being unbounded, but it's unambiguous if you read the entire essay or his clarification on his life expectancy predictions, where he says &quot;I don’t mean life expectancy based on your birthdate, but rather your remaining life expectancy&quot;):</p> <blockquote> <p>Most of you (again I’m using the plural form of the word) are likely to be around to see the Singularity. The expanding human life span is another one of those exponential trends. In the eighteenth century, we added a few days every year to human longevity; during the nineteenth century we added a couple of weeks each year; and now we’re adding almost a half a year every year. 
With the revolutions in genomics, proteomics, rational drug design, therapeutic cloning of our own organs and tissues, and related developments in bio-information sciences, we will be adding more than a year every year within ten years.</p> </blockquote> <p>Kurzweil pushes the date this is expected to happen back by more than one year per year (the last citation I saw on this was a 2016 prediction that we would have unbounded life expectancy by 2029), which is characteristic of many of Kurzweil's predictions.</p> <p>Quite a few people have said that Kurzweil's methodology is absurd because exponential growth can't continue indefinitely in the real world, but Kurzweil explains why he believes this is untrue in his 1990 book, The Age of Intelligent Machines:</p> <blockquote> <p>A remarkable aspect of this new technology is that it uses almost no natural resources. Silicon chips use infinitesimal amounts of sand and other readily available materials. They use insignificant amounts of electricity. As computers grow smaller and smaller, the material resources utilized are becoming an inconsequential portion of their value. Indeed, software uses virtually no resources at all.</p> </blockquote> <p>That we're entering a world of natural resource abundance because <a href="datacenter-power/">resources and power are irrelevant to computers hasn't been correct so far</a>, but luckily for Kurzweil, many of the exponential and double exponential processes he predicted would continue indefinitely stopped long before natural resource limits would come into play, so this wasn't a major reason Kurzweil's predictions have been wrong, although it would be if his predictions were less inaccurate.</p> <p>At a meta level, one issue with Kurzweil's methodology is that he has a propensity to &quot;round up&quot; to make growth look faster than it is in order to fit the world to his model. For example, in &quot;The Law of Accelerating Returns&quot;, we noted that Kurzweil predicted unbounded lifespan by 2011 based on accelerating lifespan when &quot;now we’re adding almost a half a year every year&quot; in 2001. However, life expectancy growth in the U.S. (which, based on his comments, seems to be most of what Kurzweil writes about) was <a href="https://pubmed.ncbi.nlm.nih.gov/15008552/">only 0.2 years per year overall and 0.1 years per year in longer lived demographics</a> and worldwide life expectancy was 0.3 years per year. While it's technically true that you can round 0.3 to 0.5 if you're rounding to the nearest 0.5, that's a very unreasonable thing to do when trying to guess when unbounded lifespan will happen because the high rate of worldwide increase life expectancy was mostly coming from &quot;catch up growth&quot; where there was a large reduction in things that caused &quot;unnaturally&quot; shortened lifespans.</p> <p>If you want to predict what's going to happen at the high end, it makes more sense to look at high-end lifespans, which were increasing much more slowly. Another way in which Kurzweil rounded up to get his optimistic prediction was to select a framing that made it look like we were seeing extremely rapid growth in life expectancies. But if we simply plot life expectancy over time since, say, 1950, we can see that growth is mostly linear-ish trending to sub-linear (and this is true even if we cut the graph off when Kurzweil was writing in 2001), with some super-linear periods that trend down to sub-linear. 
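<p>To get a rough sense of how much this particular bit of rounding up matters, here's a small back-of-the-envelope sketch. It's my own illustration, not a calculation from the essay or from Kurzweil: it charitably reads Kurzweil's claim of going from almost half a year of added life expectancy per year to more than a year per year within a decade as the growth rate doubling every ten years, and then asks how long the same acceleration would take to cross the one-year-per-year threshold when starting from the rates that were actually observed.</p> <pre><code>import math

# Assumption (not Kurzweil's stated model): the yearly gain in life expectancy
# doubles every decade. 'Unbounded' life expectancy requires the gain to reach
# one year of life expectancy per calendar year.
def years_until_escape(rate_per_year, doubling_time=10.0):
    # years t until rate_per_year * 2**(t / doubling_time) reaches 1.0
    return doubling_time * math.log2(1.0 / rate_per_year)

rates = {
    'rounded-up rate of almost 0.5 yr/yr': 0.5,
    'worldwide average of about 0.3 yr/yr': 0.3,
    'US overall rate of about 0.2 yr/yr': 0.2,
    'longer-lived demographics at about 0.1 yr/yr': 0.1,
}
for label, rate in rates.items():
    print(label, '-', round(years_until_escape(rate)), 'years from 2001')
# 0.5 crosses the threshold after about 10 years (2011, the claimed date);
# 0.2 takes about 23 years and 0.1 about 33 years, so the rounded-up starting
# rate alone moves the predicted date by one to two decades, before even asking
# whether the doubling-per-decade assumption is justified.</code></pre>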
Kurzweil says he's a fan of using indexes, etc., to look at growth curves, but in this case where he can easily do so, he instead chooses to pick some numbers out of the air because his &quot;standard&quot; methodology of looking at the growth curves results in a fairly boring prediction of lifespan growth slowing down, so there are three kinds of rounding up in play here (picking an unreasonably optimistic number, rounding up that number, and then selectively not plotting a bunch of points on the time series to paint the picture Kurzweil wants to present).</p> <p>Kurzweil's &quot;rounding up&quot; is also how he came up with the predictions that, among other things, computer performance/size/cost and economic growth would follow double exponential trajectories. For computer cost / transistor size, Kurzweil plotted, on a log scale, a number of points on the silicon scaling curve, plus one very old point from the pre-silicon days, when transistor size was on a different scaling curve. He then fits what appears to be a cubic to this, and since a cubic &quot;wants to&quot; either have high growth or high anti-growth in the future, and the pre-silicon point pulls the cubic fit very far down in the past, the cubic fit must &quot;want to&quot; go up in the future and Kurzweil rounds up this cubic growth to exponential. This was also very weakly supported by the transistor scaling curve at the time Kurzweil was writing. As someone who was following <a href="https://en.wikipedia.org/wiki/International_Technology_Roadmap_for_Semiconductors">ITRS roadmaps</a> at the time, my recollection is that ITRS set a predicted Moore's law scaling curve and semiconductor companies raced to beat the curve, briefly allowing what appeared to be super-exponential scaling since they would consistently beat the roadmap, which was indexed against Moore's law. However, anyone who actually looked at the details of what was going on or talked to semiconductor engineers instead of just looking at the scaling curve would've known that people generally expected both that super-exponential scaling was temporary and not sustainable and that <a href="https://en.wikipedia.org/wiki/Dennard_scaling">the end of Dennard scaling</a> as well as transistor-delay dominated (as opposed to interconnect delay-dominated) high-performance processors were imminent, meaning that exponential scaling of transistor sizes would not lead to the historical computer performance gains that had previously accompanied transistor scaling; this expectation was so widespread that it was discussed in undergraduate classes at the time. 
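<p>To make the curve-fitting mechanics described above concrete, here's a minimal sketch with entirely made-up numbers rather than Kurzweil's actual data: fit the recent, silicon-era points on a log scale with a straight line (plain exponential growth), then add a single much older pre-silicon point and fit a cubic to everything, and compare the two forward extrapolations.</p> <pre><code>import numpy as np

# Made-up silicon-era points: a straight line in log10(performance per dollar)
# space, i.e. plain exponential growth, doubling roughly every two years.
t_silicon = np.arange(0, 31, 5)            # years since 1970: 0, 5, ..., 30
log_perf_silicon = 0.15 * t_silicon

# Add one much older, pre-silicon point sitting well below that trend line.
t_all = np.insert(t_silicon, 0, -30.0)     # 1940
log_perf_all = np.insert(log_perf_silicon, 0, -8.0)

line = np.polyfit(t_silicon, log_perf_silicon, 1)   # recent data only
cubic = np.polyfit(t_all, log_perf_all, 3)          # all data, old outlier included

for future in (40.0, 50.0, 60.0):          # 2010, 2020, 2030
    print(int(1970 + future),
          'line:', round(float(np.polyval(line, future)), 1),
          'cubic:', round(float(np.polyval(cubic, future)), 1))
# The pre-silicon point drags the cubic far down in the past, which forces it
# to curve upward in the future, so the extrapolated cubic pulls further and
# further ahead of the plain exponential trend that the recent points follow.</code></pre>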
Anyone who spent even the briefest amount of time looking into semiconductor scaling would've known these things at the time Kurzweil was talking about how we were entering an era of double exponential scaling and would've thought that we would be lucky to even have general single exponential scaling of computer performance, but since Kurzweil looks at the general shape of the curve and not the mechanism, none of this knowledge informed his predictions, and since Kurzweil rounds up the available evidence to support his ideas about accelerating acceleration of growth, he was able to find a selected set of data points that supported the curve fit he was looking for.</p> <p>We'll see this kind of rounding up done by other futurists discussed here, as well as longtermists discussed in the appendix, and we'll also see some of the same themes over and over again, particularly exponential growth and the idea that exponential growth will lead to even faster exponential growth due to improvements in technology causing an acceleration of the rate at which technology improves.</p> <h3 id="jacque-fresco">Jacque Fresco</h3> <p>In 1969, Jacque Fresco wrote <a href="https://amzn.to/3LmxYuO">Looking Forward</a>. Fresco claims it's possible to predict the future by knowing what values people will have in the future and then using that to derive what the future will look like. Fresco doesn't describe how one can know the values people will have in the future and assumes people will have the values he has, which one might describe as 60s/70s hippy values. Another major mechanism he uses to predict the future is the idea that people of the future will be more scientific and apply the scientific method.</p> <p>He writes about how &quot;the scientific method&quot; is only applied in a limited fashion, which led to thousands of years of slow progress. But, unlike in the 20th century, in the 21st century, people will be free from bias and apply &quot;the scientific method&quot; in all areas of their life, not just when doing science. People will be fully open to experimentation in all aspects of life and all people will have &quot;a habitual open-mindedness coupled with a rigid insistence that all problems be formulated in a way that permits factual checking&quot;.</p> <p>This will, among other things, lead to complete self-knowledge of one's own limitations for all people as well as an end to unhappiness due to suboptimal political and social structures.</p> <p>The third major mechanism Fresco uses to derive his predictions is the idea that computers will be able to solve basically any problem one can imagine and that manufacturing technology will also progress similarly.</p> <p>Each of the major mechanisms that are in play in Fresco's predictions is indistinguishable from magic. If you can imagine a problem in the domain, the mechanism is able to solve it. There are other magical mechanisms in play as well, generally what was in the air at the time. 
For example, behaviorism and operant conditioning were very trendy at the time, so Fresco assumes that society at large will be able to operant condition itself out of any social problems that might exist.</p> <p>Although most of Fresco's predictions are technically not yet judgable because they're about the far future, for the predictions he makes whose time has come, I didn't see one accurate prediction.</p> <h3 id="buckminster-fuller">Buckminster Fuller</h3> <p>Fuller is <a href="https://news.ycombinator.com/item?id=32462494">best known for inventing the geodesic dome</a>, although geodesic domes were actually made by Walther Bauersfeld decades before Fuller &quot;invented&quot; them. Fuller is also known for a variety of other creations, like the <a href="https://slate.com/technology/2022/08/the-dymaxion-car-the-true-history-of-buckminster-fullers-failed-automobile.html">Dymaxion car</a>, as well as his futurist predictions.</p> <p>I couldn't find a great source of a very long list of predictions from Fuller, but I did find <a href="https://www.youtube.com/watch?v=wPETzKYLkco">this interview, where he makes a number of predictions</a>. Fuller basically free associates with words, making predictions by riffing off of the English meaning of the word (e.g., see the teleportation prediction) or sometimes an even vaguer link.</p> <p>Predictions from the video:</p> <ul> <li>We'll be able to send people by radio because atoms have frequencies and radio waves have frequencies so it will be possible to pick up all of our frequencies and send them by radio</li> <li>Undeveloped countries (as opposed to highly developed countries) will be able to get the most advanced technologies &quot;via the moon&quot; <ul> <li>We're going to put people on the moon for a year, which will require putting something like mile diameter of earth activity into a little black box weighing 500 lbs so that the moon person will be able to operate locally as if they were on earth</li> <li>This will result in everyone realizing they could just get a little black box and they'll no longer need local sewer systems, water, power, etc.</li> </ul></li> <li>Humans will be fully automated out of physical work <ul> <li>The production capability of China and India will be irrelevant and the only thing that will matter is who can &quot;get&quot; the consumers from China and India</li> </ul></li> <li>There will be a realistic accounting system of what wealth is, which is really about energy due to the law of conservation of energy, which also means that wealth won't deteriorate and get lost <ul> <li>Wealth can only increase because energy can't be created or destroyed and when you do an experiment, you can only learn more, so wealth can only be created</li> <li>This will make the entire world successful</li> </ul></li> </ul> <p>For those who've heard that Fuller predicted the creation of Bitcoin, that last prediction about an accounting system for wealth is the one people are referring to. Typically, people who say this haven't actually listened to the interview where he states the whole prediction and are themselves using Fuller's free association method. Bitcoin comes from spending energy to mine Bitcoin and Fuller predicted that the future would have a system of wealth based on energy, therefore Fuller predicted the creation of Bitcoin. 
If you actually listen to the interview, Bitcoin doesn't even come close to satisfying the properties of the system Fuller describes, but that doesn't matter if you're doing Fuller-style free association.</p> <p>In this post, Fuller has fewer predictions graded than almost anyone else, so it's unclear what his accuracy would be if we had a list of, say, 100 predictions, but the predictions I could find have a 0% accuracy rate.</p> <h3 id="michio-kaku">Michio Kaku</h3> <p>Among people on Wikipedia's futurist list, Michio Kaku is probably relatively well known because, as part of his work on science popularization, he's had a nationally (U.S.) syndicated radio show since 2006 and he frequently appears on talk shows and is interviewed by news organizations.</p> <p>In his 1997 book, <a href="https://amzn.to/3BMxY4g">Visions: How Science Will Revolutionize the 21st Century</a>, Kaku explains why predictions from other futurists haven't been very accurate and why his predictions are different:</p> <blockquote> <p>... most predictions of the future have floundered because they have reflected the eccentric, often narrow viewpoints of a single individual.</p> <p>The same is not true of Visions. In the course of writing numerous books, articles, and science commentaries, I have had the rare privilege of interviewing over 150 scientists from various disciplines during a ten-year period.</p> <p>On the basis of these interviews, I have tried to be careful to delineate the time frame over which certain predictions will or will not be realized. Scientists expect some predictions to come about by the year 2020; others will not materialize until much later—from 2050 to the year 2100.</p> </blockquote> <p>Kaku also claims that his predictions are more accurate than many other futurists because he's a physicist and thinking about things in the ways that physicists do allows for accurate predictions of the future:</p> <blockquote> <p>It is, I think, an important distinction between Visions, which concerns an emerging consensus among the scientists themselves, and the predictions in the popular press made almost exclusively by writers, journalists, sociologists, science fiction writers, and others who are <i>consumers</i> of technology, rather than by those who have helped to shape and <i>create</i> it. ... As a research physicist, I believe that physicists have been particularly successful at predicting the broad outlines of the future. Professionally, I work in one of the most fundamental areas of physics, the quest to complete Einstein’s dream of a “theory of everything.” As a result, I am constantly reminded of the ways in which quantum physics touches many of the key discoveries that shaped the twentieth century.</p> <p>In the past, the track record of physicists has been formidable: we have been intimately involved with introducing a host of pivotal inventions (TV, radio, radar, X-rays, the transistor, the computer, the laser, the atomic bomb), decoding the DNA molecule, opening new dimensions in probing the body with PET, MRI, and CAT scans, and even designing the Internet and the World Wide Web.</p> </blockquote> <p>He also specifically calls out Kurzweil's predictions as absurd, saying Kurzweil has &quot;preposterous predictions about the decades ahead, from vacationing on Mars to banishing all diseases.&quot;</p> <p>Although Kaku finds Kurzweil's predictions ridiculous, his predictions rely on some of the same mechanics Kurzweil relies on. 
For example, Kaku assumes that materials / commodity prices will tank in the then-near future because the advance of technology will make raw materials less important and Kaku also assumes the performance and cost scaling of computer chips would continue on the historical path it was on during the 70s and 80s. Like most of the other futurists from Wikipedia's list, Kaku also assumed that the pace of scientific progress would rapidly increase, although his reasons are different (he cites increased synergy between the important fields of quantum mechanics, computer science, and biology, which he says are so important that &quot;it will be difficult to be a research scientist in the future without having some working knowledge of&quot; all of those fields).</p> <p>Kaku assumed that UV lithography would run out of steam and that we'd have to switch to X-ray or electron lithography, which would then run out of steam, requiring us to switch to a fundamentally different substrate for computers (optical, molecular, or DNA) to keep performance and scaling on track, but advances in other fundamental computing substrates have not materialized quickly enough for Kaku's predictions to come to pass. Kaku assigned very high weight to things that have what he considers &quot;quantum&quot; effects, which is why, for example, he cites the microprocessor as something that will be obsolete by 2020 (they're not &quot;quantum&quot;) whereas fiber optics will not be obsolete (they rely on &quot;quantum&quot; mechanisms). Although Kaku pans other futurists for making predictions without having a real understanding of the topics they're discussing, it's not clear that Kaku has a better understanding of many of the topics being discussed even though, as a physicist, Kaku has more relevant background knowledge.</p> <p>The combination of assumptions above that didn't pan out leads to a fairly low accuracy rate for Kaku's predictions in Visions.</p> <p>I didn't finish Visions, but the prediction accuracy rate of the part of the book I read (from the beginning until somewhere in the middle, to avoid cherry picking) was 3% (arguably 6% if you give full credit to the prediction I gave half credit to). He made quite a few predictions I didn't score in which he said something &quot;may&quot; happen. Such a prediction is, of course, unfalsifiable because the statement is true whether or not the event happens.</p> <h3 id="john-naisbitt">John Naisbitt</h3> <p>Anyone who's a regular used book store bargain bin shopper will have seen this name on the cover of <a href="https://amzn.to/3qMyokJ">Megatrends</a>, which must be up there with Lee Iacocca's autobiography as one of the most common bargain bin fillers.</p> <p>Naisbitt claims that he's able to accurately predict the future using &quot;content analysis&quot; of newspapers, which he says was used to provide great insights during WWII and has been widely used by the intelligence community since then, but hadn't been commercially applied until he did it. Naisbitt explains that this works because there's a fixed amount of space in newspapers (apparently newspapers can't be created or destroyed nor can newspapers decide to print significantly more or less news or have editorial shifts in what they decide to print that are not reflected by identical changes in society at large):</p> <blockquote> <p>Why are we so confident that content analysis is an effective way to monitor social change? Simply stated, because the news hole in a newspaper is a closed system. 
For economic reasons, the amount of space devoted to news in a newspaper does not change significantly over time. So, when something new is introduced, something else or a combination of things must be omitted. You cannot add unless you subtract. It is the principle of forced choice in a closed system.</p> </blockquote> <p>Unfortunately, it's not really possible to judge Naisbitt's predictions because he almost exclusively deals in vague, horoscope-like, predictions which can't really be judged as correct or incorrect. If you just read Megatrends for the flavor of each chapter and don't try to pick out individual predictions, some chapters seem quite good, e.g., &quot;Industrial Society -&gt; Information Society&quot;, but some are decidedly mixed even if you very generously grade his vague predictions, e.g., &quot;From Forced Technology to High Tech / High Touch&quot;. This can't really be compared to the other futurists in this post because it's much easier to make vague predictions sound roughly correct than to make precise predictions correct but, even so, if reading for general feel of what direction the future might go, Naisbitt's predictions are much more on the mark than any other futurists discussed.</p> <p>That being said, as far as I read in his book, the one concrete prediction I could find was incorrect, so if you want to score Naisbitt comparably to the other futurists discussed here, you might say his accuracy rate is 0% but with very wide error bars.</p> <h3 id="gerard-k-o-neill">Gerard K. O'Neill</h3> <p>O'Neill has two relatively well-known non-fiction futurist books, <a href="https://amzn.to/3dtGtHX">2081</a> and <a href="https://amzn.to/3do9bde">The Technology Edge</a>. 2081 was written in 1980 and predicts the future 100 years from then. The Technology Edge discusses what O'Neill thought the U.S. needed to do in 1983 to avoid being obsoleted by Japan.</p> <p>O'Neill spends a lot more space on discussing why previous futurists were wrong than any other futurist under discussion. O'Neill notes that &quot;most [futurists] overestimated how much the world would be transformed by social and political change and underestimated the forces of technological change&quot; and cites Kipling, Verne, Wells, Haldane, and Ballamy, as people who did this. O'Neill also says that &quot;scientists tend to overestimate the chances for major scientific breakthroughs and underestimate the effects of straightforward developments well within the boundaries of existing knowledge&quot; and cites Haldane again on this one. O'Neill also cites spaceflight as a major miss of futurists past, saying that they tended to underestimate how quickly spaceflight was going to develop.</p> <p>O'Neill also says that it's possible to predict the future without knowing the exact mechanism by which the change will occur. For example, he claims that the automobile could've been safely predicted even if the internal combustion engine hadn't been invented because steam would've also worked. 
But he also goes on to say that there are things it would've been unreasonable to predict, like the radio, TV, and electronic communications: even though the foundations for those were discovered in 1865, the time interval between a foundational discovery and its application is &quot;usually quite long&quot;, and he cites 30-50 years from quantum mechanics to integrated circuits, 100+ years from relativity to faster than light travel, and 50+ years from the invention of nuclear power without &quot;a profound impact&quot;.</p> <p>I don't think O'Neill ever really explains why his predictions are of the &quot;automobile&quot; kind in a convincing way. Instead, he relies on doing the opposite of what he sees as mistakes others made. The result is that he predicts huge advancements in space flight, saying we should expect large scale space travel and colonization by 2081, presaged by wireless transmission of energy by 2000 (referring to energy beamed down from satellites) and interstellar probes by 2025 (presumably something of a different class than the Voyager probes, which were sent out in 1977).</p> <p>In 1981, he said &quot;a fleet of reusable vehicles of 1990s vintage, numbering much less than today's world fleet of commercial jet transports, would be quite enough to provide transport into space and back again for several hundred million people per year&quot;, predicting that something much more advanced than the NASA Space Shuttle would be produced shortly afterwards. Continuing that progress &quot;by the year 2010 or thereabouts there will be many space colonies in existence and many new ones being constructed each year&quot;.</p> <p>Most of O'Neill's predictions are for 2081, but he does make the occasional prediction for a time before 2081. All of the falsifiable ones I could find were incorrect, giving him an accuracy rate of approximately 0% but with fairly wide error bars.</p> <h3 id="patrick-dixon">Patrick Dixon</h3> <p>Dixon is best known for writing <a href="https://amzn.to/3Br9JqK">Futurewise</a>, but he has quite a few books with predictions about the future. In this post, we're just going to look at Futurewise, because it's the most prediction-oriented book Dixon has that's old enough that we ought to be able to make a call on a decent number of his predictions (Futurewise is from 1998; his other obvious candidate, <a href="https://amzn.to/3qKT1O6">The Future of Almost Everything</a> is from 2015 and looks forward a century).</p> <p>Unlike most other futurists featured in this post, Dixon doesn't explicitly lay out why you should trust his predictions in Futurewise in the book itself, although he sort of implicitly does so in the acknowledgements, where he mentions having interacted with many very important people.</p> <blockquote> <p>I am indebted to the hundreds of senior executives who have shaped this book by their participation in presentations on the Six Faces of the Future. The content has been forged in the realities of their own experience.</p> </blockquote> <p>And although he doesn't explicitly refer to himself, he also says that business success will come from listening to folks who have great vision:</p> <blockquote> <p>Those who are often right will make a fortune. Trend hunting in the future will be a far cry from the seventies or eighties, when everything was more certain. In a globalized market there are too many variables for back-projection and forward-projection to work reliably .. 
That's why economists don't make good futurologists when it comes to new technologies, and why so many boards of large corporations are in such a mess when it comes to quantum leaps in thinking beyond 2000.</p> <p>Second millennial thinking will never get us there ... A senior board member of a Fortune 1000 company told me recently: 'I'm glad I'm retiring so I don't have to face these decisions' ... 'What can we do?' another senior executive declares ...</p> </blockquote> <p>Later, in The Future of Almost Everything, Dixon lays out the techniques that he says worked when he wrote Futurewise, which &quot;has stood the test of time for more than 17 years&quot;. Dixon says:</p> <blockquote> <p>All reliable, long-range forecasting is based on powerful megatrends that have been driving profound, consistent and therefore relatively predictable change over the last 30 years. Such trends are the basis of every well-constructed corporate strategy and government policy ... These wider trends have been obvious to most trend analysts like myself for a while, and have been well described over the last 20–30 years. They have evolved much more slowly than booms and busts, or social fads.</p> </blockquote> <p>And lays out trends such as:</p> <ul> <li>fall in costs of production of most mass-produced items</li> <li>increased concern about environment/sustainability</li> <li>fall in price of digital technology, telecoms and networking</li> <li>rapid growth of all kinds of wireless/mobile devices</li> <li>ever-larger global corporations, mergers, consolidations</li> </ul> <p>Dixon declines to mention trends he predicted that didn't come to pass (such as his prediction that increased tribalism will mean that most new wealth is created in small firms of 20 or fewer employees which will mostly be family owned, or his prediction that the death of &quot;old economics&quot; means that we'll be able to have high economic growth with low unemployment and no inflationary pressure indefinitely), or cases where the trend progression caused Dixon's prediction to be wildly incorrect, a common problem when making predictions off of exponential trends because a relatively small inaccuracy in the rate of change can result in a very large change in the final state.</p> <p><a href="https://web.archive.org/web/20210814150424/https://www.globalchange.com/the-future-of-almost-everything-new-book-by-patrick-dixon.htm">Dixon's website is full of endorsements for him, with implicit and explicit claims that he's a great predictor of the future</a>, as well as more general statements such as &quot;Patrick Dixon has been ranked as one of the 20 most influential business thinkers alive today&quot;.</p> <p>Back in Futurewise, Dixon relies heavily on the idea of a stark divide between &quot;second millennial thinking&quot; and &quot;third millennial thinking&quot;, which repeatedly comes up in his text. Like nearly everyone else under discussion, Dixon also extrapolates out from many existing trends to make predictions that didn't pan out, e.g., he looked at the falling price of phone lines and predicted that people would end up with a huge number of phone lines in their home by 2005 and that screens getting thinner would mean that we'd have &quot;paper-thin display sheets&quot; in significant use by 2005. 
This kind of extrapolation sometimes works and Dixon's overall accuracy rate of 10% is quite good compared to the other &quot;futurists&quot; under discussion here.</p> <p>However, when Dixon explains his reasoning in areas I have some understanding of, he seems to be operating at the <a href="cocktail-ideas/">buzzword level</a>, so that when he makes a correct call, it's generally for the wrong reasons. For example, Dixon says that software will always be buggy, which seems true, at least to date. However, his reasoning for this is that new computers come out so frequently (he says &quot;less than 20 months&quot; — a reference to the 18 month timeline in Moore's law) and it takes so long to write good software (&quot;at least 20 years&quot;) that programmers will always be too busy rewriting software to run on the new generation of machines (due to the age of the book, he uses the example of &quot;brand new code ... written for Pentium chips&quot;).</p> <p>It's simply not the case that most bugs or even, as a fraction of bugs, almost any bugs are due to programmers rewriting existing code to run on new CPUs. If you really squint, you can see things like <a href="android-updates/">Android devices having lots of security bugs due to the difficulty of updating Android and backporting changes to older hardware</a>, but those kinds of bugs are both a small fraction of all bugs and not really what Dixon was talking about.</p> <p>Similarly, on how computer backups will be done in the future, Dixon basically correctly says that home workers will be vulnerable to data loss and people who are serious about saving data will back up data online, &quot;back up data on-line to computers in other cities as the ultimate security&quot;.</p> <p>But Dixon's stated reason for this is that workstations already have large disk capacity (&gt;= 2GB) and floppy disks haven't kept up (&lt; 2MB), so it would take thousands of floppy disks to do backups, which is clearly absurd. However, even at the time, Zip drives (100MB per portable disk) were common and, although it didn't take off, the same company that made Zip drives also made 1GB &quot;Jaz&quot; drives. And, of course, tape backup was also used at the time and is still used today. This trend has continued to this day; large, portable, disks are available, and quite a few people I know transfer or back up large amounts of data on portable disks. The reason most people don't do disk/tape backups isn't that it would require thousands of disks to backup a local computer (if you look at the computers people typically use at home, most people could back up their data onto a single portable disk per failure domain and even keep multiple versions on one disk), but that online/cloud backups are more convenient.</p> <p>Since Dixon's reasoning was incorrect (at least in the cases where I'm close enough to the topic to understand how applicable the reasoning was), it seems that when Dixon is correct, it can't be for the stated reason and Dixon is either correct by coincidence or because he's looking at the broader trend and came up with an incorrect rationalization for the prediction. 
But, per the above, it's very difficult to actually correctly predict the growth rate of a trend over time, so without some understanding of the mechanics in play, one could also say that a prediction that comes true based on some rough trend is also correct by coincidence.</p> <h3 id="alvin-toffler-heidi-toffler">Alvin Toffler / Heidi Toffler</h3> <p>Like most others on this list, <a href="https://www.denverpost.com/2016/06/29/author-alvin-toffler-dies/">Toffler claims some big prediction wins</a></p> <blockquote> <p>The Tofflers claimed on their website to have foretold the breakup of the Soviet Union, the reunification of Germany and the rise of the Asia-Pacific region. He said in the People’s Daily interview that “Future Shock” envisioned cable television, video recording, virtual reality and smaller U.S. families.</p> </blockquote> <p>In this post, we'll look at <a href="https://amzn.to/3xxaKMW">Future Shock</a>, Toffler's most famous work, written in 1970.</p> <p>According to a number of sources, Alvin Toffler's major works were co-authored by Heidi Toffler. In the books themselves, Heidi Toffler is acknowledged as someone who helped out a lot, but not as an author, despite the remarks elsewhere about co-authorship. In this section, I'm going to refer to Toffler in the singular, but you may want to mentally substitute the plural.</p> <p>Toffler claims that we should understand the present not only by understanding the past, but also by understanding the future:</p> <blockquote> <p>Previously, men studied the past to shed light on the present. I have turned the time-mirror around, convinced that a coherent image of the future can also shower us with valuable insights into today. We shall find it increasingly difficult to understand our personal and public problems without making use of the future as an intellectual tool. In the pages ahead, I deliberately exploit this tool to show what it can do.</p> </blockquote> <p>Toffler generally makes vague, wish-y wash-y statements, so it's not really reasonable to score Toffler's concrete predictions because so few predictions are given. However, Toffler very strongly implies that past exponential trends are expected to continue or even accelerate and that the very rapid change caused by this is going to give rise to &quot;future shock&quot;, hence the book's title:</p> <blockquote> <p>I coined the term &quot;future shock&quot; to describe the shattering stress and disorientation that we induce in individuals by subjecting them to too much change in too short a time. Fascinated by this concept, I spent the next five years visiting scores of universities, research centers, laboratories, and government agencies, reading countless articles and scientific papers and interviewing literally hundreds of experts on different aspects of change, coping behavior, and the future. Nobel prizewinners, hippies, psychiatrists, physicians, businessmen, professional futurists, philosophers, and educators gave voice to their concern over change, their anxieties about adaptation, their fears about the future. I came away from this experience with two disturbing convictions. First, it became clear that future shock is no longer a distantly potential danger, but a real sickness from which increasingly large numbers already suffer. This psycho-biological condition can be described in medical and psychiatric terms. It is the disease of change .. 
Earnest intellectuals talk bravely about &quot;educating for change&quot; or &quot;preparing people for the future.&quot; But we know virtually nothing about how to do it ... The purpose of this book, therefore, is to help us come to terms with the future— to help us cope more effectively with both personal and social change by deepening our understanding of how men respond to it</p> </blockquote> <p>The big hammer that Toffler uses everywhere is extrapolation of exponential growth, with the implication that this is expected to continue. On the general concept of extrapolating out from curves, Toffler's position is very similar to Kurzweil's: if you can see a trend on a graph, you can use that to predict the future, and the ability of technology to accelerate the development of new technology will cause innovation to happen even more rapidly than you might naively expect:</p> <blockquote> <p>Plotted on a graph, the line representing progress in the past generation would leap vertically off the page. Whether we examine distances traveled, altitudes reached, minerals mined, or explosive power harnessed, the same accelerative trend is obvious. The pattern, here and in a thousand other statistical series, is absolutely clear and unmistakable. Millennia or centuries go by, and then, in our own times, a sudden bursting of the limits, a fantastic spurt forward. The reason for this is that technology feeds on itself. Technology makes more technology possible, as we can see if we look for a moment at the process of innovation. Technological innovation consists of three stages, linked together into a self-reinforcing cycle. ... Today there is evidence that the time between each of the steps in this cycle has been shortened. Thus it is not merely true, as frequently noted, that 90 percent of all the scientists who ever lived are now alive, and that new scientific discoveries are being made every day. These new ideas are put to work much more quickly than ever before.</p> </blockquote> <p>The first N major examples of this from the book are:</p> <ul> <li>Population growth rate (doubling time of 11 years), which will have to create major changes</li> <li>Economic growth (doubling time of 15 years), which will increase the amount of stuff people own (this is specifically phrased as amount of stuff and not wealth) <ul> <li>It's very strongly implied that this will continue for at least 70 years</li> </ul></li> <li>Speed of travel; no doubling time is stated, but the reader is invited to extrapolate from the following points: human running speed millions of years ago, 100 mph in the 1880s, 400 mph in 1938, 800 mph by 1958, 4000 mph very shortly afterwards (18000 mph when orbiting the earth)</li> <li>Reduced time from conception of an idea to the application, used to support the idea that growth will accelerate</li> </ul> <p>As we just noted above, when discussing Dixon, Kurzweil, etc., predicting the future by extrapolating out exponential growth is fraught. Toffler somehow manages to pull off the anti-predictive feat of naming a bunch of trends which were about to stop, some of which already had their writing on the wall when Toffler was writing.</p> <p>Toffler then extrapolates from the above and predicts that the half-life of everything will become shorter, which will overturn how society operates in a variety of ways.</p> <p>For example, companies and governments will replace bureaucracies with &quot;adhocracies&quot; sometime between 1995 and 2020 . 
The concern that people will feel like cogs as companies grow larger is obsolete because, in adhocracy, the entire concept of top-down command and control will disappear, obsoleted by the increased pace of everything causing top-down command and control structures to disappear. While it's true that some companies have less top-down direction than would've been expected in Toffler's time, many also have more, which has been enabled by technology allowing employers to keep stricter tabs on employees than ever before, making people more of a cog than ever before.</p> <p>Another example is that Toffler predicted human colonization of the Ocean, &quot;The New Atlantis&quot;, &quot;long before the arrival of A.D. 2000&quot;.</p> <p>Fabian Giesen points out that, independent of the accuracy of Toffler's predictions, Venkatesh Rao's <a href="https://www.ribbonfarm.com/2012/05/09/welcome-to-the-future-nauseous/">Welcome to the Future Nauseous</a> explains why &quot;future shock&quot; didn't happen in areas of very rapid technological development.</p> <h3 id="people-from-the-wikipedia-list-who-weren-t-included">People from the Wikipedia list who weren't included</h3> <ul> <li>Laurie Anderson <ul> <li>I couldn't easily find predictions from her, except some song lyrics that allegedly predicted 9/11, but in a very &quot;horoscope&quot; sort of way</li> </ul></li> <li>Arthur Harkins <ul> <li>His Wikipedia entry was later removed for notability reasons and it was already tagged as non-notable at the time</li> </ul></li> <li>Stephen Hawking <ul> <li>The predictions I could find are generally too far out to grade and are really more suggestions as to what people should do than predictions. For example the Wikipedia futurist list above links to a 2001 prediction that humans will be left behind by computers / robots if genetic engineering wasn't done to allow humans to keep up and it also links to a 2006 prediction that humans need to expand to other planets to protect the species</li> </ul></li> <li>Thorkil Kristensen <ul> <li>I couldn't easily find a set of English language predictions from Kristensen. Thorkil Kristensen is <a href="https://twitter.com/MGSchmelzer/status/1331271177118109697">associated with but not an author of</a> The Limits to Growth, a 1970s anti-growth polemic</li> </ul></li> <li>David Sears <ul> <li>Not notable enough to have a wikipedia page, then or now</li> </ul></li> <li>John Zerzan <ul> <li>Zerzan seems like more of someone who's calling for change in society due to his political views than a &quot;futurist&quot; who's trying to predict the future</li> </ul></li> </ul> <h3 id="steve-yegge">Steve Yegge</h3> <p>As I mentioned at the start, none of the futurists from Wikipedia's list had very accurate predictions, so we're going to look at a couple other people from other sources who aren't generally considered futurists to see how they rank.</p> <p>We <a href="yegge-predictions/">previously looked at Yegge's predictions here</a>, which were written in 2004 and were generally about the next 5-10 years, with some further out. There were nine predictions (technically ten, but one isn't really a prediction). 
If grading them as written, which is how futurists have been scored, I would rank these at 4.5/9, or about 50%.</p> <p>You might argue that this is unfair because Yegge was predicting the relatively near future, but if we look at relatively near future predictions from futurists, their accuracy rate is generally nowhere near 50%, so I don't think it's unfair to compare the number in some way.</p> <p>If you want to score these like people often score futurists, where they get credit for essentially getting things directionally correct, then I'd say that Yegge's score should be between 7/9 and 8/9, depending on how much partial credit he gets for one of the questions.</p> <p>If you want to take a more holistic &quot;what would the world look like if Yegge's vision were correct vs. the world we're in today&quot;, I think Yegge also does quite well there, with the big miss being that Lisp-based languages have not taken over the world, the success of Clojure notwithstanding. This is quite different than the futurists here, who generally had a vision of many giant changes that didn't come to pass, e.g., if we look at Kurzweil's vision of the world, by 2010, we would've had self-driving cars, a &quot;cure&quot; for paraplegia, widespread use of AR, etc., by 2011, we would have unbounded life expectancy, and by 2019 we would have pervasive use of nanotechnology including computers having switched from transistors to nanotubes, effective &quot;mitigations&quot; for blindness and deafness, fairly widely deployed fully realistic VR that can simulate sex via realistic full-body stimulation, pervasive self-driving cars (predicted again), entirely new fields of art and music, etc., and all that these things imply, which is a very different world than the world we actually live in.</p> <p>And we see something similar if we look at other futurists, who predicted things like living underground, living under the ocean, etc.; most predicted many revolutionary changes that would really change society, a few of which came to pass. Yegge, instead, predicted quite a few moderate changes (as well as some places where change would be slower than a lot of people expected) and changes were slower than he expected in the areas he predicted, but only by a bit.</p> <p>Yegge described his methodology for the post above as:</p> <blockquote> <p>If you read a lot, you'll start to spot trends and undercurrents. You might see people talking more often about some theme or technology that you think is about to take off, or you'll just sense vaguely that some sort of tipping point is occurring in the industry. Or in your company, for that matter.</p> <p>I seem to have many of my best insights as I'm writing about stuff I already know. It occurred to me that writing about trends that seem obvious and inevitable might help me surface a few not-so-obvious ones. So I decided to make some random predictions based on trends I've noticed, and see what turns up. It's basically a mental exercise in mining for insights</p> <p>In this essay I'll make ten predictions based on undercurrents I've felt while reading techie stuff this year. As I write this paragraph, I have no idea yet what my ten predictions will be, except for the first one. It's an easy, obvious prediction, just to kick-start the creative thought process. 
Then I'll just throw out nine more, as they occur to me, and I'll try to justify them even if they sound crazy.</p> </blockquote> <p>He's not really trying to generate the best predictions, but still did pretty well by relying on his domain knowledge plus some intuition about what he's seen.</p> <p>In the post about Yegge's predictions, we also noted that he's made quite a few successful predictions outside of his predictions post:</p> <blockquote> <p>Steve also has a number of posts that aren't explicitly about predictions that, nevertheless, make pretty solid predictions about how things are today, written way back in 2004. There's It's Not Software, which was years ahead of its time about how people write “software”, how writing server apps is really different from writing shrinkwrap software in a way that obsoletes a lot of previously solid advice, like Joel's dictum against rewrites, as well as how service oriented architectures look; the Google at Delphi (again from 2004) correctly predicts the importance of ML and AI as well as Google's very heavy investment in ML; an old interview where he predicts &quot;web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/&quot;; etc. A number of Steve's internal Google blog posts also make interesting predictions, but AFAIK those are confidential.</p> </blockquote> <p>Quite a few of Yegge's predictions would've been considered fairly non-obvious at the time and he seemed to still have a fairly good success rate on his other predictions (although I didn't try to comprehensively find them and score them, I sampled some of his old posts and found the overall success rate to be similar to the ones in his predictions post).</p> <p>With Yegge and the other predictors who were picked so that we could look at some accurate predictions, there is, of course, a concern that there's survivorship bias in picking these predictors. I suspect that's not the case for Yegge because he continued to be accurate after I first noticed that he seemed to have accurate predictions, so it's not just that I picked someone who had a lucky streak after the fact. Also, especially in some of his Google-internal G+ comments, he made fairly high-dimensional comments that ended up being right for the reasons he suggested, which provides a lot more information about how accurate his reasoning was than simply winning a bunch of coin flips in a row. This comment about depth of reasoning doesn't apply to Caplan, below, because I haven't evaluated Caplan's reasoning, but it does apply to MS leadership circa 1990.</p> <h3 id="bryan-caplan">Bryan Caplan</h3> <p>Bryan Caplan reports that his track record is 23/23 = 100%. He is much more precise in specifying his predictions than anyone else we've looked at and tries to give a precise bet that will be trivial to adjudicate as well as betting odds.</p> <p>Caplan started making predictions/bets around the time the concept that &quot;betting is a tax on bullshit&quot; became popular (the idea being that a lot of people are willing to say anything but will quiet down if asked to make a real bet and those that don't will pay a real cost if they make bad real bets) and Caplan seems to have a strategy of acting as a tax man on bullshit in that he generally takes the safe side of bets that people probably shouldn't have made.
<a href="https://statmodeling.stat.columbia.edu/2022/08/11/bets-as-forecasts-bets-as-probability-assessment-difficulty-of-using-bets-in-this-way/">Andrew Gelman says</a>:</p> <blockquote> <p>Caplan’s bets are an interesting mix. The first one is a bet where he offered 1-to-100 odds so it’s no big surprise that he won, but most of them are at even odds. A couple of them he got lucky on (for example, he bet in 2008 that no large country would leave the European Union before January 1, 2020, so he just survived by one month on that one), but, hey, it’s ok to be lucky, and in any case even if he only had won 21 out of 23 bets, that would still be impressive.</p> <p>It seems to me that Caplan’s trick here is to show good judgment on what pitches to swing at. People come at him with some strong, unrealistic opinions, and he’s been good at crystallizing these into bets. In poker terms, he waits till he has the nuts, or nearly so. 23 out of 23 . . . that’s a great record.</p> </blockquote> <p>I think there's significant value in doing this, both in the general &quot;betting is a tax on bullshit&quot; sense as well as, more specifically, in that if you have high confidence that someone is trying to take the other side of bad bets and has good judgment, knowing that the Caplan-esque bettor has taken a position gives you decent signal about the bet even if you have no particular expertise in the subject. For example, if you look at my bets, even though <a href="https://twitter.com/danluu/status/1554559905167597568">I sometimes take bets against obviously wrong positions</a>, I much more frequently take <a href="https://twitter.com/danluu/status/1555343411229536257">bets I have a very good chance of losing</a>, so just knowing that I took a bet provides much less information than knowing that Caplan took a bet.</p> <p>But, of course, taking Caplan's side of a bet isn't foolproof. As Gelman noted, Caplan got lucky at least once, and Caplan also seems likely to lose the <a href="https://standupeconomist.com/2021-update-on-my-global-warming-traffic-light-bet-with-bryan-caplan-and-alex-tabarrok/">Caplan and Tabarrok v. Bauman bet on global temperature</a>. For that particular bet, you could also make the case that he's expected to lose since he took the bet with 3:1 odds, but a lot of people would argue that 3:1 isn't nearly long enough odds to take that bet.</p> <p>The methodology that Caplan has used to date will never result in a positive prediction of some big change until the change is very likely to happen, so this methodology can't really give you a vision of what the future will look like in the way that Yegge or Gates or another relatively accurate predictor who takes wilder bets could.</p> <h3 id="bill-gates-nathan-myhrvold-ms-leadership-circa-1990-to-1997">Bill Gates / Nathan Myhrvold / MS leadership circa 1990 to 1997</h3> <p><a href="us-v-ms/">A handful of memos released to the world as a result of the case against Microsoft</a> laid out the vision Microsoft executives had about how the world would develop, with or without Microsoft's involvement. These memos don't lay out concrete predictions with timelines and therefore can't be scored in the same way futurist predictions were scored in this post.
If rating these predictions on how accurate their vision of the future was, I'd rate them similarly to Steve Yegge (who scored 7/9 or 8/9), but the predictions were significantly more ambitious, so they seem much more impressive when controlling for the scope of the predictions.</p> <p>Compared to the futurists we discussed, there are multiple ways in which the predictions are much more detailed (and therefore more impressive for a given level of accuracy on top of being more accurate). One is that MS execs have a much deeper understanding of the things under discussion and how they impact each other. With &quot;our&quot; futurists, they often discuss things at a high level and, when they discuss things in detail, they make statements that make it clear that they don't really understand the topic and often don't really know what the words they're writing mean. MS execs of the era pretty clearly had a deep understanding of the issues in play, which let them make detailed predictions that our futurists wouldn't make, e.g., while protocols like FTP and IRC will continue to be used, the near future of the internet is HTTP over TCP and the browser will become a &quot;platform&quot; in the same way that Windows is a &quot;platform&quot;, one that's much more important and larger than any OS (unless Microsoft is successful in taking action to stop this from coming to pass, which it was not despite MS execs foreseeing the exact mechanisms that could cause MS to fail to own the internet). MS execs use this level of understanding to make predictions about the kinds of larger things that our futurists discuss, e.g., the nature of work and how that will change.</p> <p>Actually having an understanding of the issues in play and not just operating with a typical futurist buzzword level understanding of the topics allowed MS leadership to make fairly good guesses about what the future would look like.</p> <p>For a fun story about how much effort Gates spent on understanding what was going on, <a href="https://www.joelonsoftware.com/2006/06/16/my-first-billg-review/" rel="nofollow">see this story by Joel Spolsky on his first Bill Gates review</a>:</p> <blockquote> <p>Bill turned to me.</p> <p>I noticed that there were comments in the margins of my spec. He had read the first page!</p> <p><I>He had read the first page of my spec and written little notes in the margin!</I></p> <p>Considering that we only got him the spec about 24 hours earlier, he must have read it the night before.</p> <p>He was asking questions. I was answering them. They were pretty easy, but I can’t for the life of me remember what they were, because I couldn’t stop noticing that he was flipping through the spec…</p> <p><I>He was flipping through the spec!</I> [Calm down, what are you a little girl?]</p> <p>… [ed: ellipses are from the original doc] and THERE WERE NOTES IN ALL THE MARGINS. ON EVERY PAGE OF THE SPEC. HE HAD READ THE WHOLE GODDAMNED THING AND WRITTEN NOTES IN THE MARGINS.</p> <p>He Read The Whole Thing! [OMG SQUEEE!]</p> <p>The questions got harder and more detailed.</p> <p>They seemed a little bit random. By now I was used to thinking of Bill as my buddy. He’s a nice guy! He read my spec! He probably just wants to ask me a few questions about the comments in the margins! I’ll open a bug in the bug tracker for each of his comments and makes sure it gets addressed, pronto!</p> <p>Finally the killer question.</p> <p>“I don’t know, you guys,” Bill said, “Is anyone really looking into all the details of how to do this? 
Like, all those date and time functions. Excel has so many date and time functions. Is Basic going to have the same functions? Will they all work the same way?”</p> <p>“Yes,” I said, “except for January and February, 1900.”</p> <p>Silence. ... “OK. Well, good work,” said Bill. He took his marked up copy of the spec ... and left</p> </blockquote> <p>Gates (and some other MS execs) were very well informed about what was going on to a fairly high level of detail considering all of the big picture concerns they also had in mind.</p> <p>A topic for another post is how MS leadership had a more effective vision for the future than leadership at old-line competitors (Novell, IBM, AT&amp;T, Yahoo, Sun, etc.) and how this resulted in MS turning into a $2T company while their competitors became, at best, irrelevant and most didn't even succeed at becoming irrelevant and ceased to exist. Reading through old MS memos, it's clear that MS really kept tabs on what competitors were doing and they were often surprised at how ineffective leadership was at their competitors, e.g., on Novell, Bill Gates says &quot;Our traditional competitors are just getting involved with the Internet. Novell is surprisingly absent given the importance of networking to their position&quot;; Gates noted that Frankenberg, then-CEO of Novell, seemed to understand the importance of the internet, but Frankenberg only joined Novell in 1994 and left in 1996 and spent much of his time at Novell reversing the direction the company had taken under Noorda, which didn't leave Novell with a coherent position or plan when Frankenberg &quot;resigned&quot; two years into the pivot he was leading.</p> <p>In many ways, a discussion of what tech execs at the time thought the future would look like and what paths would lead to success is more interesting than looking at futurists who basically don't understand the topics they're talking about, but I started this post to look at how well futurists understood the topics they discussed and I didn't know, in advance, that their understanding of the topics they discuss and resultant prediction accuracy would be so poor.</p> <h4 id="common-sources-of-futurist-errors">Common sources of futurist errors</h4> <ul> <li>Not learning from mistakes <ul> <li>Good predictors tend to be serious at looking at failed past predictions and trying to calibrate</li> </ul></li> <li>Reasoning from a <a href="cocktail-ideas/">cocktail party level understanding of a topic</a> <ul> <li>Good predictors tend to engage with ideas in detail</li> </ul></li> <li>Pushing one or a few &quot;big ideas&quot;</li> <li>Generally assuming high certainty about the future <ul> <li>Worse yet: assuming high certainty of scaling curves, especially exponential scaling curves</li> </ul></li> <li>Panacea thinking</li> <li>Only seeing the upside (or downside) of technological changes</li> <li>Starting from evidence-free assumptions</li> </ul> <h5 id="not-learning-from-mistakes">Not learning from mistakes</h5> <p>The futurists we looked at in this post tend to rate themselves quite highly and, after the fact, generally claim credit for being great predictors of the future, so much so that they'll even tell you how you can predict the future accurately. 
And yet, after scoring them, the most accurate futurist (among the ones who made concrete enough predictions that they could be scored) came in at 10% accuracy with generous grading that gave them credit for making predictions that accidentally turned out to be correct when they mispredicted the mechanism by which the prediction would come to pass (a strict reading of many of their predictions would reduce the accuracy further because they said that the prediction would happen because of their predicted mechanism, which is false, rendering the prediction false).</p> <p>There are two tricks that these futurists have used to be able to make such lofty claims. First, many of them make vague predictions and then claim credit if anything vaguely resembling the prediction comes to pass. Second, almost all of them make a lot of predictions and then only tally up the ones that came to pass. One way to look at a 4% accuracy rate is that you really shouldn't rely on that person's predictions. Another way is that, if they made 500 predictions, they're a great predictor because they made 20 accurate predictions. Since almost no one will bother to go through a list of predictions to figure out the overall accuracy when someone does the latter, making a huge number of predictions and then cherry-picking the ones that were accurate is a good strategy for becoming a renowned futurist.</p> <p>But if we want to figure out how to make accurate predictions, we'll have to look at other people's strategies. There are people who do make fairly good, generally directionally accurate, predictions, as we noted when we <a href="yegge-predictions/">looked at Steve Yegge's prediction record</a>. However, they tend to be harsh critics of their own predictions, as Steve Yegge was when <a href="https://sites.google.com/site/steveyegge2/ten-predictions">he reviewed his own prediction record, saying</a>:</p> <blockquote> <p>I saw the HN thread about Dan Luu's review of this post, and felt people were a little too generous with the scoring.</p> </blockquote> <p>It's unsurprising that a relatively good predictor of the future scored himself lower than I did because taking a critical eye to your own mistakes and calling yourself out for mistakes that are too small for most people to care about is a great way to improve. We can see this <a href="us-v-ms/">in communications from Microsoft leadership as well</a>, e.g., leadership calling themselves out for failing to predict that a lack of backwards compatibility would doom major efforts like OS/2 and LanMan. Doing what most futurists do and focusing on the predictions that worked out without looking at what went wrong isn't such a great way to improve.</p> <h5 id="cocktail-party-understanding">Cocktail party understanding</h5> <p>Another thing we see among people who make generally directionally correct predictions, as in the Steve Yegge post mentioned above, <a href="https://twitter.com/corry_wang/status/1340869586397372417">Nathan Myhrvold's 1993 &quot;Road Kill on the Information Highway&quot;</a>, Bill Gates's 1995 &quot;<a href="us-v-ms/">The Internet Tidal Wave</a>&quot;, etc., is that the person making the prediction actually understands the topic.
In all of the above examples, it's clear that the author of the document has a fairly strong technical understanding of the topics being predicted and, in the general case, it seems that people who have relatively accurate predictions are really trying to understand the topic, which is in stark contrast to the futurists discussed in this post, almost all of whom display clear signs of having a buzzword-level understanding<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">2</a></sup> of the topics they're discussing.</p> <p>There's a sense in which it isn't too difficult to make correct predictions if you understand the topic and have access to the right data. Before joining a huge megacorp and then watching the future unfold, I thought documents like &quot;Road Kill on the Information Highway&quot; and &quot;The Internet Tidal Wave&quot; were eerily prescient, but once I joined Google in 2013, a lot of trends that weren't obvious from the outside seemed fairly <a href="https://www.patreon.com/posts/how-do-their-71735437">obvious</a> from the inside.</p> <p>For example, it was obvious that mobile was very important for most classes of applications, so much so that most applications that were going to be successful would be &quot;mobile first&quot; applications where the web app was secondary, if it existed at all, and from the data available internally, this should've been obvious going back at least to 2010. Looking at what people were doing on the outside, quite a few startups in areas where mobile was critical were operating with a 2009 understanding of the future even as late as 2016 and 2017, focusing on a web app first, with no mobile app and a web app that was unusable on mobile. Another example of this is that, in 2012, <a href="https://twitter.com/danluu/status/1571051251357589510">quite a few people at Google independently wanted Google to make very large bets on deep learning</a>. It seemed very obvious that deep learning was going to be a really big deal and that it was worth making a billion-dollar investment in a portfolio of hardware that would accelerate Google's deep learning efforts.</p> <p>This isn't to say that the problem is trivial — many people with access to the same data still generally make incorrect predictions. A famous example is Ballmer's prediction that &quot;There’s no chance that the iPhone is going to get any significant market share. No chance.&quot;<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">3</a></sup> Ballmer and other MS leadership had access to information as good as what MS leadership had a decade earlier, but many of their predictions were no better than those of the futurists we discussed here. And with the deep learning example above, <a href="https://twitter.com/danluu/status/1571051255157653506">a competitor with the same information as Google totally whiffed and kept whiffing for years, even with the benefit of years of extra information</a>; they're still well behind Google now, a decade later, due to their failure to understand how to enable effective, practical, deep learning R&amp;D.</p> <h5 id="assuming-high-certainty">Assuming high certainty</h5> <p>Another common cause of incorrect predictions was having high certainty.
That's a general problem, and it's magnified when making predictions by looking at past exponential growth and extrapolating into the future, both because mispredicting the timing of a large change in exponential growth can have a very large impact and because relatively small sustained changes in an exponential growth rate can also have a large impact. An example that exposed these weaknesses for a large fraction of our futurists was their interpretation of Moore's law, which many interpreted as a doubling of every good thing and/or halving of every bad thing related to computers every 18 months. That was never what Moore's law predicted in the first place, but it was a common pop-conception of Moore's law. One illustrative thing is that predictors who were writing in the late 90s and early 00s still made these fantastical Moore's law &quot;based&quot; predictions even though it was such common knowledge that both single-threaded computer performance and Moore's law would face significant challenges that this was taught in undergraduate classes at the time. Any futurist who spent a few minutes talking to an expert in the area or even an undergrad would've seen that there's a high degree of uncertainty about computer performance scaling, but most of the futurists we discuss either don't do that or ignore evidence that would add uncertainty to their narrative<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">4</a></sup>. One example of this style of reasoning, discussing the design of the Flare programming language:</p> <blockquote> <p>As computing power increases, all constant-factor inefficiencies (&quot;uses twice as much RAM&quot;, &quot;takes three times as many RISC operations&quot;) tend to be ground under the heel of Moore's Law, leaving polynomial and exponentially increasing costs as the sole legitimate areas of concern. Flare, then, is willing to accept any O(C) inefficiency (single, one-time cost), and is willing to accept most O(N) inefficiencies (constant-factor costs), because neither of these costs impacts scalability; Flare programs and program spaces can grow without such costs increasing in relative significance. You can throw hardware at an O(N) problem as N increases; throwing hardware at an O(N**2) problem rapidly becomes prohibitively expensive.</p> </blockquote> <p>For computer scaling in particular, it would've been possible to make a reasonable prediction about 2022 computers in, say, 2000, but it would've had to be a prediction about the distribution of outcomes, one which put a lot of weight on severely reduced performance gains in the future with some weight on a portfolio of possibilities that could've resulted in continued large gains. Someone making such a prediction would've had to, implicitly or explicitly, be familiar with <a href="https://en.wikipedia.org/wiki/International_Technology_Roadmap_for_Semiconductors">ITRS</a> semiconductor scaling roadmaps of the era as well as the causes of recent misses (my recollection from reading roadmaps back then was that, in the short term, companies had actually exceeded recent scaling predictions, but via mechanisms that were not expected to be scalable into the future) as well as things that could unexpectedly keep semiconductor scaling on track.
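<p>To see why getting the distribution right matters so much, here's a minimal sketch (the rates below are round numbers I picked for illustration, not figures from any roadmap) of how much a sustained difference in the assumed annual improvement rate compounds over a 2000-to-2022 horizon:</p> <pre><code># Illustrative sketch: small sustained differences in an assumed annual
# improvement rate compound into enormous differences over 22 years.
# The rates below are made-up round numbers, not measured values.
horizon_years = 22  # e.g., predicting 2022 hardware from 2000

for label, annual_gain in [('naive exponential extrapolation (60%/year)', 0.60),
                           ('slowing but still fast (40%/year)', 0.40),
                           ('modest single-thread gains (10%/year)', 0.10)]:
    total = (1 + annual_gain) ** horizon_years
    print(f'{label}: ~{total:,.0f}x over {horizon_years} years')
</code></pre> <p>The specific rates are arbitrary, but the gap between the outcomes (tens of thousands of times better vs. an order of magnitude better) is why a single point estimate of the curve, rather than a distribution over plausible curves, is so fragile.</p>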
Furthermore, such a predictor would also have to be able to evaluate architectural ideas that might have panned out in order to rule them out or assign them a low probability: dataflow processors, the basket of techniques people were working on to increase <a href="https://en.wikipedia.org/wiki/Instruction-level_parallelism">ILP</a> in an attempt to move from the regime Tjaden and Flynn discussed in their classic 1970 and 1973 papers on ILP to something closer to the bound discussed by Riseman and Foster in 1972 and later by Nicolau and Fisher in 1984, etc.</p> <p>Such a prediction would be painstaking work for someone who isn't in the field because of the sheer number of different things that could have impacted computer scaling. Instead of doing this, futurists relied heavily on the pop-understanding they had about semiconductors. Kaku was notable among the futurists under discussion for taking seriously the idea that Moore's law wouldn't be smooth sailing in the future, but he incorrectly called when UV/EUV would run out of steam and also incorrectly had high certainty that some kind of more &quot;quantum&quot; technology would save computer performance scaling. Most other futurists who discussed computers used a line of reasoning like Kurzweil's, who said that we can predict what will happen with &quot;remarkable precision&quot; due to the existence of &quot;well-defined indexes&quot;:</p> <blockquote> <p>The law of accelerating returns applies to all of technology, indeed to any evolutionary process. It can be charted with remarkable precision in information-based technologies because we have well-defined indexes (for example, calculations per second per dollar, or calculations per second per gram) to measure them</p> </blockquote> <p>Another thing to note here is that, even if you correctly predict an exponential curve of something, understanding the implications of that precise fact also requires an understanding of the big picture of the kind shown by people like Yegge, Gates, and Myhrvold but not by the futurists discussed here. An example of roughly getting a scaling curve right but mispredicting the outcome was Dixon on the number of phone lines people would have in their homes. Dixon at least roughly correctly predicted the declining cost of phone lines but incorrectly predicted that this would result in people having many phone lines in their house despite also believing that digital technologies and cell phones would have much faster uptake than they did. With respect to phones, another missed prediction, one that came from not understanding the mechanism, was his prediction that the falling cost of phone calls would mean that tracking phone calls would be so expensive relative to the cost of the calls that phone companies wouldn't track individual calls.</p> <p>For someone who has a bit of understanding of the underlying technology, this is an odd prediction. One reason the prediction seems odd is that the absolute cost of tracking who called whom is very small and the rate at which humans make and receive phone calls is bounded at a relatively low rate, so even if the cost of metadata tracking were very high compared to the cost of the calls themselves, the absolute cost of tracking metadata would still be very low. Another way to look at it would be to compare the number of bits of information transferred during a phone call vs.
the number of bits of information necessary to store call metadata and the cost of storing that metadata long enough to bill someone on a per-call basis. Unless medium-term storage became more expensive than network transfer by a mind-bogglingly large factor, it wouldn't be possible for this prediction to be true. Dixon also implicitly predicted exponentially falling storage costs via his predictions on the size of available computer storage, with a curve steep enough that this criterion shouldn't be satisfied; even if it were somehow satisfied, the cost of storage would still be so low as to be negligible.</p> <h5 id="panacea-thinking">Panacea thinking</h5> <p>Another common issue is what Waleed Khan calls <a href="https://twitter.com/arxanas/status/1560756277231644673">panacea thinking</a>, where the person assumes that the solution is a panacea that is basically unboundedly great and can solve all problems. We can see this for quite a few futurists who were writing up until the 70s, where many assumed that computers would be able to solve any problem that required thought, computation, or allocation of resources and that resource scarcity would become irrelevant. But it turns out that quite a few problems don't magically get solved because powerful computers exist. For example, the 2008 housing crash created a shortfall of labor for housing construction that only barely got back to historical levels just before covid hit. Having fast computers neither prevented this nor fixed this problem after it happened because the cause of the problem wasn't a shortfall of computational resources. Some other topics that get this treatment are &quot;nanotechnology&quot;, &quot;quantum&quot;, &quot;accelerating growth&quot; / &quot;decreased development time&quot;, etc.</p> <p>A closely related issue that almost every futurist here fell prey to is only seeing the upside of technological advancements, resulting in a kind of techno-utopian view of the future. For example, in 2005, Kurzweil wrote:</p> <blockquote> <p>The current disadvantages of Web-based commerce (for example, limitations in the ability to directly interact with products and the frequent frustrations of interacting with inflexible menus and forms instead of human personnel) will gradually dissolve as the trends move robustly in favor of the electronic world. By the end of this decade, computers will disappear as distinct physical objects, with displays built in our eyeglasses, and electronics woven in our clothing, providing full-immersion visual virtual reality. Thus, &quot;going to a Web site&quot; will mean entering a virtual-reality environment—at least for the visual and auditory senses—where we can directly interact with products and people, both real and simulated.</p> </blockquote> <p>Putting aside the bit about how non-VR computer interfaces would disappear before 2010, it's striking how Kurzweil assumes that technological advancement will mean that corporations make experiences better for consumers instead of providing the same level of experience at a lower cost or <a href="https://twitter.com/ColeSouth/status/1550230811559198720">a worse experience at an even lower cost</a>.<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">5</a></sup></p> <p>Although that example is from Kurzweil, we can see the same techno-utopianism in the other authors on Wikipedia's list with the exception of Zerzan, whose predictions I didn't tally up because prediction wasn't really his shtick.
For example, a number of other futurists combined panacea thinking with techno-utopianism to predict that computers would cause things to operate with basically perfect efficiency without human intervention, allowing people at large to live a life of leisure. Instead, the benefits to the median person in the U.S. are subtle enough that people debate whether or not life has improved at all for the median person. And on the topic of increased efficiency, a number of people predicted an extreme version of just-in-time delivery that humanity hasn't even come close to achieving and described its upsides, but no futurist under discussion mentioned the downsides of a worldwide distributed just-in-time manufacturing system and supply chain, which include increased fragility and decreased robustness; this fragility notably impacted quite a few industries from 2020 through at least 2022 due to covid, despite the worldwide system not being anywhere near as just-in-time or fragile as a number of futurists predicted.</p> <p>Though not discussed here because they weren't on Wikipedia's list of notable futurists, there are pessimistic futurists such as Jaron Lanier and Paul Ehrlich. From a quick informal look at relatively well-known pessimistic futurists, it seems that pessimistic futurists haven't been more accurate than optimistic futurists. Many made predictions that were too vague to score and the ones who didn't tended to predict catastrophic collapse or overly dystopian futures <a href="https://www.gwern.net/Improvements">which haven't materialized</a>. Fundamentally, dystopian thinkers made the same mistakes as utopian thinkers. For example, Paul Ehrlich fell prey to the same issues utopian thinkers fell prey to and he still maintains that his discredited book, The Population Bomb, was fundamentally correct, just like utopian futurists who maintain that their discredited work is fundamentally correct.</p> <p>Ehrlich's 1968 book opened with:</p> <blockquote> <p>The battle to feed all of humanity is over. In the 1970s the world will undergo famines — hundreds of millions of people are going to starve to death in spite of any crash programs embarked upon now. At this late date nothing can prevent a substantial increase in the world death rate, although many lives could be saved through dramatic programs to &quot;stretch&quot; the carrying capacity of the earth by increasing food production. But these programs will only provide a stay of execution unless they are accompanied by determined and successful efforts at population control. Population control is the conscious regulation of the numbers of human beings to meet the needs, not just of individual families, but of society as a whole.</p> <p>Nothing could be more misleading to our children than our present affluent society. They will inherit a totally different world, a world in which the standards, politics, and economics of the 1960s are dead.</p> </blockquote> <p>When this didn't come to pass, he did the same thing as many futurists we looked at and moved the dates on his prediction, changing the text in the opening of his book from &quot;1970s&quot; to &quot;1970s and 1980s&quot;.
Ehrlich then wrote a new book with even more dire predictions in 1990.</p> <p>And then later, Ehrlich simply denied ever having made predictions, even though anyone who reads his book can plainly see that he makes plenty of statements about the future with no caveats about the statements being hypothetical:</p> <blockquote> <p>Anne and I have always followed UN population projections as modified by the Population Reference Bureau — so we never made &quot;predictions,&quot; even though idiots think we have.</p> </blockquote> <p>Unfortunately for pessimists, simply swapping the sign bit on panacea thinking doesn't make predictions more accurate.</p> <h5 id="evidence-free-assumptions">Evidence free assumptions</h5> <p>Another major source of errors among these futurists was making an instrumental assumption without any supporting evidence for it. A major example of this is Fresco's theory that you can predict the future by starting from people's values and working back from there, but he doesn't seriously engage with the idea of how people's values can be predicted. Since those are pulled from his intuition without being grounded in evidence, starting from people's values creates a level of indirection, but doesn't fundamentally change the problem of predicting what will happen in the future.</p> <h3 id="fin">Fin</h3> <p>A goal of this project is to look at current predictors to see who's using methods that have historically had a decent accuracy rate, but we're going to save that for a future post. I normally don't like splitting posts up into multiple parts, but since this post is 30k words (the number of words in a small book, and more words than most pop-sci books have once you remove the pop stories) and evaluating futurists is relatively self-contained, we're going to stop with that (well, with a bit of an evaluation of some longtermist analyses that overlap with this post in the appendix)<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">6</a></sup>.</p> <p>In terms of concrete takeaways, you could consider this post a kind of negative result that supports the very boring idea that you're not going to get very far if you make predictions on topics you don't understand, whereas you might be able to make decent predictions if you have (or gain) a deep expertise of a topic and apply well-honed intuition to predict what might happen. We've looked at, in some detail, a number of common reasoning errors that cause predictions to miss at a high rate and also taken a bit of a look into some things that have worked for creating relatively accurate predictions.</p> <p>A major caveat about what's worked is that while using high-level techniques that work poorly is a good way to generate poor predictions, using high-level techniques that work well doesn't mean much because the devil is in the details and, as trite as this is to say, you really need to think about things. This is something that people who are serious about looking at data often preach, e.g., you'll see this theme come up on <a href="https://statmodeling.stat.columbia.edu/">Andrew Gelman's blog</a> as well as in <a href="https://www.youtube.com/playlist?list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN">Richard McElreath's Statistical Rethinking</a>. McElreath, in a lecture targeted at social science grad students who don't have a quantitative background, likens statistical methods to a golem. A golem will mindlessly do what you tell it to do, just like statistical techniques. 
There's no substitute for using your brain to think through whether or not it's reasonable to apply a particular statistical technique in a certain way. People often seem to want to use methods as a talisman to ward off incorrectness, but that doesn't work.</p> <p>We see this in the longtermist analyses we examine in the appendix, which claim to be more accurate than &quot;classical&quot; futurist analyses because they, among other techniques, state probabilities, which the literature on forecasting (e.g., Tetlock's Superforecasting) says one should do. But the analyses fundamentally use the same techniques as the futurist analyses we looked at here and then add a few things on top that are also things that people who make accurate predictions do. This is backwards. Things like probabilities need to be a core part of modelling, not something added afterwards. This kind of backwards reasoning is a common error when doing data analysis and I would caution readers who think they're safe against errors because their analyses can, at a high level, be described in roughly the same terms as good analyses<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">7</a></sup>. An obvious example of this would be the Bill Gates review we looked at. Gates asked a lot of questions and scribbled quite a few notes in the margins, but asking a lot of questions and scribbling notes in the margins of docs doesn't automatically cause you to have a good understanding of the situation. This example is so absurd that I don't think anyone even remotely reasonable would question it, but most analyses I see (of the present as well as of the future) make this fundamental error in one way or another and, as <a href="https://twitter.com/rygorous/status/1569379986708254721">Fabian Giesen might say, are cosplaying what a rigorous analysis looks like</a>.</p> <p><i>Thanks to nostalgebraist, Arb Research (Misha Yagudin, Gavin Leech), Laurie Tratt, Fabian Giesen, David Turner, Yossi Kreinin, Catherine Olsson, Tim Pote, David Crawshaw, Jesse Luehrs, @TyphonBaalAmmon, Jamie Brandon, Tao L., Hillel Wayne, Qualadore Qualadore, Sophia, Justin Blank, Milosz Danczak, Waleed Khan, Mindy Preston, @ESRogs, Tim Rice, and @s__video for comments/corrections/discussion (and probably some others I forgot because this post is so long and I've gotten so many comments).</i></p> <p><b>Update / correction</b>: an earlier version of this post contained <a href="https://twitter.com/ESRogs/status/1570505904445095939">this error</a>, pointed out by ESRogs. Although I don't believe the error impacts the conclusion, <a href="https://twitter.com/danluu/status/1570520176239726593">I consider it a fairly major error</a>. If we were doing a tech-company style postmortem, the fact that it doesn't significantly impact the conclusion would be included in the &quot;How We Got Lucky&quot; section of the postmortem. In particular, this was a &quot;lucky&quot; error because it was made when picking out a few examples from a large portfolio of errors to illustrate one predictor's errors, so a single incorrectly identified error doesn't change the conclusion: another error could be substituted in and, even if no other error were substituted, the quality of the reasoning being evaluated still looks quite low.
But incorrectly concluding that something is an error could lead to a different conclusion in the case of a predictor who made few or no errors, which is why this was a lucky mistake for me to make.</p> <h3 id="appendix-brief-notes-on-superforecasting-https-amzn-to-3xzg3a2">Appendix: brief notes on <a href="https://amzn.to/3xzG3a2">Superforecasting</a></h3> <ul> <li>Very difficult to predict more than 3-5 years out; people generally don't do much better than random <ul> <li>Later in the book, 10 years is cited as a basically impossible timeframe, but that's scoped to certain kinds of predictions (the earlier statement of 3-5 years is more general): <blockquote> <p>Taleb, Kahneman, and I agree there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious—“there will be conflicts”—and the odd lucky hits that are inevitable whenever lots of forecasters make lots of forecasts. These limits on predictability are the predictable results of the butterfly dynamics of nonlinear systems. In my EPJ research, the accuracy of expert predictions declined toward chance five years out. And yet, this sort of forecasting is common, even within institutions that should know better</p> </blockquote></li> <li>One possibility is that people like Bill Gates are right due to hindsight bias, but that doesn't seem correct w.r.t., e.g., being at Google making it obvious that mobile was the only way forward circa 2010</li> </ul></li> <li>Ballmer prediction: &quot;There’s no chance that the iPhone is going to get any significant market share. No chance.&quot;</li> <li>Very important to precisely write down forecasts</li> <li>&quot;big idea&quot; predictors inaccurate (as in, they heavily rely on one or a few big hammers, like &quot;global warming&quot;, &quot;ecological disaster&quot;, &quot;Moore's law&quot;, etc., to drive everything)</li> <li>Specific knowledge predictors (relatively) accurate; relied heavily on probabilistic thinking, used different analytical tools as appropriate</li> <li>Good forecasters are fluent with numbers, generally aced the numerical proficiency test given to forecasters, think probabilistically</li> <li>Good forecasters not particularly high IQ; typical non-superforecaster IQ from the forecaster population was 70%-ile; typical superforecaster IQ was 80%-ile</li> </ul> <p>See also <a href="https://marginalrevolution.com/marginalrevolution/2020/04/my-conversation-with-philip-tetlock.html">this Tetlock interview with Tyler Cowen</a> if you don't want to read the whole book, although the book is a very quick read because it's written in the standard pop-sci style, with a lot of anecdotes/stories.</p> <p>On the people we looked at vs. the people Tetlock looked at, the predictors we looked at are operating in a very different style from the folks studied in the studies that led to the Superforecasting book. Both futurists and tech leaders were trying to predict a vision for the future whereas superforecasters were asked to answer very specific questions.</p> <p>Another major difference among the accurate predictors is that the accurate predictors we looked at (other than Caplan) had very deep expertise in their fields. This may be one reason for the difference in timelines here, where it appears that some of our predictors can predict things more than 3-5 years out, contra Tetlock's assertion.
Another difference is in the kind of thing being predicted — a lot of the predictions we're looking at here are fundamentally whether or not a trend will continue or if a nascent trend will become a long-running trend, which seems easier than a lot of the questions Tetlock had his forecasters try to answer. For example, in the opening of Superforecasting, Tetlock gives predicting the Arab Spring as an example of something that would've been practically impossible — while the conditions for it had been there for years, the proximal cause of the Arab Spring was a series of coincidences that would've been impossible to predict. This is quite different from, and arguably much more difficult than, someone in 1980 guessing that computers will continue to get smaller and faster, leading to handheld computers more powerful than supercomputers from the 80s.</p> <h3 id="appendix-other-evaluations">Appendix: other evaluations</h3> <ul> <li><a href="http://jbr.me.uk/retro/">Justin Rye on Heinlein, Clarke, and Asimov</a></li> <li><a href="https://www.cold-takes.com/the-track-record-of-futurists-seems-fine/">Holden Karnofsky / Arb Research on Heinlein, Clarke, and Asimov</a>, as well as Karnofsky on Kurzweil, Kahn, and Weiner</li> <li>Various, on Ray Kurzweil (try googling, without quotes, &quot;Kurzweil 86% accuracy&quot;)</li> <li><a href="https://news.ycombinator.com/item?id=31294698">A variety of HN commenters on a futurist who scored themselves at 50% accuracy</a></li> <li><a href="https://tratt.net/laurie/blog/2005/predicting_the_future_of_computing.html">Laurie Tratt</a> on <a href="https://web.archive.org/web/20150325002205/http://www.ukcrc.org.uk/press/news/report/gcresearch.cfm?type=pdf">some 2005 predictions on what would be important in computing</a></li> <li><a href="https://www.markloveless.net/blog/2023/01/02/past-predictions">Mark Loveless</a> on his own infosec predictions as far back as 1995</li> </ul> <p>Of the evaluations above, the only one with any intersection with the futurists evaluated here is the Kurzweil one. Holden Karnofsky says:</p> <blockquote> <p>A 2013 project assessed Ray Kurzweil's 1999 predictions about 2009, and a 2020 followup assessed his 1999 predictions about 2019. Kurzweil is known for being interesting at the time rather than being right with hindsight, and a large number of predictions were found and scored, so I consider this study to have similar advantages to the above study. ... Kurzweil is notorious for his very bold and contrarian predictions, and I'm overall inclined to call his track record something between &quot;mediocre&quot; and &quot;fine&quot; - too aggressive overall, but with some notable hits</p> </blockquote> <p>Karnofsky's evaluation of Kurzweil being &quot;fine&quot; to &quot;mediocre&quot; relies on <a href="https://www.lesswrong.com/posts/kbA6T3xpxtko36GgP/assessing-kurzweil-the-results">these</a> <a href="https://www.lesswrong.com/posts/NcGBmDEe5qXB7dFBF/assessing-kurzweil-predictions-about-2019-the-results">two</a> analyses done on LessWrong and then uses a very generous interpretation of the results to conclude that Kurzweil's predictions are fine. Those two posts rate predictions as true, weakly true, cannot decide, weakly false, or false. Karnofsky then compares the number of true + weakly true to false + weakly false, which is one level of rounding up to get an optimistic result; another way to look at it is that any level other than &quot;true&quot; is false when read as written.
This issue is magnified if you actually look at the data and methodology used in the LW analyses.</p> <p>In the second post, the author, Stuart Armstrong, indirectly noted that there were actually no predictions that were, by strong consensus, very true when he noted that the &quot;most true&quot; prediction had a mean score of 1.3 (1 = true, 2 = weakly true, ..., 5 = false) and the second-highest-rated prediction had a mean score of 1.4. Although Armstrong doesn't note this in the post, if you look at the data, you'll see that the third &quot;most true&quot; prediction had a mean score of 1.45 and the fourth had a mean score of 1.6, i.e., if you round to the nearest prediction score, only 3 out of 105 predictions score &quot;true&quot; and 32 are &gt;= 4.5 and score &quot;false&quot;. Karnofsky reads Armstrong's post as scoring 12% of predictions true, but the post effectively makes no comment on what fraction of predictions were scored true; the 12% comes from summing up the total number of each rating given.</p> <p>I'm not going to say that taking the mean of each question is the only way one could aggregate the numbers (taking the median or modal values could also be argued for, as well as some more sophisticated scoring function, an extremizing function, etc.), but summing up all of the votes across all questions results in a nonsensical number that shouldn't be used for almost anything. If every rater rated every prediction or there was a systematic interleaving of who rated what questions, then the number could be used for something (though not as a score for what fraction of predictions are accurate), but since each rater could skip any questions (although people were instructed to start rating at the first question and rate all questions until they stop, people did not do that and skipped arbitrary questions), aggregating the number of each score given is not meaningful and actually gives very little insight into what fraction of questions are true; the toy example below shows how pooled vote counts and per-prediction means can point in different directions. There's an air of rigor about all of this; there are lots of numbers, standard deviations are discussed, etc., but the way most people, including Karnofsky, interpret the numbers in the post is incorrect. I find it a bit odd that, with all of the commentary on these LW posts, few people spent the one minute (and I mean one minute literally — it took me a minute to read the post, see the comment Armstrong made which is a red flag, and then look at the raw data) it would take to look at the data and understand what the post is actually saying, but <a href="dunning-kruger/">as we've noted previously, almost no one actually reads what they're citing</a>.</p> <p>Coming back to Karnofsky's rating of Kurzweil as fine to mediocre, this relies on two levels of rounding: first, doing the wrong kind of aggregation on the raw data rounds an accuracy of perhaps 3% up to 12%; then, doing the comparison mentioned above instead of looking at the number of true statements rounds up again. If we use a strict reading and look at the 3%, the numbers aren't so different from what we see in this post. If we look at Armstrong's other post, there are too few raters to really produce any kind of meaningful aggregation. Armstrong rated every prediction, one person rated 68% of predictions, and no one else even rated half of the 172 predictions. The 8 raters gave 506 ratings in total, so the number of ratings is equivalent to having 3 raters rate all predictions, but the results are much noisier due to the arbitrary way people decided to pick predictions.
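<p>As a toy illustration of the aggregation problem (these are numbers I made up, not the LW ratings), here's how pooling raw votes can suggest a meaningfully different &quot;percent true&quot; than scoring each prediction by the mean of the ratings it actually received:</p> <pre><code># Toy example with invented data (not the LW ratings): raters skip arbitrary
# predictions, so pooling raw votes weights predictions unevenly.
# Scale: 1 = true, 2 = weakly true, ..., 5 = false.
ratings = {
    'prediction A': [1, 5, 5],   # one enthusiastic rater, two harsh ones
    'prediction B': [1, 4, 4],
    'prediction C': [2, 2],
    'prediction D': [5, 5, 5],
}

# Method 1: score each prediction by its own mean rating, then count the true ones.
true_by_mean = sum(1 for votes in ratings.values()
                   if round(sum(votes) / len(votes)) == 1)
print(f'predictions scoring true by per-prediction mean: {true_by_mean} of {len(ratings)}')

# Method 2: pool every individual vote, as if votes were interchangeable.
all_votes = [v for votes in ratings.values() for v in votes]
share_true_votes = sum(1 for v in all_votes if v == 1) / len(all_votes)
print(f'share of pooled raw votes that say true: {share_true_votes:.0%}')
</code></pre> <p>With these made-up numbers, no prediction scores &quot;true&quot; on a per-prediction basis, but 18% of the pooled votes are &quot;true&quot; votes; the two methods answer different questions, and only the first answers &quot;what fraction of predictions were rated true?&quot;.</p>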
This issue is much worse for the 2009 predictions than the 2019 predictions due to the smaller number of raters combined with the sparseness of most raters, making this data set fairly low fidelity; if you want to make a simple inference from the 2019 data, you're probably best off using Armstrong's ratings and discarding the rest (there are non-simple analyses one could do, but if you're going to do that, you might as well just rate the predictions yourself).</p> <p>Another fundamental issue with the analysis is that it relies on aggregating votes from a population that's heavily drawn from Less Wrong readers and the associated community. <a href="https://www.patreon.com/posts/how-do-their-71735437">As we discussed here</a>, it's common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful? I often see people refer to the high-level &quot;wisdom of crowds&quot; idea, but if we look at the specific statements endorsed by online crowds, we can see that these crowds are often not so wise. In the Arb Research evaluation (discussed below), they get around this problem by reviewing answers themselves and also offering a bounty for incorrectly graded predictions, which is one way to deal with having untrustworthy raters, but Armstrong's work has no mitigation for this issue.</p> <p>On the Karnofsky / Arb Research evaluation, Karnofsky appears to use a less strict scoring than I do and once again optimistically &quot;rounds up&quot;. The Arb Research report scores each question as &quot;unambiguously wrong&quot;, &quot;ambiguous or near miss&quot;, or &quot;unambiguously right&quot; but Karnofsky's scoring removes the ambiguous and near miss results, whereas my scoring only removes the ambiguous results, the idea being that a near miss is still a miss. Accounting for those reduces the scores substantially but still leaves Heinlein, Clarke, and Asimov with significantly higher scores than the futurists discussed in the body of this post. For the rest, many of the predictions that were scored as &quot;unambiguously right&quot; are ones I would've declined to rate, for reasons similar to those for the predictions I did decline to rate (e.g., a prediction that something &quot;may well&quot; happen was rated as &quot;unambiguously right&quot; and I would consider that unfalsifiable and therefore not include it). There are also quite a few &quot;unambiguously right&quot; predictions that I would rate as incorrect using a strict reading similar to the readings that you can see below in the detailed appendix.</p> <p>Another place where Karnofsky rounds up is that Arb Research notes that 'The predictions are usually very vague. Almost none take the form “By Year X technology Y will pass on metric Z”'. This makes the prediction accuracy of the futurists Arb Research looked at not comparable to precise predictions of the kind Caplan or Karnofsky himself makes, but Karnofsky directly uses those numbers to justify why his own predictions are accurate without noting that the numbers are not comparable.
Since the non-comparable numbers were already rounded up, there are two levels of rounding here (more on this later).</p> <p>As noted above, some of the predictions are ones that I wouldn't rate because I don't see where the prediction is, such as this one (this is the &quot;exact text&quot; of the prediction being scored, according to the Arb Research spreadsheet), which was scored &quot;unambiguously right&quot;:</p> <blockquote> <p>application of computer technology to professional sports be counterproduc- tive? Would the public become less interested in sports or in betting on the outcome if matters became more predictable? Or would there always be enough unpredictability to keep interest high? And would people derive particular excitement from beat ing the computer when low-ranking players on a particular team suddenly started</p> </blockquote> <p>This seems like a series of questions about something that might happen, but it wouldn't be false if none of these happened, so it would not count as a prediction in my book.</p> <p>Similarly, I would not have rated the following prediction, which Arb also scored &quot;unambiguously right&quot;:</p> <blockquote> <p>its potential is often realized in ways that seem miraculous, not because of idealism but because of the practical benefits to society. Thus, the computer's ability to foster human creativity may well be utilized to its fullest, not because it would be a wonderful thing but because it will serve important social functions Moreover, we are already moving in the</p> </blockquote> <p>Another kind of prediction that was sometimes scored &quot;unambiguously right&quot; that I declined to score was the prediction of the form &quot;this trend that's in progress will become somewhat {bigger / more important}&quot;, such as the following:</p> <blockquote> <p>The consequences of human irresponsibility in terms of waste and pollution will become more apparent and unbearable with time and again, attempts to deal with this will become more strenuous. It is to be hoped that by 2019, advances in technology will place tools in our hands that will help accelerate the process whereby the deterioration of the environment will be reversed.</p> </blockquote> <p>On Karnofsky's larger point, that we should trust longtermist predictions because futurists basically did fine and longtermists are taking prediction more seriously and trying harder and should therefore generate better predictions, that's really a topic for another post, but I'll briefly discuss it here because of the high intersection with this post. There are two main pillars of this argument. First, that futurists basically did fine, which, as we've seen, relies on a considerable amount of rounding up. 
And second, that the methodologies that longtermists are using today are considerably more effective than what futurists did in the past.</p> <p>Karnofsky says that the futurists he looked at &quot;collect casual predictions - no probabilities given, little-to-no reasoning given, no apparent attempt to collect evidence and weigh arguments&quot;, whereas Karnofsky's summaries use (among other things):</p> <ul> <li>Reports that Open Philanthropy employees spent thousands of hours on, systematically presenting evidence and considering arguments and counterarguments.</li> <li>A serious attempt to take advantage of the nascent literature on how to make good predictions; e.g., the authors (and I) have generally done calibration training, and have tried to use the language of probability to be specific about our uncertainty.</li> </ul> <p>We've seen that, when evaluating futurists with an eye towards evaluating longtermists, Karnofsky heavily rounds up in the same way Kurzweil and other futurists do, to paint the picture they want to create. There's also the matter of his summary of a report on Kurzweil's predictions being incorrect because he didn't notice that the author of that report used a methodology that produced nonsense numbers favorable to the conclusion Karnofsky prefers. It's true that Karnofsky and the reports he cites do the superficial things that the forecasting literature notes are associated with more accurate predictions, like stating probabilities. But for this to work, the probabilities need to come from understanding the data. If you take a pile of data, incorrectly interpret it, and then round up the interpretation further to support a particular conclusion, throwing a probability on it at the end is not likely to make it accurate. Although he doesn't use these words, a key thing Tetlock notes in his work is that people who round things up or down to conform to a particular agenda produce low-accuracy predictions. Since Karnofsky's errors and rounding heavily lean in one direction, that seems to be happening here.</p> <p>We can see this in other analyses as well. Although digging into material other than futurist predictions is outside of the scope of this post, nostalgebraist has done this and he said (in a private communication that he gave me permission to mention) that Karnofsky's summary of <a href="https://openphilanthropy.org/research/could-advanced-ai-drive-explosive-economic-growth/">https://openphilanthropy.org/research/could-advanced-ai-drive-explosive-economic-growth/</a> is substantially more optimistic about AI timelines than the underlying report in that there's at least one major concern raised in the report that's not brought up as a &quot;con&quot; in Karnofsky's summary, and nostalgebraist later wrote <a href="https://nostalgebraist.tumblr.com/post/693718279721730048/on-bio-anchors">this post</a>, where he (implicitly) notes that the methodology used in a report he examined in detail is fundamentally not so different from what the futurists we discussed used. There are quite a few things that may make the report appear credible (it's hundreds of pages of research, there's a complex model, etc.), but when it comes down to it, the model boils down to a few simple variables. In particular, a huge fraction of the variance in whether or not TAI is deemed likely comes down to the amount of improvement that will occur in terms of hardware cost, particularly <code>FLOPS/$</code>. 
The output of the model can range from 34% to 88% depending on how much improvement we get in <code>FLOPS/$</code> after 2025. Putting arbitrarily large <code>FLOPS/$</code> amounts into the model, i.e., the scenario where infinite computational power is free (since other dimensions, like storage and network, aren't in the model, let's assume that <code>FLOPS/$</code> is a proxy for those as well), only pushes the probability of TAI up to 88%, which I would rate as too pessimistic, although it's hard to have a good intuition about what would actually happen if infinite computational power were on tap for free. Conversely, with no performance improvement in computers, the probability of TAI is 34%, which I would rate as overly optimistic without a strong case for it. But I'm just some random person who doesn't work in AI risk and hasn't thought about it too much, so your guess on this is as good as mine (and likely better if you're the equivalent of Yegge or Gates and work in the area).</p> <p>The part that makes this fundamentally the same thing the futurists here did is that the estimate of <code>FLOPS/$</code> improvement, which is instrumental to this prediction, is pulled from thin air by someone who is not a deep expert in semiconductors, computer architecture, or a related field that might inform this estimate.</p> <p>As Karnofsky notes, a number of things were done in an attempt to make this estimate reliable (&quot;the authors (and I) have generally done calibration training, and have tried to use the language of probability&quot;) but, when you come up with a model where a single variable controls most of the variance and the estimate for that variable is picked out of thin air, all of the modeling work actually reduces my confidence in the estimate. If you say that, based on your intuition, you think there's some significant probability of TAI by 2100 (10% or 50% or 80% or whatever number you want), I'd say that sounds plausible (why not? things are improving quickly and may continue to do so) but wouldn't place any particular faith in the estimate. If you build a model where the output hinges on a relatively small number of variables and then say that there's an 80% chance, with a critical variable picked out of thin air, should that estimate be more or less confidence inspiring than the estimate based solely on intuition? I don't think the answer should be that the model's output is higher confidence. The direct guess of 80% is at least honest about its uncertainty. In the model-based case, since the model doesn't propagate uncertainties and the choice of a high but uncertain number can cause the model to output a fairly certain number, like 88%, there's a disconnect between the actual uncertainty produced by the model and the probability estimate.</p>
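<p>To illustrate the disconnect, here's a minimal sketch with a made-up toy mapping (it is not the actual model from the report): a single uncertain input, the assumed post-2025 <code>FLOPS/$</code> improvement in orders of magnitude (OOM), drives the output. Plugging in one point estimate yields a single confident-looking probability, while propagating even a rough guess at the input's uncertainty spreads the output across most of its possible range:</p>

<pre><code>import random

# Made-up monotonic curve pinned to the 34%..88% range quoted above; a toy
# stand-in for illustration only, not the report's actual model.
def toy_model(oom_improvement):
    clamped = min(max(oom_improvement, 0), 10)
    return 0.34 + 0.54 * clamped / 10

# A single point estimate (say, 6 OOM) produces one confident-looking number.
print(round(toy_model(6), 2))  # 0.66

# Treating the 6 OOM guess as uncertain (anywhere from 1 to 9 OOM, a range
# chosen arbitrarily here) spreads the output over much of the possible range.
samples = sorted(toy_model(random.uniform(1, 9)) for _ in range(10_000))
low, high = samples[len(samples) // 20], samples[19 * len(samples) // 20]
print(round(low, 2), round(high, 2))  # roughly 0.42 and 0.80
</code></pre>

<p>The specific numbers are invented; the point is that reporting a single confident-looking output (or an 88% upper bound) carries an implied precision that the underlying guess about the input can't support.</p>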
<p>At one point, in summarizing the report, Karnofsky says</p> <blockquote> <p>I consider the &quot;evolution&quot; analysis to be very conservative, because machine learning is capable of much faster progress than the sort of trial-and-error associated with natural selection. Even if one believes in something along the lines of &quot;Human brains reason in unique ways, unmatched and unmatchable by a modern-day AI,&quot; it seems that whatever is unique about human brains should be re-discoverable if one is able to essentially re-run the whole history of natural selection. And even this very conservative analysis estimates a ~50% chance of transformative AI by 2100</p> </blockquote> <p>But it seems very strong to call this a &quot;very conservative&quot; estimate when the estimate implicitly relies on future <code>FLOPS/$</code> improvement staying above some arbitrary, unsupported threshold. In the appendix of the report itself, it's estimated that there will be a 6 order of magnitude (OOM) improvement and that a 4 OOM improvement would be considered conservative, but why should we expect that 6 OOM is the amount of headroom left for hardware improvement and 4 OOM is some kind of conservative goal that we'll very likely reach? Given how instrumental these estimates are to the output of the model, there's a sense in which the uncertainty of the final estimate has to be at least as large as the uncertainty of these estimates multiplied by their impact on the model, but that can't be the case here given the lack of evidence or justification for these inputs to the model.</p> <p>More generally, the whole methodology is backwards — if you have deep knowledge of a topic, then it can be valuable to put a number down to convey the certainty of your knowledge to other people, and if you don't have deep knowledge but are trying to understand an area, then it can be valuable to state your uncertainties so that you know when you're just guessing. But here, we have a fairly confidently stated estimate (nostalgebraist notes that Karnofsky says &quot;Bio Anchors estimates a &gt;10% chance of transformative AI by 2036, a ~50% chance by 2055, and an ~80% chance by 2100.&quot;) that's based on a nonsense model that relies on a variable picked out of thin air. Naming a high probability after the fact and then naming a lower number and saying that's conservative when it's based on this kind of modeling is just window dressing. Looking at <a href="https://marginalrevolution.com/marginalrevolution/2022/03/holden-karnofsky-emails-me-on-transformative-ai.html">Karnofsky's comments elsewhere</a>, he lists a number of extremely weak pieces of evidence in support of his position, e.g., in the previous link, he has a laundry list of evidence of mixed strength, including Metaculus, which nostalgebraist has noted is basically worthless for this purpose <a href="https://nostalgebraist.tumblr.com/post/692246981744214016/more-on-metaculus-badness">here</a> and <a href="https://nostalgebraist.tumblr.com/post/692086358174498816/idk-who-needs-to-hear-this-but-metaculus-is">here</a>. It would be very odd for someone who's truth seeking on this particular issue to cite so many bad pieces of evidence; creating a laundry list of such mixed evidence is consistent with someone who has a strong prior belief and is looking for anything that will justify it, no matter how weak. That would also be consistent with the shoddy direct reasoning noted above.</p> <p>Back to other evaluators, on Justin Rye's evaluations, I would grade the predictions &quot;as written&quot; and therefore more strictly than he did and would end up with lower scores.</p> <p>For the predictors we looked at in this document who mostly or nearly exclusively give similarly vague predictions, I declined to give them anything like a precise numerical score. 
To be clear, I think there's value in trying to score vague predictions and near misses, but that's a different thing from what this document did, so the scores aren't directly comparable.</p> <p>A number of people have said that predictions by people who make bold predictions, the way Kurzweil does, are actually pretty good. After all, if someone makes a lot of bold predictions and they're all off by 10 years, that person will have useful insights even if they lose all their bets and get taken to the cleaners in prediction markets. However, that doesn't mean that someone who makes bold predictions should always &quot;get credit for&quot; making bold predictions. For example, in Kurzweil's case, 7% accuracy might not be bad if he uniformly predicted really bold stuff like unbounded life span by 2011. However, that only applies if the hits and misses are both bold predictions, which was not the case in the sampled set of predictions for Kurzweil here. Of Kurzweil's predictions evaluated in this document, the correct ones tended to be very boring, e.g., there will be no giant economic collapse that stops economic growth, cochlear implants will be in widespread use in 2019 (predicted in 1999), etc.</p> <p>The former is a Caplan-esque bet against people who were making wild predictions that there would be severe or total economic collapse. There's value in bets like that, but it's also not surprising when such a bet is successful. For the latter, the data I could quickly find on cochlear implant rates showed that implant rates increased slowly and linearly from the time Kurzweil made the bet until 2019. I would call that a correct prediction, but the prediction is basically just betting that nothing drastically drops cochlear implant rates, making that another Caplan-esque safe bet and not a bet that relies on Kurzweil's ideas about the law of accelerating returns that his wild bets rely on.</p> <p>If someone makes 40 boring bets of which 7 are right and another person makes 40 boring bets and 22 wild bets and 7 of their boring bets and 0 of their wild bets are right (these are arbitrary numbers as I didn't attempt to classify Kurzweil's bets as wild or not other than the 7 that were scored as correct), do you give the latter person credit for having &quot;a pretty decent accuracy given how wild their bets were&quot;? I would say no.</p>
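<p>Spelling out the arithmetic with those hypothetical numbers (they're the illustrative counts from the previous paragraph, not a tally of Kurzweil's actual bets):</p>

<pre><code># Hypothetical counts from the example above, not a tally of Kurzweil's bets.
boring_made, boring_right = 40, 7
wild_made, wild_right = 22, 0

person_a_accuracy = boring_right / boring_made  # 0.175
person_b_accuracy = (boring_right + wild_right) / (boring_made + wild_made)  # ~0.113

# Conditional on boldness, the second person's accuracy is 0/22, so any extra
# credit for boldness would be granted for bets that all missed.
person_b_bold_accuracy = wild_right / wild_made  # 0.0

print(person_a_accuracy, person_b_accuracy, person_b_bold_accuracy)
</code></pre>

<p>Grading on a curve for boldness only helps if the bold bets themselves hit at a reasonable rate, which wasn't the case in the sample scored here.</p>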
<p>On the linked HN thread about a particular futurist, the futurist scored themselves 5 out of 10, but most HN commenters scored the same person at 0 out of 10 or, generously, at 1 out of 10, with the general comment that the person and other futurists tend to score themselves much too generously:</p> <blockquote> <p>sixQuarks: I hate it when “futurists” cherry pick an outlier situation and say their prediction was accurate - like the bartender example.</p> <p>karaterobot: I wanted to say the same thing. He moved the goal posts from things which &quot;would draw hoots of derision from an audience from the year 2022&quot; to things which there has been some marginal, unevenly distributed, incremental change to in the last 10 years, then said he got it about 50% right. More generally, this is the issue I have with futurists: they get things wrong, and then just keep making more predictions. I suppose that's okay for them to do, unless they try to get people to believe them, and make decisions based on their guesses.</p> <p>chillacy: Reminded me of the ray [kurzweil] predictions: extremely generous grading.</p> </blockquote> <h3 id="appendix-other-reading">Appendix: other reading</h3> <ul> <li><a href="dick-sites-alpha-axp-architecture.pdf">Richard Sites and his DEC colleagues presciently looking 30+ years into the future with respect to computer architecture (written in 1992, summarizing work started in 1988)</a> <ul> <li>Not included in the main list of people with accurate predictions because the implicit and explicit predictions here are so narrow, but this is a stellar example of using deep domain knowledge to foresee the future as well as understand what current actions will make sense decades down the line</li> </ul></li> <li><a href="https://statmodeling.stat.columbia.edu/2022/08/11/bets-as-forecasts-bets-as-probability-assessment-difficulty-of-using-bets-in-this-way/">Andrew Gelman on forecast bets as probability assessments</a></li> <li><a href="https://nostalgebraist.tumblr.com/post/684991616356876288/its-driving-me-up-the-wall-the-way-i-keep-seeing">Nostalgebraist on how a lot of AI commenters are behaving like futurists of days past</a></li> <li><a href="https://astralcodexten.substack.com/p/i-won-my-three-year-ai-progress-bet">Scott Alexander on the optimistic side of an AI progress bet winning</a></li> <li><a href="https://statmodeling.stat.columbia.edu/2022/10/04/dow-36000-but-in-graph-form/">Andrew Gelman on silly graphs in predictions</a></li> <li><a href="https://rodneybrooks.com/category/dated-predictions/">Rodney Brooks on success to date from taking the pessimistic side on AI progress (he calls this the realistic side but, in this context, I consider that to be a more loaded term)</a></li> <li><a href="https://www.econlib.org/no-one-cared-about-my-spreadsheets/">Bryan Caplan on how no one looked into his quantitative results, despite many comments on whether or not his work was correct</a>. See also <a href="dunning-kruger/">me on the same phenomenon elsewhere</a></li> </ul> <h3 id="appendix-detailed-information-on-predictions">Appendix: detailed information on predictions</h3> <h4 id="ray-kurzweil-1">Ray Kurzweil</h4> <p>4/59 for rated predictions. If you feel that the ones I didn't include but that one could arguably include should count, then 7/62.</p> <p>This list comes <a href="https://web.archive.org/web/20170225013846/https://en.wikipedia.org/wiki/Predictions_made_by_Ray_Kurzweil">from wikipedia's bulleted list of Kurzweil's predictions</a> at the time Peter Diamandis, Kurzweil's co-founder of SingularityU, cited it to bolster the claim that Kurzweil has an 86% prediction accuracy rate. 
Off the top of my head, this misses quite a few predictions that Kurzweil made, such as life expectancy being &quot;over one hundred&quot; by 2019 and 120 by 2029 (predictions made in 1999), life expectancy being unbounded (increasing at one year per year) by 2011 (prediction made in 2001), and a computer beating the top human in chess by 2000 (prediction made in 1990).</p> <p>It's likely that Kurzweil's accuracy rate would change somewhat if we surveyed all of his predictions, but it seems extremely implausible for the rate to hit 86% and, more broadly, looking at Kurzweil's vision of what the world would be like, it also seems impossible to argue that we live in a world that's generally close to Kurzweil's imagined future.</p> <ul> <li>1985 <ul> <li>Voice activated typewriter / speech writer by 1985 (founded a company to build this in 1982) <ul> <li>No. Not true in any meaningful sense. Speech to text with deep learning, circa 2013, was accurate enough that it could be used, with major corrections, on a computer, but it would've been hopeless for a typewriter</li> </ul></li> </ul></li> <li>&quot;Early 2000s&quot; (wikipedia notes that this is listed before 2010 in Kurzweil's chronology, so this should be significantly before 2010 unless the book is very poorly organized) <ul> <li>Translating telephones allow people to speak to each other in different languages. <ul> <li>No. Today, this works poorly and translations are comically bad, but it can sort of work in a &quot;help a tourist get around&quot; sort of way with deep learning; it was basically hopeless in 2010</li> </ul></li> <li>Machines designed to transcribe speech into computer text allow deaf people to understand spoken words. <ul> <li>No. Per above, very poor in 2010</li> </ul></li> <li>Exoskeletal, robotic leg prostheses allow the paraplegic to walk. <ul> <li>No. Maybe some prototype existed, but this still isn't meaningfully deployed in 2022</li> </ul></li> <li>Telephone calls are routinely screened by intelligent answering machines that ask questions to determine the call's nature and priority. <ul> <li>Definitely not in 2010. This arguably exists in 2022, although I think it would be a stretch to call phone trees &quot;intelligent&quot; since they generally get confused if you don't do the keyword matching they're looking for</li> </ul></li> <li>&quot;Cybernetic chauffeurs&quot; can drive cars for humans and can be retrofitted into existing cars. They work by communicating with other vehicles and with sensors embedded along the roads. <ul> <li>No.</li> </ul></li> </ul></li> <li>&quot;Early 21st century&quot; (wikipedia notes that this is listed before 2010 in Kurzweil's chronology, so this should be significantly before 2010 unless the book is very poorly organized) <ul> <li>The classroom is dominated by computers. Intelligent courseware that can tailor itself to each student by recognizing their strengths and weaknesses. Media technology allows students to manipulate and interact with virtual depictions of the systems and personalities they are studying. <ul> <li>No. If you really want to make a stretch argument, you could say this about 2022, but I'd still say no for 2022</li> </ul></li> <li>A small number of highly skilled people dominates the entire production sector. Tailoring of products for individuals is common. <ul> <li>No. 
You could argue that, as written, the 2nd part of this was technically satisfied, but that was really in a trivial way compared to the futurist vision Kurzweil was predicting</li> </ul></li> <li>Drugs are designed and tested in simulations that mimic the human body. <ul> <li>No.</li> </ul></li> <li>Blind people navigate and read text using machines that can visually recognize features of their environment. <ul> <li>Not in 2010. Deep learning unlocked some of this later, though, and continues to improve</li> </ul></li> </ul></li> <li>2010 <ul> <li>PCs are capable of answering queries by accessing information wirelessly via the Internet. <ul> <li>Yes</li> </ul></li> </ul></li> <li>2009 <ul> <li>Most books will be read on screens rather than paper. <ul> <li>No</li> </ul></li> <li>Most text will be created using speech recognition technology. <ul> <li>No</li> </ul></li> <li>Intelligent roads and driverless cars will be in use, mostly on highways. <ul> <li>No</li> </ul></li> <li>People use personal computers the size of rings, pins, credit cards and books. <ul> <li>No. One of these was true (books), but the prediction is an &quot;and&quot; and not an &quot;or&quot;</li> </ul></li> <li>Personal worn computers provide monitoring of body functions, automated identity and directions for navigation. <ul> <li>No. Arguably true with things like a Garmin band some athletes wear around the chest for heart rate, but not true when the whole statement is taken into account or in the spirit of the prediction</li> </ul></li> <li>Cables are disappearing. Computer peripheries use wireless communication. <ul> <li>No. Even in 2022, cables generally haven't come close to disappearing and, unfortunately, wireless peripherals generally work poorly (<a href="https://twitter.com/garybernhardt/status/1122603722683453440">Gary Bernhardt</a>, <a href="https://www.benkuhn.net/wireless/">Ben Kuhn</a>, etc.)</li> </ul></li> <li>People can talk to their computer to give commands. <ul> <li>Yes. I would say this one is actually a &quot;no&quot; in spirit if you look at Kurzweil's futurist vision, but it was technically true that this was possible in 2009, although it worked quite poorly</li> </ul></li> <li>Computer displays built into eyeglasses for augmented reality are used <ul> <li>No. You can argue that someone, somewhere, was using these, but pilots were using head mounted displays in 1999, so it's nonsensical to argue that limited uses like that constitute a successful prediction of the future</li> </ul></li> <li>Computers can recognize their owner's face from a picture or video. <ul> <li>No</li> </ul></li> <li>Three-dimensional chips are commonly used. <ul> <li>No</li> </ul></li> <li>Sound producing speakers are being replaced with very small chip-based devices that can place high resolution sound anywhere in three-dimensional space. <ul> <li>No</li> </ul></li> <li>A $1,000 computer can perform a trillion calculations per second. <ul> <li>Undefined. Technically true, but using peak ops to measure computer performance is generally considered too silly to do by people who know much about computers. In this case, for this to merely be a bad benchmark and not worthless, the kind of calculation would have to be defined.</li> </ul></li> <li>There is increasing interest in massively parallel neural nets, genetic algorithms and other forms of &quot;chaotic&quot; or complexity theory computing. <ul> <li>No. 
There was a huge uptick in interest in neural nets in 2012 due to the &quot;AlexNet&quot; paper, but note that this prediction is an &quot;and&quot; and would've been untrue even in the &quot;or&quot; form in 2009</li> </ul></li> <li>Research has been initiated on reverse engineering the brain through both destructive and non-invasive scans. <ul> <li>Undefined. Very vague and could easily argue this either way</li> </ul></li> <li>Autonomous nanoengineered machines have been demonstrated and include their own computational controls. <ul> <li>Unknown (to me). I don't really care to try to look this one up since the accuracy rate of these predictions is so low that whether or not this one is accurate doesn't matter and I don't know where I'd look this one up</li> </ul></li> </ul></li> <li>2019 <ul> <li>The computational capacity of a $4,000 computing device (in 1999 dollars) is approximately equal to the computational capability of the human brain (20 quadrillion calculations per second). <ul> <li>Undefined. Per above prediction on computational power, raw ops per second is basically meaningless</li> </ul></li> <li>The summed computational powers of all computers is comparable to the total brainpower of the human race. <ul> <li>Undefined. First, you need a non-stupid metric to compare these by</li> </ul></li> <li>Computers are embedded everywhere in the environment (inside of furniture, jewelry, walls, clothing, etc.). <ul> <li>No. There are small computers, but this is arguing they're ubiquitously inside common household items, which they're not</li> </ul></li> <li>People experience 3-D virtual reality through glasses and contact lenses that beam images directly to their retinas (retinal display). Coupled with an auditory source (headphones), users can remotely communicate with other people and access the Internet. <ul> <li>No</li> </ul></li> <li>These special glasses and contact lenses can deliver &quot;augmented reality&quot; and &quot;virtual reality&quot; in three different ways. First, they can project &quot;heads-up-displays&quot; (HUDs) across the user's field of vision, superimposing images that stay in place in the environment regardless of the user's perspective or orientation. Second, virtual objects or people could be rendered in fixed locations by the glasses, so when the user's eyes look elsewhere, the objects appear to stay in their places. Third, the devices could block out the &quot;real&quot; world entirely and fully immerse the user in a virtual reality environment. <ul> <li>No. You need different devices for these use cases, and for the HUD use case, the field of view is small and images do not stay in place regardless of the user's perspective or orientation</li> </ul></li> <li>People communicate with their computers via two-way speech and gestures instead of with keyboards. Furthermore, most of this interaction occurs through computerized assistants with different personalities that the user can select or customize. Dealing with computers thus becomes more and more like dealing with a human being. <ul> <li>No. Some people sometimes do this, but I'd say the &quot;instead&quot; implies that speech and gestures have replaced keyboards, which they have not</li> </ul></li> <li>Most business transactions or information inquiries involve dealing with a simulated person. 
<ul> <li>No</li> </ul></li> <li>Most people own more than one PC, though the concept of what a &quot;computer&quot; is has changed considerably: Computers are no longer limited in design to laptops or CPUs contained in a large box connected to a monitor. Instead, devices with computer capabilities come in all sorts of unexpected shapes and sizes. <ul> <li>No if you literally use the definition of &quot;most people&quot; and consider a PC to be a general purpose computing device (which a smartphone arguably is), but probably yes for people at, say, 90%-ile wealth and above in the U.S. or other high-SES countries</li> </ul></li> <li>Cables connecting computers and peripherals have almost completely disappeared. <ul> <li>No</li> </ul></li> <li>Rotating computer hard drives are no longer used. <ul> <li>No</li> </ul></li> <li>Three-dimensional nanotube lattices are the dominant computing substrate. <ul> <li>No</li> </ul></li> <li>Massively parallel neural nets and genetic algorithms are in wide use. <ul> <li>No. Note the use of &quot;and&quot; here</li> </ul></li> <li>Destructive scans of the brain and noninvasive brain scans have allowed scientists to understand the brain much better. The algorithms that allow the relatively small genetic code of the brain to construct a much more complex organ are being transferred into computer neural nets. <ul> <li>No</li> </ul></li> <li>Pinhead-sized cameras are everywhere. <ul> <li>No</li> </ul></li> <li>Nanotechnology is more capable and is in use for specialized applications, yet it has not yet made it into the mainstream. &quot;Nanoengineered machines&quot; begin to be used in manufacturing. <ul> <li>Unknown (to me). I don't really care to try to look this one up since the accuracy rate of these predictions is so low that whether or not this one is accurate doesn't matter and I don't know where I'd look this one up</li> </ul></li> <li>Thin, lightweight, handheld displays with very high resolutions are the preferred means for viewing documents. The aforementioned computer eyeglasses and contact lenses are also used for this same purpose, and all download the information wirelessly. <ul> <li>No. Ironically, a lot of people prefer things like Kindles for viewing documents, but they're quite low resolution (a 2019 Kindle has a resolution of 800x600); many people still prefer paper for viewing documents for a variety of reasons</li> </ul></li> <li>Computers have made paper books and documents almost completely obsolete. <ul> <li>No</li> </ul></li> <li>Most learning is accomplished through intelligent, adaptive courseware presented by computer-simulated teachers. In the learning process, human adults fill the counselor and mentor roles instead of being academic instructors. These assistants are often not physically present, and help students remotely. Students still learn together and socialize, though this is often done remotely via computers. <ul> <li>No</li> </ul></li> <li>All students have access to computers. <ul> <li>No. True in some places, though.</li> </ul></li> <li>Most human workers spend the majority of their time acquiring new skills and knowledge. <ul> <li>No</li> </ul></li> <li>Blind people wear special glasses that interpret the real world for them through speech. Sighted people also use these glasses to amplify their own abilities. Retinal and neural implants also exist, but are in limited use because they are less useful. 
<ul> <li>No</li> </ul></li> <li>Deaf people use special glasses that convert speech into text or signs, and music into images or tactile sensations. Cochlear and other implants are also widely used. <ul> <li>Yes? I think this is actually a no in terms of whether or not Kurzweil's vision was realized, but these are possible and it isn't the case that no one was using these. I'm bundling the cochlear implant prediction in here because it's so boring. It was arguably already true when the prediction was made in 1999, and reaching the usage rate it did in 2019 basically just required the slow linear growth of the implant rate to continue, i.e., it only required that people not reject the idea of cochlear implants outright and that nothing supersede cochlear implants.</li> </ul></li> <li>People with spinal cord injuries can walk and climb steps using computer-controlled nerve stimulation and exoskeletal robotic walkers. <ul> <li>No</li> </ul></li> <li>Computers are also found inside of some humans in the form of cybernetic implants. These are most commonly used by disabled people to regain normal physical faculties (e.g. Retinal implants allow the blind to see and spinal implants coupled with mechanical legs allow the paralyzed to walk). <ul> <li>No, at least not at the ubiquity implied by Kurzweil's vision</li> </ul></li> <li>Language translating machines are of much higher quality, and are routinely used in conversations. <ul> <li>Yes, but mostly because this prediction is basically meaningless (language translation was of a &quot;much higher quality&quot; in 2019 than in 1999)</li> </ul></li> <li>Effective language technologies (natural language processing, speech recognition, speech synthesis) exist <ul> <li>Yes, although arguable</li> </ul></li> <li>Access to the Internet is completely wireless and provided by wearable or implanted computers. <ul> <li>No</li> </ul></li> <li>People are able to wirelessly access the Internet at all times from almost anywhere <ul> <li>No. This might feel true inside a big city, but is obviously <a href="https://twitter.com/danluu/status/8422681077711544320">untrue even on a road trip that stays on the U.S. interstate highway system</a> and becomes even less true if you drive away from the interstate and less true once again if you go to places that can't be driven to</li> </ul></li> <li>Devices that deliver sensations to the skin surface of their users (e.g. tight body suits and gloves) are also sometimes used in virtual reality to complete the experience. &quot;Virtual sex&quot;—in which two people are able to have sex with each other through virtual reality, or in which a human can have sex with a &quot;simulated&quot; partner that only exists on a computer—becomes a reality. Just as visual- and auditory virtual reality have come of age, haptic technology has fully matured and is completely convincing, yet requires the user to enter a V.R. booth. It is commonly used for computer sex and remote medical examinations. It is the preferred sexual medium since it is safe and enhances the experience. <ul> <li>No</li> </ul></li> <li>Worldwide economic growth has continued. There has not been a global economic collapse. <ul> <li>Yes</li> </ul></li> <li>The vast majority of business interactions occur between humans and simulated retailers, or between a human's virtual personal assistant and a simulated retailer. <ul> <li>No? Depends on what &quot;simulated retailers&quot; means here. 
In conjunction with how Kurzweil talks about simulations, VR, haptic devices that are fully immersive, etc., I'd say this is a &quot;no&quot;</li> </ul></li> <li>Household robots are ubiquitous and reliable <ul> <li>No</li> </ul></li> <li>Computers do most of the vehicle driving—-humans are in fact prohibited from driving on highways unassisted. Furthermore, when humans do take over the wheel, the onboard computer system constantly monitors their actions and takes control whenever the human drives recklessly. As a result, there are very few transportation accidents. <ul> <li>No</li> </ul></li> <li>Most roads now have automated driving systems—networks of monitoring and communication devices that allow computer-controlled automobiles to safely navigate. <ul> <li>No</li> </ul></li> <li>Prototype personal flying vehicles using microflaps exist. They are also primarily computer-controlled. <ul> <li>Unknown (to me). I don't really care to try to look this one up since the accuracy rate of these predictions is so low that whether or not this one is accurate doesn't matter and I don't know where I'd look this one up</li> </ul></li> <li>Humans are beginning to have deep relationships with automated personalities, which hold some advantages over human partners. The depth of some computer personalities convinces some people that they should be accorded more rights <ul> <li>No</li> </ul></li> <li>A growing number of humans believe that their computers and the simulated personalities they interact with are intelligent to the point of human-level consciousness, experts dismiss the possibility that any could pass the Turing Test. Human-robot relationships begin as simulated personalities become more convincing. <ul> <li>No</li> </ul></li> <li>Interaction with virtual personalities becomes a primary interface <ul> <li>No? Depends on what &quot;primary interface&quot; means here, but I think not given Kurzweil's overall vision</li> </ul></li> <li>Public places and workplaces are ubiquitously monitored to prevent violence and all actions are recorded permanently. Personal privacy is a major political issue, and some people protect themselves with unbreakable computer codes. <ul> <li>No. True of some public spaces in some countries, but untrue as stated.</li> </ul></li> <li>The basic needs of the underclass are met <ul> <li>No. Not even true when looking at some high-SES countries, like the U.S., let alone the entire world</li> </ul></li> <li>Virtual artists—creative computers capable of making their own art and music—emerge in all fields of the arts. <ul> <li>No. Maybe arguably technically true, but I think not even close in spirit in 2019</li> </ul></li> </ul></li> </ul> <p>The list above only uses the bulleted predictions from Wikipedia under the section that has per-timeframe sections. If you pull in other ones from the same page that could be evaluated, which includes predictions like &quot; &quot;nanotechnology-based&quot; flying cars would be available [by 2026]&quot;, this doesn't hugely change the accuracy rate (and actually can't due to the relatively small number of other predictions).</p> <h4 id="jacque-fresco-1">Jacque Fresco</h4> <p>The foreword to Fresco's book gives a pretty good idea of what to expect from Fresco's predictions:</p> <blockquote> <p>Looking forward is an imaginative and fascinating book in which the authors take you on a journey into the culture and technology of the twenty-first century. 
After an introductory section that discusses the &quot;Things that Shape Your Future,&quot; you will explore the whys and wherefores of the unfamiliar, alarming, but exciting world of a hundred years from now. You will see this society through the eyes of Scott and Hella, a couple of the next century. Their living quarters are equipped with a cybernator, a seemingly magical computer device, but one that is based on scientific principles now known. It regulates sleeping hours, communications throughout the world, an incredible underwater living complex, and even the daily caloric intake of the &quot;young&quot; couple. (They are in their forties but can expect to live 200 years.) The world that Scott and Hella live in is a world that has achieved full weather control, has developed a finger-sized computer that is implanted in the brain of every baby at birth (and the babies are scientifically incubated; the women of the twenty-first century need not go through the pains of childbirth), and that has perfected genetic manipulation that allows the human race to be improved by means of science. Economically, the world is Utopian by our standards. Jobs, wages, and money have long since been phased out. Nothing has a price tag, and personal possessions are not needed. Nationalism has been surpassed, and total disarmament has been achieved; educational technology has made schools and teachers obsolete. The children learn by doing, and are independent in this friendly world by the time they are five.</p> <p>The chief source of this greater society is the Correlation Center, &quot;Corcen,&quot; a gigantic complex of computers that serves but never enslaves mankind. Corcen regulates production, communication, transportation and all other burdensome and monotonous tasks of the past. This frees men and women to achieve creative challenging experiences rather than empty lives of meaningless leisure. Obviously this book is speculative, but it is soundly based upon scientific developments that are now known.</p> </blockquote> <p>As mentioned above, Fresco makes the claim that it's possible to predict the future and that, to do so, one should start with the values people will have in the future. Many predictions are about &quot;the 21st century&quot;, so they can arguably be defended as still potentially accurate, although, given the way the book talks about the stark divide between &quot;the 20th century&quot; and &quot;the 21st century&quot;, we should have already seen the changes mentioned in the book since we're no longer in &quot;the 20th century&quot; and the book makes no reference to a long period of transition in between. Fresco does make some specific statements about things that will happen by particular dates, which are covered later. For &quot;the 21st century&quot;, his predictions from the first section of his book are:</p> <ul> <li>There will be no need for laws, such as a law against murder, because humans will no longer do things like murder (which only happen &quot;today&quot; because &quot;our sick society&quot; conditions people to commit depraved acts) <ul> <li>&quot;Today we are beginning to identify various things which condition us to act as we do. 
In the future the factors that condition human beings to kill or do other things that harm fellow human beings will be understood and eliminated&quot; <ul> <li>The entire section is very behaviorist and assumes that we'll be able to operant-condition people out of all bad behaviors</li> </ul></li> </ul></li> <li>Increased understanding of human nature will lead to <ul> <li>Total freedom, including no individual desire for conformity</li> <li>Total economic abundance, which will lead to the end of &quot;competitiveness, acquisitiveness, thriftiness&quot;, etc.</li> <li>Total freedom from disease</li> <li>Deeper feelings of love and friendship to an extent that can not be understood by those who live in the twentieth-century world of scarcity</li> <li>Total lack of guilt about sex</li> <li>Appreciation of all kinds of natural beauty, as opposed to &quot;the narrow standards of the 'beauty queen' mentality of today&quot;, as well as eschewing any kind of artificial beauty</li> <li>Complete self-knowledge and a lack of any repression, which will &quot;produce a new dimension of relaxed living that is almost unknown today&quot;</li> <li>Elevation of the valuing of others to the same level at which people value themselves or their local communities, i.e., complete selflessness and an end to anything resembling tribalism or nationalism</li> <li>All people will be &quot;multidimensional&quot; and sort of good at everything</li> <li>This is contrasted with the twentieth century: &quot;For the first time all men and women will live a multidimensional life, limited only by their imagination. In the twentieth century we could classify people by saying, &quot;He is good in sports. She is an intellectual. He is an artist.&quot; In the future all people will have the time and the facilities to accept the fantastic variety of challenges that life offers them&quot;</li> </ul></li> </ul> <p>As mentioned above, the next part of Fresco's prediction is about how science will work. He writes about how &quot;the scientific method&quot; was only applied in a limited fashion, which led to thousands of years of slow progress. But, unlike in the 20th century, in the 21st century, people will be free from bias and apply &quot;the scientific method&quot; in all areas of their life, not just when doing science. People will be fully open to experimentation in all aspects of life and all people will have &quot;a habitual open-mindedness coupled with a rigid insistence that all problems be formulated in a way that permits factual checking&quot;.</p> <p>This will, among other things, lead to complete self-knowledge of one's own limitations for all people as well as an end to unhappiness due to suboptimal political and social structures:</p> <blockquote> <p>The success of the method of science in solving almost every problem put to it will give individuals in the twenty-first century a deep confidence in its effectiveness. They will not be afraid to experiment with new ways of feeling, thinking, and acting, for they will have observed the self-corrective aspect of science. Science gives us the latest word, not the last word. They will know that if they try something new in personal or social life, the happiness it yields can be determined after sufficient experience has accumulated. They will adapt to changes in a relaxed way as they zigzag toward the achievement of their values. They will know that there are better ways of doing things than have been used in the past, and they will be determined to experiment until they have found them. 
They will know that most of the unhappiness of human beings in the mid-twentieth century was not due to the lack of shiny new gadgets; it was due, in part, to not using the scientific method to check out new political and social structures that could have yielded greater happiness for them</p> </blockquote> <p>After discussing, at a high level, the implications for people and society, Fresco gets into specifics, saying that doing everything with computers, what he calls a &quot;cybernated&quot; society, could be achieved by 1979, giving everyone a post-tax income of $100k/yr in 1969 dollars (about $800k/yr in 2022 dollars):</p> <blockquote> <p>How would you like to have a guaranteed life income of $100,000 per year—with no taxes? And how would you like to earn this income by working a three-hour day, one day per week, for a five-year period of your life, providing you have a six-months vacation each year? Sound fantastic? Not at all with modern technology. This is not twenty-first-century pie-in-the-sky. It could probably be achieved in ten years in the United States if we applied everything we now know about automation and computers to produce a cybernated society. It probably won't be done this rapidly, for it would take some modern thinking applied in an intelligent crash program. Such a crash program was launched to develop the atomic bomb in a little over four years.</p> </blockquote> <p>Other predictions about &quot;cybernation&quot;:</p> <ul> <li>Manufacturing will be fully automated, to the point that people need to do no more than turn on the factory to have everything run (and maintain itself) <ul> <li>This will lead to &quot;maximum efficiency&quot;</li> </ul></li> <li>Since there will be no need for human labor, the price of items like t-shirts will be so low that they'll effectively be free; there's no need for items to cost anything once the element of human labor is removed</li> <li>The elimination of human labor will lead to a life of leisure for everyone</li> <li>Fresco notes that his previous figure of $100k/yr (1969 dollars) is meaningless and could just as easily be $1M/yr (1969 dollars) since everything will be free</li> <li>A &quot;cybernetically&quot; manufactured item produced anywhere on earth will be able to be delivered anywhere on earth within 24 hours</li> </ul> <h4 id="michio-kaku-1">Michio Kaku</h4> <ul> <li>By 2005 <ul> <li>&quot;The complete human genome will be decoded by the year 2005, giving us an “owner’s manual” for a human being&quot; <ul> <li>Half credit. Actually, technically no, as <a href="https://en.wikipedia.org/wiki/Human_Genome_Project">the Human Genome Project was declared complete in 2003, but had only decoded 85% of the genome. 
Actually decoding the human genome took until January 2022</a>; I'll give this half credit since many people would argue that the declared completion of the Human Genome Project should mean this prediction was correct</li> </ul></li> </ul></li> <li>&quot;During the 21st century&quot; <abbr title="not scored because the timeline for these hasn't passed, although it's implied that these should already be happening to some extent, so these could be scored based on whether or not you think they're happening">implied to not be something that happens at the very end, but something that's happening throughout</abbr> <ul> <li>&quot;it will be difficult to be a research scientist in the future without having some working knowledge of [quantum mechanics, computer science, and biology]&quot; due to increasing &quot;synergy&quot; and &quot;cross-fertilization&quot; between these fundamental fields</li> <li>Silicon computer chips will hit a roadblock that will be unlocked via DNA research allowing for computation on organic molecules</li> <li>Increased pace of scientific progress due to &quot;intense synergy&quot;</li> </ul></li> <li>In 2020 <ul> <li>Commodity prices down 60% (from 1997 prices) due to wealth becoming based on knowledge, trade being global, and markets being linked electronically, continuing a long-term trend of reduced commodity prices <ul> <li>No. The CRB commodity price index was up in 2020 compared to 1997 and is up further in 2022</li> </ul></li> <li>Microprocessors as cheap as &quot;scrap paper&quot; due to Moore's law scaling continuing with no speedbump until 2020 (10 cents in 2000, 1 cent in 2010, 1/10th of a cent in 2020) <ul> <li>No. The Moore's law scaling curve changed and microprocessors did not, in general, cost 1 cent in 2010 or 1/10th of a cent in 2020</li> </ul></li> <li>The above &quot;will give us smart homes, cars, TVs, clothes, jewelry, and money&quot; <ul> <li>No, due to the &quot;and&quot; and comments implying total ubiquity, but actually a fairly good directional prediction</li> </ul></li> <li>&quot;We will speak to our appliances, and they will speak back&quot; <ul> <li>No, due to the implied ubiquity here, but again directionally pretty good</li> </ul></li> <li>&quot;the Internet will wire up the entire planet and evolve into a membrane consisting of millions of computer networks, creating an “intelligent planet.”&quot; <ul> <li>No, due to &quot;intelligent planet&quot;</li> </ul></li> <li>Moore's law / silicon scaling will continue until 2020, at which point &quot;quantum effects will necessarily dominate and the fabled Age of Silicon will end&quot; <ul> <li>No</li> </ul></li> <li>Advances in DNA sequencing will continue until roughly 2020 (before it stops); &quot;literally thousands of organisms will have their complete DNA code unraveled&quot; <ul> <li>Maybe? Not sure if this was hundreds or thousands; also, the lack of complete sequencing of the human genome when the Human Genome Project was &quot;complete&quot; may also have some analogue here? 
I didn't score this one because I don't have the background for it</li> </ul></li> <li>&quot;it may be possible for anyone on earth to have their personal DNA code stored on a CD&quot; <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>&quot;Many genetic diseases will be eliminated by injecting people’s cells with the correct gene.&quot; <ul> <li>No</li> </ul></li> <li>&quot;Because cancer is now being revealed to be a series of genetic mutations, large classes of cancers may be curable at last, without invasive surgery or chemotherapy&quot; <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>In or near 2020, bottlenecks in DNA sequencing will stop the progress of DNA sequencing <ul> <li>No</li> </ul></li> <li>In or near 2020, bottlenecks in silicon will stop advances in computer performance <ul> <li>No; computer performance improvements slowed long before 2020 and then didn't stop in 2020</li> </ul></li> <li>The combination of the two above will (after 2020) require optical computers, molecular computers, DNA computers, and quantum computers for progress to advance in biology and computer science <ul> <li>No. Maybe some of these things will be critical in the future, but they're not necessary conditions for advancements in computing and biology in or around 2020</li> </ul></li> <li>Focus of biology will shift from sequencing DNA to understanding the functions of genes <ul> <li>I'm not qualified to judge this one</li> </ul></li> <li>something something may prove the key to solving key diseases <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>[many predictions based around the previous prediction that microprocessors would be as cheap as scrap paper, 1/10th of a cent or less, that also ignore the cost of everything around the processor] <ul> <li>No; collapsing these into one bullet reduces the number of incorrect predictions counted, but that shouldn't make too much difference in this case</li> </ul></li> <li>A variety of non-falsifiable &quot;may&quot; predictions about self-driving car progress by 2010 and 2020</li> <li>VR will be &quot;an integral part of the world&quot; <ul> <li>No</li> </ul></li> <li>People will use full-body suits and electric-field sensors <ul> <li>No</li> </ul></li> <li>Exploring simulations in virtual reality will be a critical part of how science proceeds <ul> <li>No</li> </ul></li> <li>A lot of predictions about how computers &quot;may&quot; be critical to a variety of fields <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>Semiconductor lithography below .1 um (100 nm) will need to switch from UV to X-rays or electrons <ul> <li>No; modern 5nm processes use EUV</li> </ul></li> <li>Some more &quot;may&quot; and &quot;likely&quot; non-falsifiable predictions</li> </ul></li> </ul> <p>That gives a correct prediction rate of 3%. I stopped reading at this point, so I may have missed a number of correct predictions. But, even if the rest of the book was full of correct predictions, the correct prediction rate is likely to be low.</p> <p>There were also a variety of predictions that I didn't include because they were statements that were true in the present. 
For example:</p> <blockquote> <p>If the dirt road of the Internet is made up of copper wires, then the paved information highway will probably be made of laser fiber optics. Lasers are the perfect quantum device, an instrument which creates beams of coherent light (light beams which vibrate in exact synchronization with each other). This exotic form of light, which does not occur naturally in the universe, is made possible by manipulating the electrons making quantum jumps between orbits within an atom</p> </blockquote> <p>This doesn't seem like much of a prediction since, when the book was written, the &quot;information highway&quot; already used a lot of fiber. Throughout the book, there's a lot of mysticism around quantum-ness, which is, for example, on display above and cited as a reason that microprocessors will become obsolete by 2020 (they're not &quot;quantum&quot;) and fiber optics won't (it's quantum).</p> <h4 id="john-naisbitt-1">John Naisbitt</h4> <p>Here are a few quotes that get at the methodology of Naisbitt's hit book, Megatrends:</p> <blockquote> <p>For the past fifteen years, I have been working with major American corporations to try to understand what is really happening in the United States by monitoring local events and behavior, because collectively what is going on locally is what is going on in America.</p> <p>Despite the conceits of New York and Washington, almost nothing starts there.</p> <p>In the course of my work, I have been overwhelmingly impressed with the extent to which America is a bottom-up society, that is, where new trends and ideas begin in cities and local communities—for example, Tampa, Hartford, San Diego, Seattle, and Denver, not New York City or Washington, D.C. My colleagues and I have studied this great country by reading its local newspapers. We have discovered that trends are generated from the bottom up, fads from the top down. The findings in this book are based on an analysis of more than 2 million local articles about local events in the cities and towns of this country during a twelve-year period.</p> <p>Out of such highly localized data bases, I have watched the general outlines of a new society slowly emerge.</p> <p>We learn about this society through a method called content analysis, which has its roots in World War II. During that war, intelligence experts sought to find a method for obtaining the kinds of information on enemy nations that public opinion polls would have normally provided.</p> <p>Under the leadership of Paul Lazarsfeld and Harold Lasswell, later to become well-known communication theorists, it was decided that we would do an analysis of the content of the German newspapers, which we could get—although some days after publication. The strain on Germany's people, industry, and economy began to show up in its newspapers, even though information about the country's supplies, production, transportation, and food situation remained secret. Over time, it was possible to piece together what was going on in Germany and to figure out whether conditions were improving or deteriorating by carefully tracking local stories about factory openings, closings, and production targets, about train arrivals, departures, and delays, and so on. ... Although this method of monitoring public behavior and events continues to be the choice of the intelligence community—the United States annually spends millions of dollars in newspaper content analysis in various parts of the world—it has rarely been applied commercially. In fact, 
The Naisbitt Group is the first, and presently the only, organization to utilize this approach in analyzing our society.</p> <p>Why are we so confident that content analysis is an effective way to monitor social change? Simply stated, because the news hole in a newspaper is a closed system. For economic reasons, the amount of space devoted to news in a newspaper does not change significantly over time. So, when something new is introduced, something else or a combination of things must be omitted. You cannot add unless you subtract. It is the principle of forced choice in a closed system.</p> <p>In this forced-choice situation, societies add new preoccupations and forget old ones. In keeping track of the ones that are added and the ones that are given up, we are in a sense measuring the changing share of the market that competing societal concerns command.</p> <p>Evidently, societies are like human beings. A person can keep only so many problems and concerns in his or her head or heart at any one time. If new problems or concerns are introduced, some existing ones are given up. All of this is reflected in the collective news hole that becomes a mechanical representation of society sorting out its priorities.</p> </blockquote> <p>Naisbitt rarely makes falsifiable predictions. For example, on the &quot;information society&quot;, Naisbitt says</p> <blockquote> <p>In our new information society, the time orientation is to the future. This is one of the reasons we are so interested in it. We must now learn from the present how to anticipate the future. When we can do that, we will understand that a trend is not destiny; we will be able to learn from the future the way we have been learning from the past.</p> <p>This change in time orientation accounts for the growing popular and professional interest in the future during the 1970s. For example, the number of universities offering some type of futures-oriented degree has increased from 2 in 1969 to over 45 in 1978. Membership in the World Future Society grew from 200 in 1967 to well over 30,000 in 1982, and the number of popular and professional periodicals devoted to understanding or studying the future has dramatically increased from 12 in 1965 to more than 122 in 1978.</p> </blockquote> <p>This could be summed up as &quot;in the future, people will think more about the future&quot;. Pretty much any case one might make that Naisbitt's claims ended up being true or false could be argued against.</p> <p>In the chapter on the &quot;information society&quot;, one of the most specific predictions is</p> <blockquote> <p>New information technologies will at first be applied to old industrial tasks, then, gradually, give birth to new activities, processes, and products.</p> </blockquote> <p>I'd say that this is false in the general case, but it's vague enough that you could argue it's true.</p> <p>A rare, falsifiable comment is this prediction about the price of computers</p> <blockquote> <p>The home computer explosion is upon us, soon to be followed by a software implosion to fuel it. It is projected that by the year 2000, the cost of a home computer system (computer, printer, monitor, modem, and so forth) should only be about that of the present telephone-radio-recorder-television system.</p> </blockquote> <p>From a quick search, it seems that reference devices cost something like $300 in 1982?
That would be $535 in 2000, which wasn't really a reasonable price for a computer as well as the peripherals mentioned and implied by &quot;and so forth&quot;.</p> <h3 id="gerard-k-o-neill-1">Gerard K. O'Neill</h3> <p>We discussed O'Neill's predictions on space colonization in the body of this post. This section contains a bit on his other predictions.</p> <p>On computers, O'Neill says that in 2081 &quot;any major central computer will have rapid access to at least a hundred million million words of memory (the number '1' followed by 14 zeros). A computer of that memory will be no larger than a suitcase. It will be fast enough to carry out a complete operation in no more time than it takes light to travel from this page to your eye, and perhaps a tenth of that time&quot;, which is saying that a machine will have 100TWords of RAM or, to round things up simply, let's say 1PB of RAM and a clock speed of something between 300 MHz and 6 GHz, depending on how far away from your face you hold a book.</p>
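<p>To spell out the arithmetic behind those numbers (my own back-of-the-envelope; O'Neill gives neither a word size nor a reading distance, so both are assumed here):</p> <pre><code># O'Neill's machine: 10**14 words of memory and one operation in the time
# light takes to travel from a page to your eye (or a tenth of that time).
words = 10**14
bytes_per_word = 8             # assumed 64-bit words; O'Neill doesn't give a word size
print(words * bytes_per_word)  # 8e14 bytes, which rounds up to roughly 1 PB

c = 3e8                        # speed of light in m/s
for meters in (1.0, 0.5):      # plausible page-to-eye distances
    ns_per_op = meters / c * 1e9
    # operations per nanosecond is GHz: roughly 0.3-0.6 GHz, or 3-6 GHz at a tenth of the time
    print(1 / ns_per_op, 10 / ns_per_op)
</code></pre>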
<p>On other topics, O'Neill predicts we'll have fully automated manufacturing, people will use 6 times as much energy per capita in 2081 as in 1980, pollution other than carbon dioxide will be a solved problem, coal plants will still be used, most (50% to 95%) of energy will be renewable (with the caveat that &quot;ground-based solar&quot; is a &quot;myth&quot; that can never work, and that wind, tide, and hydro are all forms of solar that, even with geothermal thrown in, can't reasonably provide enough energy), and that solar power from satellites is the answer to then-current and future energy needs.</p> <p>In The Technology Edge, O'Neill makes predictions for the 10 years following the book's publication in 1983. O'Neill says &quot;the book is primarily based on interviews with chief executives&quot;. It was written at a time when many Americans were concerned about the impending Japanese dominance of the world. O'Neill says</p> <blockquote> <p>As an American, I cannot help being angry — not at the Japanese for succeeding, but at the forces of timidity, shortsightedness, greed, laziness and misdirection here in America that have mired us down so badly in recent years, sapped our strength and kept us from equal achievements.</p> <p>As we will see, opportunities exist now for the opening of whole new industries that can become even greater than those we have lost to the Japanese. Are we to delay and lose those too?</p> </blockquote> <p>In an interview about the book, O'Neill's warning was summarized as</p> <blockquote> <p>microengineering, robotics, genetic engineering, magnetic flight, family aircraft, and space science. If the U.S. does not compete successfully in these areas, he warns, it will lose the technological and economic leadership it has enjoyed.</p> </blockquote> <p>This seems like a big miss with both serious false positives as well as false negatives. O'Neill failed to cite industries that ended up being important to the then-continued U.S. dominance of the world economy, e.g., software, and also predicted that space and flight were much more important than they turned out to be.</p> <p>On the specific mechanism, O'Neill also generally misses, e.g., in the book, O'Neill cites the lack of U.S. PhD production and people heading directly into industry as a reason the U.S. was falling behind and would continue to fall behind Japan, but in a number of important industries, like software, a lot of the major economic/business contributions have been made by people going to industry without a PhD. The U.S. didn't need to massively increase PhD production in the decades following 1983 to stay economically competitive.</p> <p>There's quite a bit of text dedicated to a commonly discussed phenomenon at the time, how Japanese companies were going to wipe the floor with American and European companies because they were able to make and execute long-term plans, unlike American companies. I'll admit that it's a bit of a mystery to me how well short-term thinking has worked for American companies, at least to date, and I would've expected otherwise.</p> <h4 id="patrick-dixon-1">Patrick Dixon</h4> <p>Dixon opens with:</p> <blockquote> <p>The next millennium will witness the greatest challenges to human survival ever in human history, and many of them will face us in the early years of its first century ...</p> <p>The future has six faces, each of which will have a dramatic effect on all of us in the third millennium ... [Fast, Urban, Radical, Universal, Tribal, Ethical, which spells out FUTURE]</p> <p>Out of these six faces cascade over 500 key expectations, specific predictions as logical workings-out of these important global trends. These range from inevitable to high probability to lower probability — but still significant enough to require strategic planning and personal preparation.</p> </blockquote> <ul> <li>In the third millennium, things reminiscent of the previous millennium will be outdated by [variously, 2004, 2005, 2020, 2025], e.g., &quot;the real winners will be those who tap into this huge shift — and help define it. What television producer will want to produce second millennial TV? What clothes designer dare risk his annual collection being labeled as a rehash of tired late twentieth-century fashions? ...&quot; <ul> <li>No, late 20th century fashion is very &quot;in&quot; right now and other 20th century fashions were &quot;in&quot; a decade ago</li> </ul></li> <li>&quot;Pre-millennialists tend to see 2000 to 2010 as just another decade. The trends of the eighties and nineties continue, just more of the same. Post-millennialists are very different. They are products of the third millennium. They live in it. They are twenty-first century people, a new age. Expect to see one of the greatest generation gaps in recent history&quot; <ul> <li>Subjective, but no. Dixon assigns huge importance to the millennium counter turning over and says things like &quot;Few people have woken up so far to the impact of the millennium. My children are the M generation. Their entire adult existence will be lived in the third millennium ... Expect to see the M factor affect every aspect of life on earth ... The human brain makes sense of the past by dividing it into intervals: the day... month... year. Then there are decades and centuries ... And four time-events are about to hit us in the same instant.
New year, decade, century, and millennium&quot;, but the counter turning over doesn't appear to have caused any particularly drastic changes.</li> </ul></li> <li>&quot;Expect to see millennial culture clashes between opposing trends, a world increasingly of extremes with tendencies to intolerance as groups fight to dominate the future&quot; <ul> <li>Basically yes, although his stated reasoning (not quoted) as to why this should happen at the turn of the century (as opposed to any other time) is nonsensical since it applies equally to all of history.</li> </ul></li> <li>Market dominance / power will become less important as &quot;micromarkets&quot; become more important <ul> <li>No; the bit about smaller markets existing was correct, but huge players, the big $1T companies of what Dixon calls &quot;the third millennium&quot;, Apple, Microsoft, Google, and Amazon, have a huge amount of power over these markets, which has not reduced either the economic or cultural importance of what Dixon calls &quot;dominance&quot;</li> </ul></li> <li>Expect more &quot;wild cards&quot; over &quot;the next 20 years&quot; [from 1998 to 2018], such as &quot;war, nuclear accident or the unplanned launch of nuclear weapons, vast volcanic eruptions or plagues or even a comet collision with enormous destructive power&quot; <ul> <li>No; this would've sounded much better if the window had included covid, but if we look at the 20 years prior to the book being published, there was the fall of the Soviet Union, Tiananmen Square, etc., which isn't obviously less &quot;wild card-y&quot; than what we saw from 1998 to 2018</li> </ul></li> <li>Less emphasis on economic growth, due to increased understanding that wealth doesn't make people happy <ul> <li>No; Dixon was writing not too long after peak &quot;growth is unsustainable and should be deliberately curtailed to benefit humanity&quot;</li> </ul></li> </ul> <p>That's the end of the introduction. Some of these predictions are arguably too early to call since, in places, Dixon writes as if Futurewise is about the entire &quot;third millennium&quot;, but Dixon also notes that drastic changes are expected in the first years and decades of the 21st century and these generally have not come to pass, both in the specific cases where Dixon calls out particular timelines and in the cases where he doesn't name a particular timeline. In general, I'm trying to only include predictions where it seems that Dixon is referring to the 2022 timeframe or before, but his general vagueness makes it difficult to make the right call 100% of the time.</p> <p>The next chapter is titled &quot;Fast&quot; and is about the first of the six &quot;faces&quot; of the future.</p> <ul> <li>&quot;Expect further rapid realignments [analogous to the fall of the Soviet Union], with North Korea at the top of the list as the last outpost of Stalinism ...
North Korea could crash at any moment, spilling thousands of starving refugees into China, South Korea, and Japan&quot; <ul> <li>No; there's been significant political upheaval in many places (Thailand, Arab Spring, Sudan, etc.); North Korea hasn't been in the top 10 political upheavals list, let alone at the top of the list</li> </ul></li> <li>&quot;Expect increasing North-South tension as emerging economies come to realize that abolishing all trade and currency restrictions in a rush for growth also places their countries at the mercy of rumors, hunches, and market opinion&quot; <ul> <li>No to there being a particular increase in North-South tension</li> </ul></li> <li>&quot;Expect a growing backlash against globalisation, with some nations reduced to &quot;economic slavery&quot; by massive, destabilising, currency flows <ul> <li>No, due to the second part of this sentence, although highly subjective</li> </ul></li> <li>[A bunch of unscored predictions that are gimmes about vague things continuing to happen, such as &quot;expect large institutions to continue to make (and lose) huge fortunes trying to outguess volatile markets in these countries&quot;] <ul> <li>On the example prediction, that's quite vague and could be argued either way on the spirit of the prediction, but is very easy to satisfy as stated since it only requires (for example) two hedge funds to make major bets on volatility that either win or lose; there's a list of similar &quot;predictions&quot; that seem extremely easy to satisfy as written that I'm not going to include</li> </ul></li> <li>&quot;Expect increasingly complex investment instruments to be developed, so that a commodity [from the context, this is clearly referring to actual commodities markets and not things like mortgages] sometimes rises or falls dramatically as a large market intervention is made, linked to a completely different and apparently unrelated event <ul> <li>Yes, although this trend was definitely already happening and well-known when Dixon wrote his book, making this a very boring prediction</li> </ul></li> <li>&quot;Management theory is still immature ... expect that to change over the next two decades as rigorous statistical and analytical tools are devised to prove or disprove the key elements of success in management methods&quot; <ul> <li>No; drastically underestimates the difficulty of rigorously quantifying the impact of different management methods in a way that only someone who hasn't done <a href="https://twitter.com/danluu/status/1547661623770329088">serious data analysis</a> would do</li> </ul></li> <li>[seem to have lost a line here; sorry!] <ul> <li>Yes, although this statement would be more compelling with less stated detail</li> </ul></li> <li>&quot;Expect 'management historians' to become sought after, analyzing industrial successes and failures during the previous Industrial Revolution and at the turn of the twentieth century <ul> <li>No; some people do this kind of work, but they're not particularly sought after. The context of the statement implies they'd be sought after by CEOs or other people trying to understand how to run actual businesses, which is generally not the case</li> </ul></li> <li>&quot;Expect consumer surveys and market research to be sidelined by futurology-based customer profiles. Market research only tells you what people want today.
What's so smart about that...&quot; <ul> <li>No; not that people don't try to predict trends, but the context for this prediction incorrectly implies that market research is trivial &quot;anyone can go out and ask the same questions, so where's the real competitive edge?&quot;, that in the computerized world, brands are irrelevant, etc., all of which are incorrect, and of course the simple statement that market research and present-day measurement are obsolete is simply wrong.</li> </ul></li> <li>Flat-rate global &quot;calls&quot; with no long-distance charges <ul> <li>Yes as written since you can call people anywhere with quite a few apps, so I'll give Dixon this one, although the context implies that his reasoning was totally incorrect. For one thing, he seems to be talking about phone calls and thinks traditional phone calls will be important, but he also makes some incorrect statements about telecom cost structures, such as &quot;measuring the time and distance of every call is so expensive as a proportion of total call costs&quot; (which was predicted to happen because the cost of calls themselves would fall, causing the cost of metadata tracking of calls to dominate the cost of the calls themselves; even if that came to pass, the cost of tracking how long a call was and where the call was to would be tiny and, in fact, my phone bill still tracks this information even though I'm not charged for it because the cost is so small that it would be absurd not to track it, other than for privacy reasons)</li> </ul></li> <li>&quot;Expect most households in wealthy nations to have several phone numbers by 2005 ... this means that most executives will have access to far more telephone lines at home than they do at work today for their personal use&quot; <ul> <li>No; there's a way to read this as some kind of prediction that was correct, but from the context, Dixon is clearly talking about people having a lot of phone numbers and phone lines and makes a statement elsewhere that implies explosive growth in the number of landline phone numbers and lines people will have at home</li> </ul></li> <li>Mobile phones used in most places landline phones are used today <ul> <li>Yes; basically totally on the nose, although he has a story about a predicted future situation that isn't right due to some incorrect guesses about how interfaces would play out</li> </ul></li> <li>Many emerging economies will go straight to mobile and leapfrog existing technology <ul> <li>Yes</li> </ul></li> <li>Ubiquitous use of satellite phones by traveling execs / very important people by 2005 <ul> <li>No; many execs, VPs, etc., were still impacted by incomplete cell coverage and didn't have sat phones in 2005</li> </ul></li> <li>&quot;The next decade&quot; [by 2008], cell phones will seamlessly switch to satellite coverage when necessary <ul> <li>No</li> </ul></li> <li>Phone trees will have switched from &quot;much hated push-button systems to voice recognition&quot; by 2002, with seamless basically perfect recognition by 2005 <ul> <li>No; these systems are now commonplace in 2022, but many people I know find them to be significantly worse than push-button systems</li> </ul></li> <li>Computational power per &quot;PC&quot; will continue to double every 18 months indefinitely [there's a statement that implies this will continue at least through 2018, but there's no implication that this will end or level off at any time after that] <ul> <li>No; even at the time, people had already observed that performance scaling was moving to a slower growth
curve</li> </ul></li> <li>Future small displays will be able to be magnified <ul> <li>No, or not yet anyway (if the prediction means that software zoom will be possible, that was possible and even built into operating systems well before the book was published, so that's not really a prediction about the future)</li> </ul></li> <li>&quot;Paper-thin display sheets by 2005&quot; <ul> <li>No</li> </ul></li> <li>Projection displays will be in common use, replacing many uses of CRTs <ul> <li>No; projectors are used today, but in many of the same applications they were used in at the time the book was written</li> </ul></li> <li>Many CRT use cases will be replaced by lasers projected onto the retina <ul> <li>No, or not yet anyway; even if this happens at some point, I would rate this as a no since this section was about what would kill the CRT and this technology was not instrumental in killing the CRT</li> </ul></li> <li>Digital cameras rival film cameras in terms of image quality by 2020 <ul> <li>Yes; technically yes as written, but the way this is written implies that digital cameras will just have caught up to film cameras in 2020 when this happened quite a long time ago, so I'd say that Dixon was wrong but made this prediction vague enough that it just happens to be correct as written</li> </ul></li> <li>For consumer use, digital cameras replace 35mm film by 2010 <ul> <li>Yes; but same issue as above where Dixon really underestimated how quickly digital cameras would improve</li> </ul></li> <li>&quot;Ultra high definition TV cameras&quot; replace film &quot;in most situations&quot; by 2005 <ul> <li>Yes</li> </ul></li> <li>Software will always be buggy because new chips will be released at a pace that means that programmers can't keep up with bug fixes because they need to re-write the software for new chips. <ul> <li>Yes, although the reason was completely wrong. It's obviously true that software bugginess will continue for quite some time, but not for the reason Dixon gives. I'm going to include more of Dixon's text here since a lot of readers are programmers who will have opinions on why computers are buggy and will be able to directly evaluate Dixon's reasoning with no additional context: &quot;Software will always be full of bugs. Desktop computers today are so powerful that even if technology stands still it will take the world's programmers at least 20 years to exploit their capability to the full. The trouble is that they have less than 20 months – because by then a new generation of machines will be around ... So brand new code was written for Pentium chips. The bugs were never sorted out in the old versions and bugs in the new ones will never be either, for the same reason&quot;. <ul> <li>Dixon's reasoning as to why software is buggy is completely wrong. It is not because Intel releases a new chip and programmers have to abandon their old code and write code for the new chip.
This level of incorrectness of reasoning generally holds for Dixon's comments even when he really nails a prediction and doesn't include some kind of &quot;because&quot; that invalidates the prediction</li> </ul></li> </ul></li> <li>Computer disaster recovery will become more important, resulting in lawsuits against backup companies being a major feature of the next century <ul> <li>No; not that there aren't any lawsuits, but lawsuits over backup data loss aren't a major feature of this century</li> </ul></li> <li>Home workers will be vulnerable to data loss, will eventually &quot;back up data on-line to computers in other cities as the ultimate security&quot; <ul> <li>Yes, although the reasoning here was incorrect. Dixon concluded this due to the ratio of hard disk sizes (&gt;= 2GB) to floppy disk sizes (&lt;= 2 MB), which caused him to conclude that local backups are impossible (would take more than 1000 floppy disks), but even at the time Dixon was writing, cheap, large, portable disks were available (zip drives, etc.) and tape backups were possible</li> </ul></li> <li>Much greater expenditure on anti virus software, with &quot;monthly updates&quot; of antivirus software, and anti virus companies creating viruses to force people to buy anti virus software <ul> <li>No; MS basically obsoleted commercial anti virus software for what was, by far, the largest platform where users bought anti virus software by providing it for free with Windows; corp spend on anti virus software is still signifcant and increases as corps own more computers, but consumer spend dropping drastically seems opposed to what Dixon was predicting</li> </ul></li> <li>New free zones or semi-states will be created to bypass online sales tax and countries will retaliate against ISPs that provide content served from these tax havens <ul> <li>No</li> </ul></li> <li>Sex industry will be a major driver of internet technologies and technology in general &quot;for the next 30 years&quot; (up through 2028) <ul> <li>No; porn was a major driver of internet technology up to the mid 90s by virtue of being a huge fraction of internet commerce, but this was already changing when Dixon was writing the book (IIRC, mp3 surpassed sex as the top internet search term in 1999) and the non-sex internet economy dwarfs the sex internet economy, so sex sites are no longer major drivers of tech innovation, e.g., youtube's infra drives cutting edge work in a way that pornhub's infra has no need to</li> </ul></li> <li>The internet will end income tax as we know it by 2020 because transactions will be untraceable <ul> <li>No</li> </ul></li> <li>By 2020, sales and property taxes will have replaced income tax due to the above <ul> <li>No</li> </ul></li> <li>All new homes in western countries will be &quot;intelligent&quot; in 2010, which includes things like the washing machine automatically calling a repair person to get repaired when it has a problem, etc. 
<ul> <li>No; I've lived in multiple post-2010 builds and none of them have been &quot;intelligent&quot;</li> </ul></li> <li>Pervasive networking via power outlets by 2005, allowing you to plug into any power outlet &quot;in every building anywhere in the world&quot; to get networking <ul> <li>No</li> </ul></li> <li>PC or console as &quot;smart home&quot; brains by one of the above timelines <ul> <li>No</li> </ul></li> <li>Power line networking eliminates other network technologies in the home <ul> <li>No</li> </ul></li> <li>No more ordering of food by 2000; scanner in rubbish bin will detect when food is used up and automatically order food <ul> <li>No; nonsensical idea even if such scanners were reliable and ubiquitous since the system would only know what food was used, not what food the person wants in the future</li> </ul></li> <li>World will be dominated by the largest telecom companies <ul> <li>No; Dixon's idea was that the importance of the internet and networks would mean that telecom companies would dominate the world, an argument analogous to when people say software companies must grow in importance because software will grow in importance; instead, telecom became a commodity</li> </ul></li> <li>Power companies will compete with telecoms and high voltage lines will carry major long haul traffic by 2001 <ul> <li>No</li> </ul></li> <li>Internet will replace the telephone <ul> <li>Yes</li> </ul></li> <li>Mobile phone costs drop so rapidly that they're free by 2000 <ul> <li>No; arguably yes because some cell phone providers were providing phones free with contract at one point, but once total costs were added up, these weren't cheaper than non-contract phones where those were available</li> </ul></li> <li>Phones with direct retinal displays and voice recognition very soon (prototypes already exist) <ul> <li>No</li> </ul></li> <li>The end of books; replaced by digital books with &quot;more than a hundred paper-thin electronic pages. 
Just load the text you want, settle back and enjoy&quot; <ul> <li>No; display technology isn't there, and it's unclear why something like a Kindle should have Dixon's proposed design instead of just having a one-page display</li> </ul></li> <li>Cheap printing causes print on demand in the home to also be a force in the end of books <ul> <li>No; a very trendy idea in the 90s (either in the home or at local shops), though</li> </ul></li> <li>Growth in internet radio; &quot;expect thousands of amateur disc jockeys, single-issue activists, eccentrics and misfits to be broadcasting to audiences of only a few tens to a few hundred from garages or bedrooms with virtually no equipment other than a hi-fi, a PC, modem, and a microphone, possibly with TV camera&quot; <ul> <li>No; drastically underestimated how many people would broadcast and/or stream</li> </ul></li> <li>Mainstream TV companies will lose prime time viewership <ul> <li>Not scoring this prediction because it's an extremely boring prediction; as Dixon notes in the book, this had already started happening years before he wrote it</li> </ul></li> <li>By 2010, doctors will de facto be required to defer to computers for diagnoses because computer diagnoses will be so much better than human diagnoses that the legal liability for overruling the computer with human judgement will be prohibitive <ul> <li>No</li> </ul></li> <li>Surgeons will be judged on how many people die during operations, which will cause surgeons to avoid operating on patients with likely poor outcomes <ul> <li>No</li> </ul></li> <li>Increased education; &quot;several graduate or postgraduate courses in a lifetime&quot; <ul> <li>No</li> </ul></li> <li>Paper credentials devalued, replaced by emphasis on &quot;skills not created by studying books&quot; <ul> <li>No</li> </ul></li> <li>Governments set stricter targets for literacy, education, etc. <ul> <li>No, or at least not in general for serious targets that are intended to be met</li> </ul></li> <li>Many lawsuits from people who received poor education <ul> <li>No</li> </ul></li> <li>Return to single-sex schools, at least regionally in some areas <ul> <li>No</li> </ul></li> <li>&quot;complete rethink about punishment and education, with the recognition that a no-touch policy isn't working&quot;, by 2005 <ul> <li>No</li> </ul></li> <li>Collapse of black-white integration in schooling in U.S.
cities <ul> <li>No</li> </ul></li> <li>College libraries become irrelevant <ul> <li>No, or no more so than when the book was written, anyway</li> </ul></li> <li>Ubiquitous video phones and video phone usage by 2005 <ul> <li>No</li> </ul></li> <li>Dense multimedia and VR experiences in grocery stores <ul> <li>No</li> </ul></li> <li>General consolidation of retail, except for &quot;corner shops&quot;, which will survive as car-use restrictions &quot;begin to bite&quot;, circa 2010 or so <ul> <li>No</li> </ul></li> <li>Blanket loyalty programs at grocery stores replaced by customized per-person programs <ul> <li>No</li> </ul></li> <li>VR dominates arcades and theme parks by 2010 <ul> <li>No</li> </ul></li> <li>&quot;all complex prototyping [for manufacturing]&quot; done in VR by 2000 <ul> <li>No</li> </ul></li> <li>Rapid prototyping from VR images <ul> <li>No</li> </ul></li> <li>Pervasive use of voice recognition will cause open offices to get redesigned by 2002 <ul> <li>No</li> </ul></li> <li>Speech recognition to have replaced typing to the extent that typing is considered obsolete and inefficient by 2008, except in cases where silence is necessary <ul> <li>No</li> </ul></li> <li>Accurate handwriting recognition will exist but become irrelevant by 2008, obsoleted by speech recognition <ul> <li>No</li> </ul></li> <li>Traditional banking wiped out by the internet <ul> <li>No</li> </ul></li> <li>&quot;millions&quot; of people will buy and sell directly to and from each other via online marketplaces <ul> <li>Not counting this because ebay alone already had 2 million users when the book was published</li> </ul></li> <li>Traditional brokerage services will become less important over time; more trading will happen via cheap or discount brokerages, online <ul> <li>Yes, but an extremely boring prediction that was already coming to pass when the book was written</li> </ul></li> <li>Pervasive corporate espionage, an increase over prior eras, made possible by bugs becoming smaller and easier to place, etc. <ul> <li>No? Hard to judge this one, though</li> </ul></li> <li>Pervasive internal corporate surveillance (microphones and hidden cameras everywhere, including the homes of employees), to fight corporate espionage <ul> <li>No</li> </ul></li> <li>Retina scans commonly used to verify identity <ul> <li>No</li> </ul></li> <li>Full self-driving cars, networked with each other, etc. <ul> <li>No, or not yet anyway</li> </ul></li> <li>Cars physically linked together to form trains on the road <ul> <li>No</li> </ul></li> <li>Widespread tagging of humans with identity chips by 2010 <ul> <li>No</li> </ul></li> </ul> <p>This marks the end of the &quot;Fast&quot; chapter. From having skimmed the rest of the book, the hit rate isn't really higher later nor is the style of reasoning any different, so I'm going to avoid doing a prediction-by-prediction grading.
Instead, I'll just mention a few highlights (some quite accurate, but mostly not; not included in the prediction accuracy rate since I didn't ensure consistent or random sampling):</p> <ul> <li>Extremely limited water supply by 2020, with widespread water metering, recycling of used bathwater, etc.; water so limited that major nations have conflicts over water and water is a major foreign policy instrument by 2010; waterless cleaning of fabrics, etc., by 2025</li> <li>Return to &quot;classic&quot; pop-Christian American family and cultural values, increased stigmatization of single parent households, etc., by 2020</li> <li>Major prohibition movement against smoking, drinking, psychedelic drugs, etc.</li> <li>Increased risk of major disease epidemics due to higher global population and increased mobility</li> <li>Due to increasing tribalism, most new wealth created by companies with &lt;= 20 employees, of which &gt;= 75% are family owned or controlled and started with family money</li> <li>Increased global free trade</li> <li>Death of &quot;old economics&quot; allows for (for example) low unemployment with no inflationary pressure due to combination of globalization pushing down wages and computerization causing productivity increases</li> <li>Travel will have virtually no friction by 2000 due to increased automation; you'll be able to buy a plane ticket online, go to the airport, where a scanner will scan you as you walk through security without delay; you'll even be able to skip the ticket buying process and just walk directly onto a plane, at which point a system will scan an embedded smart-card in your watch or skin and seamlessly deduct the payment from your bank account</li> <li>End of left/right politics and rise of single-issue politics and parties [presumably referring to U.S. politics here]</li> <li>Environmentalism the single biggest political issue</li> <li>Destruction of ozone layer causes people to avoid sun; vacations in sunny areas and beaches no longer popular</li> <li>Very accurate weather predictions by 2008, due to newly collected data allowing accurate forecasting</li> <li>Nuclear power dead, with zero or close to zero active reactors by 2030</li> <li>Increased concern over damage / cancer from &quot;electromagnetic fields&quot;</li> <li>Noise canceling technology wipes out unpleasant noise in cars and homes</li> <li>Widespread market for human cloning, with people often raising a genetic clone of themselves instead of conceiving traditionally</li> <li>Have the capability to design custom viruses / plagues that target particular organs or racial groups by 2010</li> <li>Comprehensive reform of U.S.
legal system to reduce / eliminate spurious lawsuits by 2010</li> <li>Major growth of religions; particularly Islam and Christianity <ul> <li>Globally, as well as in the U.S., where the importance of Christianity will give rise to things like &quot;the Christian Democratic Party&quot; and an increasing number of Christian schools</li> </ul></li> <li>The internet helps guarantee freedom against authoritarian regimes, which can censor newspapers, radio, and TV, but not the internet</li> <li>Total globalization will cause a new world religion to be created which doesn't come from old ideas and will market itself as dogmatic, exclusive, and superior to old religions</li> <li>New world order with international laws and international courts; international trade impossible otherwise</li> <li>&quot;Cyberspace&quot; has its own governance, with a &quot;cyber-government&quot; and calls for democracy where each email address gets a vote; nation-level governance over &quot;cyberspace&quot; &quot;cannot and will not last, nor will any other benevolent dictatorship of non-elected, unrepresentative authority&quot;<br /></li> </ul> <p>Overall accuracy, 8/79 = 10%</p> <h4 id="toffler">Toffler</h4> <p>Intro to Future Shock:</p> <blockquote> <p>Another reservation has to do with the verb &quot;will.&quot; No serious futurist deals in &quot;predictions.&quot; These are left for television oracles and newspaper astrologers. ... Yet to enter every appropriate qualification in a book of this kind would be to bury the reader under an avalanche of maybes. Rather than do this, I have taken the liberty of speaking firmly, without hesitation, trusting that the intelligent reader will understand the stylistic problem. The word &quot;will&quot; should always be read as though it were preceded by &quot;probably&quot; or &quot;in my opinion.&quot; Similarly, all dates applied to future events need to be taken with a grain of judgment.</p> </blockquote> <p>[Chapter 1 is about how future shock is going to be a big deal in the future and how we're presently undergoing a revolution]</p> <p>Despite the disclaimer in the intro, there are very few concrete predictions. The first that I can see is in the middle of chapter two and isn't even really a prediction, but is a statement that very weakly implies world population growth will continue at the same pace or accelerate. Chapter 1 has a lot of vague statements about how severe future shock will be, and then Chapter 2 discusses how the world is changing at an unprecedented rate and cites a population doubling time of eleven years to note how much this must change the world since it would require the equivalent of a new Tokyo, Hamburg, Rome, and Rangoon in eleven years, illustrating how shockingly rapidly the world is changing. There's a nod to the creation of future subterranean cities, but stated weakly enough that it can't really be called a prediction.</p> <p>There's a similar implicit prediction that economic growth will continue with a doubling time of fifteen years, meaning that by the time someone is thirty, the amount of stuff (and it's phrased as amount of stuff and not wealth) will have quadrupled and then by the time someone is seventy it will have increased by a factor of thirty two.
This is a stronger implicit prediction than the previous one since the phrasing implies this growth rate should continue for at least seventy years and is perhaps the first actual prediction in the book.</p> <p>Another such prediction appears later in the chapter, on the speed of travel, which took millions of years to reach 100 mph in the 1880s, only fifty-eight years to reach 400 mph in 1938, and then twenty years to double again, and then not much more time before rockets could propel people at 4000 mph and people circled the earth at 18000 mph. Strictly speaking, no prediction is made as to the speed of travel in the future, but since the two chapters are about how this increased rate of change will, in the future, cause future shock, citing examples where exponential growth is expected to level off as reasons the future is going to cause future shock would be silly, so implicit in the citation is that the speed of travel will continue to grow.</p> <p>Toffler then goes on to cite a series of examples where, at previous times in history, the time between having an idea and applying the idea was large, shrinking as we get closer to the present, where it's very low because &quot;we have, with the passage of time, invented all sorts of social devices to hasten the process&quot;.</p> <p>Through Chapter 4, Toffler continued to avoid making concrete, specific predictions, but also implied that buildings would be more temporary and, in the United States specifically, there would be an increase in tearing down old buildings (e.g., ten year old apartment buildings) to build new ones because new buildings would be so much better than old ones that it wouldn't make sense to live in old buildings, and that schools will move to using temporary buildings that are quickly dismantled after they're no longer necessary, perhaps often using geodesic domes.</p> <p>Also, a general increase in modularity, with parts of buildings being swapped out to allow more rapid changes during the short, 25-year life of a modern building.</p> <p>Another implied prediction is that everything will be rented instead of owned, with specific examples cited of cars and homes, with an extremely rapid growth in the rate of car rentership over ownership continuing through the 70s in the then-near future.</p> <p>Through Chapter 5, Toffler continued to avoid making specific predictions, but very strongly implies that the amount of travel people will do for mundane tasks such as commuting will hugely increase, making location essentially irrelevant. As with previous implied predictions, this is based on a very rapid increase in what Toffler views as a trend and is implicitly a prediction of the then very near future, citing people who commute 50k miles in a year and 120 miles in a day and citing stats showing that miles traveled have been increasing.
When it comes to an actual prediction, Toffler makes the vague comment</p> <blockquote> <p>among those I have characterized as &quot;the people of the future,&quot; commuting, traveling, and regularly relocating one's family have become second nature.</p> </blockquote> <p>Which, if read very strictly, is technically not a prediction about the future, although it implies that people in the future will commute and travel much more.</p> <p>In a similar implicit prediction, Toffler implies that, in the future, corporations will order highly skilled workers to move to whatever location most benefits the corporation and they'll have no choice but to obey if they want to have a career.</p> <p>In Chapter 6, in a rare concrete prediction, Toffler writes</p> <blockquote> <p>When asked &quot;What do you do?&quot; the super-industrial man will label himself not in terms of his present (transient) job, but in terms of his trajectory type, the overall pattern of his work life.</p> </blockquote> <p>Some obsolete example job types that Toffler presents are &quot;machine operator&quot;, &quot;sales clerk&quot;, and &quot;computer programmer&quot;. Implicit in this section is that career changes will be so rapid and so frequent that the concept of being &quot;a computer programmer&quot; will be meaningless in the future. It's also implied that the half-life of knowledge will be so short in the future that people will no longer accumulate useful knowledge over the course of their career, and people, especially in management, shouldn't expect to move up with age and may be expected to move down with age as their knowledge becomes obsolete and they end up in &quot;simpler&quot; jobs.</p> <p>It's also implied that more people will work for temp agencies, replacing what would previously have been full-time roles. The book is highly U.S. centric and, in the book, this is considered positive for workers (it will give people more flexibility) without mentioning any of the downsides (lack of benefits, etc.). The chapter has some actual explicit predictions about how people will connect to family and friends, but the predictions are vague enough that it's difficult to say if they've been satisfied or not.</p> <p>In chapter 7, Toffler says that bureaucracies will be replaced by &quot;adhocracies&quot;. Where bureaucracies had top down power and put people into well-defined roles, in adhocracies, roles will change so frequently that people won't get stuck into defined roles. Toffler notes that a concern some people have about the future is that, since organizations will get larger and more powerful, people will feel like cogs, but says this concern is unwarranted because adhocracy will replace bureaucracy. This will also mean an end to top-down direction because the rapid pace of innovation in the future won't leave time for any top down decision making, giving workers power. Furthermore, computers will automate all mundane and routine work, leaving no more need for bureaucracy because bureaucracy will only be needed to control large groups of people doing routine work and has no place in non-routine work. It's implied that &quot;in the next twenty-five to fifty years [we will] participate in the end of bureaucracy&quot;. As Toffler was writing in 1970, his timeframe for that prediction is 1995 to 2020.</p> <p>Chapter 8 takes the theme of everything being quicker and turns it to culture.
Toffler predicts that celebrities, politicians, sports stars, famous fictional characters, best selling books, pieces of art, knowledge, etc., will all have much shorter careers and/or durations of relevance in the future. Also, new, widely used, words will be coined more rapidly than in the past.</p> <p>Chapter 9 takes the theme of everything accelerating and notes that social structures and governments are poised to break down under the pressure of rapid change, as evidenced by unrest in Berlin, New York, Turin, Tokyo, Washington, and Chicago. It's possible this is what Toffler is using to take credit for predicting the fall of the Soviet Union?</p> <p>Under the subheading &quot;The New Atlantis&quot;, Toffler predicts an intense race to own the bottom of the ocean and the associated marine life there, with entire new industries springing up to process the ocean's output. &quot;Aquaculture&quot; will be as important as &quot;agriculture&quot;, new textiles, drugs, etc., will come from the ocean. This will be a new frontier, akin to the American frontier, people will colonize the ocean. Toffler says &quot;If all this sounds too far off it is sobering to note that Dr. Walter L. Robb, a scientist at General Electric has already kept a hamster alive under water by enclosing it in a box that is, in effect, an artificial gill--a synthetic membrane that extracts air from the surrounding water while keeping the water out.&quot; Toffler gives the timeline for ocean colonization as &quot;long before the arrival of A.D. 2000&quot;.</p> <p>Toffler also predicts control over the weather starting in the 70s, that &quot;It is clearly only a matter of years&quot; before women are able to birth children &quot;without the discomfort of pregnancy&quot;.</p> <p>I stopped reading at this point because the chapters all seem very similar to each other, applying the same reasoning to different areas and the rate of accuracy of predictions didn't seem likely to increase in later chapters.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:W">I used web.archive.org to pull an older list because the current list of futurists is far too long for people to evaluate. I clicked on an arbitrary time in the past on archive.org and that list seemed to be short enough to evaluate (though, given the length of this post, perhaps that's not really true) and then looked at those futurists. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:I">While there are cases where people can make great predictions or otherwise show off expertise while making &quot;cocktail party idea&quot; level statements because it's possible to have a finely honed intuition without being able to verbalize the intuition, developing that kind of intuition requires taking negative feedback seriously in order to train your intuition, which is the opposite of what we observed with the futurists discussed in this post. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:B"><p>Ballmer is laughing with incredulity when he says this; $500 is too expensive for phone and will be the most expensive phone by far; a phone without a keyboard won't appeal to business users and won't be useful for writing emails; you can get &quot;great&quot; Windows Phone devices like the Motorola QPhone for $100, which will do everything (messaging, email, etc.), etc.</p> <p>You can see these kinds of futurist-caliber predictions all over the place in big companies. 
For example, on internal G+ at Google, Steve Yegge made a number of quite accurate predictions about what would happen with various major components of Google, such as Google cloud. If you read comments from people who are fairly senior, many disagreed with Yegge for reasons that I would say were fairly transparently bad at the time and were later proven to be incorrect by events. There's a sense in which you can say this means that what's going to happen isn't so obvious even with the right information, but <a href="https://www.patreon.com/posts/how-do-their-71735437">this really depends on what you mean by obvious</a>.</p> <p>A kind of anti-easter egg in Tetlock's Superforecasting is that Tetlock makes the &quot;smart contrarian&quot; case that the Ballmer quote is unjustly attacked since worldwide iPhone marketshare isn't all that high and he also claims that Ballmer is making a fairly measured statement that's been taken out of context, which seems plausible if you read the book and look at the out of context quote Tetlock uses but is obviously untrue if you watch the interview the quote comes from. Tetlock has mentioned that he's not a superforecaster and has basically said that he doesn't have the patience necessary to be one, so I don't hold this against him, but I do find it a bit funny that this bogus Freakonomics-style contrarian &quot;refutation&quot; is in this book that discusses, at great length, how important it is to understand the topic you're discussing.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:F"><p>Although this is really a topic for another post, I'll note that longtermists not only often operate with the same level of certainty, but also on the exact same topics, e.g., in 2001, noted longetermist Eliezer Yudkowsky said the following in a document describing Flare, his new programming language:</p> <blockquote> <p>A new programming language has to be <i>really good</i> to survive. A new language needs to represent a quantum leap just to be in the game. Well, we're going to be up-front about this: Flare is <i>really good</i>. There are concepts in Flare that have never been seen before. We expect to be able to solve problems in Flare that cannot realistically be solved in any other language. ... Back in the good old days, it may have made sense to write &quot;efficient&quot; programming languages. This, however, is a new age. The age of microwave ovens and instant coffee. The age of six-month-old companies, twenty-two-year-old CEOs and Moore's Law. The age of fiber optics. The age of <i>speed</i>. ... &quot;Efficiency&quot; is the property that determines how much hardware you need, and &quot;scalability&quot; is the property that determines whether you can throw more hardware resources at the problem. In extreme cases, lack of scalability may defeat some problems entirely; for example, any program built around 32-bit pointers may not be able to scale at all past 4GB of memory space. Such a lack of scalability forces programmer efforts to be spent on efficiency - on doing more and more with the mere 4GB of memory available. Had the hardware and software been scalable, however, more RAM could have been bought; this is not necessarily cheap but it is usually cheaper than buying another programmer. ... Scalability also determines how well a program or a language ages with time. 
Imposing a hard limit of 640K on memory or 4GB on disk drives may not seem absurd when the decision is made, but the inexorable progress of Moore's Law and its corollaries inevitably bumps up against such limits. ... Flare is a language built around the philosophy that it is acceptable to sacrifice efficiency in favor of scalability. What is important is not squeezing every last scrap of performance out of current hardware, but rather preserving the ability to throw hardware at the problem. As long as scalability is preserved, it is also acceptable for Flare to do complex, MIPsucking things in order to make things easier for the programmer. In the dawn days of computing, most computing tasks ran up against the limit of available hardware, and so it was necessary to spend a lot of time on optimizing efficiency just to make computing a bearable experience. Today, most simple programs will run pretty quickly (instantly, from the user's perspective), whether written in a fast language or a slow language. If a program is slow, the limiting factor is likely to be memory bandwidth, disk access, or Internet operations, rather than RAM usage or CPU load. ... Scalability often comes at a cost in efficiency. Writing a program that can be parallelized traditionally comes at a cost in memory barrier instructions and acquisition of synchronization locks. For small N, O(N) or O(N**2) solutions are sometimes faster than the scalable O(C) or O(N) solutions. A two-way linked list allows for constant-time insertion or deletion, but at a cost in RAM, and at the cost of making the list more awkward (O(N) instead of O(C) or O(log N)) for other operations such as indexed lookup. Tracking Flare's two-way references through a two-way linked list maintained on the target burns RAM to maintain the scalability of adding or deleting a reference. Where only ten references exist, an ordinary vector type would be less complicated and just as fast, or faster. Using a two-way linked list adds complication and takes some additional computing power in the smallest case, and buys back the theoretical capability to scale to thousands or millions of references pointing at a single target... though perhaps for such an extreme case, further complication might be necessary.</p> </blockquote> <p>As with the other Moore's law predictions of the era, this is not only wrong in retrospect, it was so obviously wrong that undergraduates were taught why this was wrong.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:C"><p>My personal experience is that, as large corporations have gotten more powerful, <a href="nothing-works/">the customer experience has often gotten significantly worse</a> as I'm further removed from a human who feels empowered to do anything to help me when I run into a real issue. And the only reason my experience can be described as merely significantly worse and not much worse is that I have enough Twitter followers that when I run into a bug that makes a major corporation's product stop working for me entirely (which happened twice in the past year), I can post about it on Twitter and it's likely someone will escalate the issue enough that it will get fixed.</p> <p>In 2005, when I interacted with corporations, it was likely that I was either directly interacting with someone who could handle whatever issue I had or that I only needed a single level of escalation to get there. 
And, in the event that the issue wasn't solvable (which never happened to me, but could happen), the market was fragmented enough that I could just go use another company's product or service. More recently, in the two cases where I had to go resort to getting support via Twitter, one of the products essentially has no peers, so my ability to use any product or service of that kind would have ended if I wasn't able to find a friend of a friend to help me or if I couldn't craft some kind of viral video / blog post / tweet / etc. In the other case, there are two companies in the space, but one is much larger and offers effective service over a wider area, so I would've lost the ability to use an entire class of product or service in many areas with no recourse other than &quot;going viral&quot;. There isn't a simple way to quantify whether or not this effect is &quot;larger than&quot; the improvements which have occurred and if, on balance, consumer experiences have improved or regressed, but there are enough complaints about how widespread this kind of thing is that degraded experiences should at least have some weight in the discussion, and Kurzweil assigns them zero weight.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:P"><p>If it turns out that longtermists and other current predictors of the future very heavily rely on the same techniques as futurists past, I may not write up the analysis since it will be quite long and I don't think it's very interesting to write up a very long list of obvious blunders. Per the comment above about how this post would've been more interesting if it focused on business leaders, it's a lot more interesting to write up an analysis if there are some people using reasonable methodologies that can be compared and contrasted.</p> <p>Conversely, if people predicting the future don't rely on the techniques discussed here at all, then an analysis informed by futurist methods would be a fairly straightforward negative result that could be a short Twitter thread or a very short post. As Catherine Olsson points out, longtermists draw from a variety of intellectual traditions (and I'm not close enough to longtermist culture to personally have an opinion of the relative weights of these traditions):</p> <blockquote> <p>Modern 'longtermism' draws on a handful of intellectual traditions, including historical 'futurist' thinking, as well as other influences ranging from academic philosophy of population ethics to Berkeley rationalist culture.</p> <p>To the extent that 'longtermists' today are using similar prediction methods to historical 'futurists' in particular, [this post] bodes poorly for longtermists' ability to anticipate technological developments in the coming decades</p> </blockquote> <p>If there's a serious &quot;part 2&quot; to this post, we'll look at this idea and others but, for the reasons mentioned above, there may not be much of a &quot;part 2&quot; to this post.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:N"><a href="https://nostalgebraist.tumblr.com/post/692246981744214016/more-on-metaculus-badness">This post by nostalgebraist gives another example of this</a>, where metaculus uses Brier scores for scoring, just like Tetlock did for his Superforecasting work. 
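For reference, the Brier score of a set of forecasts is just the mean squared error between the forecast probabilities and the 0/1 outcomes of the things being forecast, which only makes sense when each forecast is paired with a well-defined question that resolves to yes or no; a minimal sketch (mine, not from the linked post): <pre><code># Toy Brier score: mean squared error between forecast probabilities and
# 0/1 outcomes. Lower is better; always guessing 0.5 scores 0.25.
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.9, 0.5, 0.1], [1, 0, 0]))  # 0.09
</code></pre>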
This gives it an air of credibility until you look at what's actually being computed, which is not something that's meaningful to take a Brier score over, meaning the result of using this rigorous, Superforecasting-approved, technique is nonsense; exactly the kind of thing McElreath warns about. <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> </ol> </div> In defense of simple architectures simple-architectures/ Wed, 06 Apr 2022 00:00:00 +0000 simple-architectures/ <p>Wave is a $1.7B company with 70 engineers<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup> whose product is a CRUD app that adds and subtracts numbers. In keeping with this, our architecture is a standard CRUD app architecture, a Python monolith on top of Postgres. <a href="https://twitter.com/danluu/status/1462607028585525249">Starting with a simple architecture and solving problems in simple ways</a> where possible has allowed us to scale to this size while engineers mostly focus on work that delivers value to users.</p> <p>Stackoverflow scaled up a monolith to good effect (<a href="https://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow/">2013 architecture</a> / <a href="https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/">2016 architecture</a>), eventually getting acquired for $1.8B. If we look at traffic instead of market cap, Stackoverflow is among the top 100 highest traffic sites on the internet (for many other examples of valuable companies that were built on top of monoliths, <a href="https://twitter.com/danluu/status/1498678300163588096">see the replies to this Twitter thread</a>. We don’t have a lot of web traffic because we’re a mobile app, but Alexa still puts our website in the top 75k even though our website is basically just a way for people to find the app and most people don’t even find the app through our website).</p> <p>There are some kinds of applications that have demands that would make a simple monolith on top of a boring database a non-starter but, for most kinds of applications, even at top-100 site levels of traffic, computers are fast enough that high-traffic apps can be served with simple architectures, which can generally be created more cheaply and easily than complex architectures.</p> <p>Despite the unreasonable effectiveness of simple architectures, most press goes to complex architectures. For example, at a recent generalist tech conference, there were six talks on how to build or deal with side effects of complex, microservice-based, architectures and zero on how one might build out a simple monolith. There were more talks on quantum computing (one) than talks on monoliths (zero). Larger conferences are similar; a recent enterprise-oriented conference in SF had a double-digit number of talks on dealing with the complexity of a sophisticated architecture and zero on how to build a simple monolith. Something that was striking to me the last time I attended that conference is how many attendees who worked at enterprises with low-scale applications that could’ve been built with simple architectures had copied the latest and greatest sophisticated techniques that are popular on the conference circuit and HN.</p> <p>Our architecture is so simple I’m not even going to bother with an architectural diagram. 
Instead, I’ll discuss a few boring things we do that help us keep things boring.</p> <p>We’re currently using boring, synchronous, Python, which means that our server processes block while waiting for I/O, like network requests. We previously tried Eventlet, an async framework that would, in theory, let us get more efficiency out of Python, but ran into so many bugs that we decided the CPU and latency cost of waiting for events wasn’t worth the operational pain we had to take on to deal with Eventlet issues. There are other <a href="https://twitter.com/mcfunley/status/1194713713330122752">well-known async frameworks for Python</a>, but users of those frameworks often also report <a href="https://twitter.com/mcfunley/status/1194715290841432064">significant fallout from using those frameworks at scale</a>. Using synchronous Python is expensive, in the sense that we pay for CPU that does nothing but wait during network requests, but since we’re only handling billions of requests a month (for now), the cost of this is low even when using a slow language, like Python, and paying retail public cloud prices. The cost of our engineering team completely dominates the cost of the systems we operate<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>.</p> <p>Rather than take on the complexity of making our monolith async, we farm out long-running tasks (that we don’t want responses to block on) to a queue.</p> <p>A place where we can’t be as boring as we’d like is with our on-prem datacenters. When we were operating solely in Senegal and Côte d'Ivoire, we operated fully in the cloud, but as we expand into Uganda (and more countries in the future), we’re having to split our backend and deploy on-prem to comply with local data residency laws and regulations. That's not exactly a simple operation, but as anyone who's done the same thing with a complex service-oriented architecture knows, this operation is much simpler than it would've been if we had a complex service-oriented architecture.</p> <p>Another area is with software we’ve had to build (instead of buy). When we started out, we strongly preferred buying software over building it because a team of only a few engineers can’t afford the time cost of building everything. That was the right choice at the time even though <a href="nothing-works/">the “buy” option generally gives you tools that don’t work</a>. In cases where vendors can’t be convinced to fix showstopping bugs that are critical blockers for us, <a href="in-house/">it does make sense to build more of our own tools and maintain in-house expertise in more areas</a>, in contradiction to the standard advice that a company should only choose to “build” in its core competency. Much of that complexity is complexity that we don’t want to take on, but in some product categories, even after fairly extensive research we haven’t found any vendor that seems likely to provide a product that works for us. To be fair to our vendors, the problem they’d need to solve to deliver a working solution to us is much more complex than the problem we need to solve since our vendors are taking on the complexity of solving a problem for every customer, whereas we only need to solve the problem for one customer, ourselves.</p> <p>A mistake we made in the first few months of operation that has some cost today was not carefully delimiting the boundaries of database transactions.
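<p>(As a hypothetical illustration of the alternative, not Wave's actual code: explicitly delimited transactions in SQLAlchemy look roughly like the sketch below. The <code>Account</code> model, <code>transfer_funds</code>, and the in-memory SQLite engine are made up to keep the example self-contained; the real thing would point at Postgres.)</p> <pre><code># Hypothetical sketch, not Wave's code: one explicitly scoped transaction per
# logical operation, instead of a request-global session that any helper
# function can commit mid-request.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Account(Base):  # made-up model for illustration
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    balance = Column(Integer, nullable=False, default=0)

engine = create_engine("sqlite://")  # in-memory database so the sketch runs standalone
Base.metadata.create_all(engine)

def transfer_funds(from_id: int, to_id: int, amount: int) -> None:
    # Everything inside the block commits together on success and rolls back
    # on any exception, so the transaction boundary is easy to see and control.
    with Session(engine) as session, session.begin():
        src = session.get(Account, from_id)
        dst = session.get(Account, to_id)
        src.balance -= amount
        dst.balance += amount

# Seed two accounts, then move money between them.
with Session(engine) as session, session.begin():
    session.add_all([Account(id=1, balance=100), Account(id=2, balance=0)])

transfer_funds(1, 2, 30)
</code></pre>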
In Wave’s codebase, by contrast, the SQLAlchemy database session is a request-global variable; it implicitly begins a new database transaction any time a DB object’s attribute is accessed, and any function in Wave’s codebase can call commit on the session, causing it to commit all pending updates. This makes it difficult to control the time at which database updates occur, which increases our rate of subtle data-integrity bugs, as well as making it harder to lean on the database to build things like <a href="https://brandur.org/idempotency-keys">idempotency keys</a> or a <a href="https://brandur.org/job-drain">transactionally-staged job drain</a>. It also increases our risk of accidentally holding open long-running database transactions, which can <a href="https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/">make schema migrations operationally difficult</a>.</p> <p>Some choices that we’re unsure about (in that these are things we’re either thinking about changing, or would recommend to other teams starting from scratch to consider a different approach) were using RabbitMQ (for our purposes, Redis would probably work equally well as a task queue and just using Redis would reduce operational burden), using Celery (which is overcomplicated for our use case and has been implicated in several outages, e.g., due to backwards compatibility issues during version upgrades), using SQLAlchemy (which makes it hard for developers to understand what database queries their code is going to emit, leading to various situations that are hard to debug and involve unnecessary operational pain, especially related to the above point about database transaction boundaries), and using Python (which was the right initial choice because of our founding CTO’s technical background, but its concurrency support, performance, and extensive dynamism make us question whether it’s the right choice for a large-scale backend codebase). None of these was a major mistake, and for some (e.g. Python) the downsides are minimal enough that it’s cheaper for us to continue to pay the increased maintenance burden than to invest in migrating to something theoretically better, but if we were starting a similar codebase from scratch today we’d think hard about whether they were the right choice.</p> <p>Some areas where we’re happy with our choices even though they may not sound like the simplest feasible solution are with our API, where we use GraphQL, with our transport protocols, where we had a custom protocol for a while, and with our host management, where we use Kubernetes. For our transport protocols, we used to use a custom protocol that ran on top of UDP, with an SMS and USSD fallback, <a href="https://www.youtube.com/watch?v=EAxnA9L5rS8">for the performance reasons described in this talk</a>. With the rollout of HTTP/3, we’ve been able to replace our custom protocol with HTTP/3 and we generally only need USSD for events like the recent internet shutdowns in Mali.</p> <p>As for using GraphQL, we believe the pros outweigh the cons for us:</p> <p>Pros:</p> <ul> <li>Self-documentation of exact return type</li> <li>Code generation of exact return type leads to safer clients</li> <li>GraphiQL interactive explorer is a productivity win</li> <li>Our various apps (user app, support app, Wave agent app, etc.)
can mostly share one API, reducing complexity</li> <li>Composable query language allows clients to fetch exactly the data they need in a single packet roundtrip without needing to build a large number of special-purpose endpoints</li> <li>Eliminates bikeshedding over what counts as a RESTful API</li> </ul> <p>Cons:</p> <ul> <li>GraphQL libraries weren’t great when we adopted GraphQL (the base Python library was a port of the Javascript one so not Pythonic, Graphene required a lot of boilerplate, and Apollo-Android produced very poorly optimized code)</li> <li>Default GQL encoding is redundant and we care a lot about limiting size because many of our customers have low bandwidth</li> </ul> <p>As for Kubernetes, we use Kubernetes because we knew that, if the business was successful (which it has been) and we kept expanding, we’d eventually expand to countries that require us to operate our services in country. The exact regulations vary by country, but we’re already expanding into one major African market that requires we operate our “primary datacenter” in the country and there are others with regulations that, e.g., require us to be able to fail over to a datacenter in the country.</p> <p>An area where there’s unavoidable complexity for us is with telecom integrations. In theory, we would use a SaaS SMS provider for everything, but <a href="https://youtu.be/6tb8ALAvodM?t=196">the major SaaS SMS provider doesn’t operate everywhere in Africa</a> and the cost of using them everywhere would be prohibitive<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">3</a></sup>. The earlier comment on how the compensation cost of engineers dominates the cost of our systems wouldn’t be true if we used a SaaS SMS provider for all of our SMS needs; the team that provides telecom integrations pays for itself many times over.</p> <p>By keeping our application architecture as simple as possible, we can spend our complexity (and headcount) budget in places where there’s complexity that it benefits our business to take on. Taking the idea of doing things as simply as possible unless there’s a strong reason to add complexity has allowed us to build a fairly large business with not all that many engineers despite running an African finance business, which is generally believed to be a tough business to get into, which we’ll discuss in a future post (one of our earliest and most helpful advisers, who gave us advice that was critical in Wave’s success, initially suggested that Wave was a bad business idea and the founders should pick another one because he foresaw so many potential difficulties).</p> <p><em>Thanks to Ben Kuhn, Sierra Rotimi-Williams, June Seif, Kamal Marhubi, Ruthie Byers, Lincoln Quirk, Calum Ball, John Hergenroeder, Bill Mill, Sophia Wisdom, and Finbarr Timbers for comments/corrections/discussion.</em></p> <p><link rel="canonical" href="simple-architecture"/></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">If you want to compute a ratio, we had closer to 40 engineers when we last fundraised and were valued at $1.7B. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:B">There are business models for which this wouldn't be true, e.g., if we were an ad-supported social media company, the level of traffic we'd need to support our company as it grows would be large enough that we'd incur a significant financial cost if we didn't <a href="sounds-easy/">spend a significant fraction of our engineering time on optimization and cost reduction work</a>.
But, as a company that charges real money for a significant fraction of interactions with an app, our computational load per unit of revenue is very low compared to a social media company and it's likely that this will be a minor concern for us until we're well over an order of magnitude larger than we are now; it's not even clear that this would be a major concern if we were two orders of magnitude larger, although it would definitely be a concern at three orders of magnitude growth. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:C">Despite the classic advice about how one shouldn’t compete on price, we (among many other things) do compete on price and therefore must care about costs. We’ve driven down the cost of mobile money in Africa and <a href="https://www.wave.com/en/blog/world/">our competitors have had to slash their prices to match our prices, which we view as a positive value for the world</a> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> </ol> </div> Why is it so hard to buy things that work well? nothing-works/ Mon, 14 Mar 2022 00:00:00 +0000 nothing-works/ <p>There's a <a href="cocktail-ideas/">cocktail party version</a> of the efficient markets hypothesis I frequently hear that's basically, &quot;markets enforce efficiency, so it's not possible that a company can have some major inefficiency and survive&quot;. <a href="tech-discrimination/">We've previously discussed Marc Andreessen's quote that tech hiring can't be inefficient here</a> and <a href="talent/">here</a>:</p> <blockquote> <p>Let's launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that's the case. ... No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They're like lying on the beach gasping because they can't get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.</p> </blockquote> <p>Variants of this idea that I frequently hear engineers and VCs repeat involve companies being efficient and/or products being basically as good as possible because, if it were possible for them to be better, someone would've outcompeted them and done it already<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>There's a vague plausibility to that kind of statement, which is why it's <a href="cocktail-ideas/">a debate I've often heard come up in casual conversation</a>, where one person will point out some obvious company inefficiency or product error and someone else will respond that, if it's so obvious, someone at the company would have fixed the issue or another company would've come along and won based on being more efficient or better. Talking purely abstractly, it's hard to settle the debate, but things are clearer if we look at some specifics, as in the two examples above about hiring, where we can observe that, whatever abstract arguments people make, inefficiencies persisted for decades.</p> <p>When it comes to buying products and services, at a personal level, most people I know who've checked the work of people they've hired for things like home renovation or <a href="https://twitter.com/benskuhn/status/1477072484092375040">accounting</a> have found grievous errors in the work. 
Although it's possible to find people who don't do shoddy work, <a href="https://twitter.com/pushcx/status/1499381895653699585">it's generally difficult for someone who isn't an expert in the field to determine if someone is going to do shoddy work in the field</a>. You can try to get better quality by paying more, but once you get out of the very bottom end of the market, it's frequently unclear how to trade money for quality, e.g., my friends and colleagues who've gone with large, brand name, accounting firms have paid much more than people who go with small, local, accountants and gotten a higher error rate; as a strategy, trying expensive local accountants hasn't really fared much better. The good accountants are typically somewhat expensive, but they're generally not charging the highest rates and only a small percentage of somewhat expensive accountants are good.</p> <p>More generally, in many markets, consumers are uninformed and <a href="why-benchmark/#appendix-capitalism">it's fairly difficult to figure out which products are even half decent, let alone good</a>. When people happen to choose a product or service that's right for them, it's often for the wrong reasons. For example, in my social circles, there have been two waves of people migrating from iPhones to Android phones over the past few years. Both waves happened due to Apple PR snafus which caused a lot of people to think that iPhones were terrible at something when, in fact, they were better at that thing than Android phones. Luckily, iPhones aren't strictly superior to Android phones and many people who switched got a device that was better for them because they were previously using an iPhone due to good Apple PR, causing their errors to cancel out. But, when people are mostly making decisions off of marketing and PR and don't have access to good information, there's no particular reason to think that a product being generally better or even strictly superior will result in that winning and the worse product losing. In capital markets, we don't need all that many informed participants to think that some form of the efficient market hypothesis holds ensuring &quot;prices reflect all available information&quot;. It's a truism that published results about market inefficiencies stop being true the moment they're published because people exploit the inefficiency until it disappears. But with the job market examples, even though firms can take advantage of mispriced labor, as Greenspan famously did before becoming Chairman of the fed, inefficiencies can persist:</p> <blockquote> <p>Townsend-Greenspan was unusual for an economics firm in that the men worked for the women (we had about twenty-five employees in all). My hiring of women economists was not motivated by women's liberation. It just made great business sense. I valued men and women equally, and found that because other employers did not, good women economists were less expensive than men. Hiring women . . . gave Townsend-Greenspan higher-quality work for the same money . . 
.</p> </blockquote> <p>But as we also saw, individual firms exploiting mispriced labor have a limited demand for labor and inefficiencies can persist for decades because the firms that are acting on &quot;all available information&quot; don't buy enough labor to move the price of mispriced people to where it would be if most or all firms were acting rationally.</p> <p>In the abstract, it seems that, with products and services, inefficiencies should also be able to persist for a long time since, similarly, there also isn't a mechanism that allows actors in the system to exploit the inefficiency in a way that directly converts money into more money, and sometimes there isn't really even a mechanism to make almost any money at all. For example, if you observe that it's silly for people to move from iPhones to Android phones because they think that Apple is engaging in nefarious planned obsolescence when Android devices generally become obsolete more quickly, due to a combination of <a href="android-updates/">iPhones getting updates for longer</a> and iPhones being faster at every price point they compete at, allowing the phone to be used on <a href="web-bloat/">bloated sites</a> <a href="https://twitter.com/danluu/status/919423480776613888">for longer</a>, you can't really make money off of this observation. This is unlike a mispriced asset that you can buy derivatives of to make money (in expectation).</p> <p>A common suggestion for the problem of not knowing what product or service is good is to ask an expert in the field or a credentialed person, but <a href="https://twitter.com/danluu/status/1324495890548056065">this often fails as well</a>. For example, a friend of mine had trouble sleeping because his window air conditioner was loud and would wake him up when it turned on. He asked a trusted friend of his who works on air conditioners if this could be improved by getting a newer air conditioner and his friend said &quot;no; air conditioners are basically all the same&quot;. But any consumer who's compared items with motors in them would immediately know that this is false. Engineers have gotten much better at producing quieter devices when holding power and cost constant. My friend eventually bought a newer, quieter, air conditioner, which solved his sleep problem, but he had the problem for longer than he needed to because he assumed that someone whose job it is to work on air conditioners would give him non-terrible advice about air conditioners. If my friend were an expert on air conditioners or had compared the noise levels of otherwise comparable consumer products over time, he could've figured out that he shouldn't trust his friend, but if he had that level of expertise, he wouldn't have needed advice in the first place.</p> <p>So far, we've looked at the difficulty of getting the right product or service at a personal level, but this problem also exists at the firm level and is often worse because the markets tend to be thinner, with fewer products available as well as opaque, &quot;call us&quot; pricing. Some commonly repeated advice is that firms should focus on their &quot;core competencies&quot; and outsource everything else (e.g., Joel Spolsky, Gene Kim, Will Larson, Camille Fournier, etc., all say this), but if we look at mid-sized tech companies, we can see that they often need to have in-house expertise that's far outside what anyone would consider their core competency unless, e.g., <a href="in-house/">every social media company has kernel expertise as a core competency</a>.
In principle, firms can outsource this kind of work, but people I know who've relied on outsourcing, e.g., kernel expertise to consultants or application engineers on a support contract, have been very unhappy with the results compared to what they can get by hiring dedicated engineers, both in absolute terms (support frequently doesn't come up with a satisfactory resolution in weeks or months, even when it's an issue a good engineer could solve in days) and for the money (despite engineers being expensive, large support contracts can often cost more than an engineer while delivering worse service than an engineer).</p> <p>This problem exists not only for support but also for products a company could buy instead of build. For example, Ben Kuhn, the CTO of Wave, has <a href="https://twitter.com/benskuhn/status/1324493738304036864">a Twitter thread about some of the issues we've run into at Wave</a>, with <a href="https://twitter.com/benskuhn/status/1382325921311563779">a couple</a> of <a href="https://twitter.com/benskuhn/status/1496167942744346624">followups</a>. Ben now believes that one of the big mistakes he made as CTO was not putting much more effort into vendor selection, even when the decision appeared to be a slam dunk, and more strongly considering moving many systems to custom in-house versions sooner. Even after selecting the consensus best product in the space from the leading (as in largest and most respected) firm, and using the main offering the company has, the product often not only doesn't work but, by design, can't work.</p> <p>For example, we tried &quot;buy&quot; instead of &quot;build&quot; for a product that syncs data from Postgres to Snowflake. Syncing from Postgres is the main offering (as in the offering with the most customers) from a leading data sync company, and we found that it would lose data, duplicate data, and corrupt data. After digging into it, it turns out that the product has a design that, among other issues, relies on the data source being able to seek backwards on its changelog. But Postgres throws changelogs away once they're consumed, so the Postgres data source can't support this operation (see the short sketch below). When their product attempts to do this and the operation fails, we end up with the sync getting &quot;stuck&quot;, which requires manual intervention from the vendor's operator and/or results in data loss. Since our data is still on Postgres, it's possible to recover from this by doing a full resync, but the data sync product tops out at 5MB/s for reasons that appear to be unknown to them, so a full resync can take days even on databases that aren't all that large. Resyncs will also silently drop and corrupt data, so multiple cycles of full resyncs followed by data integrity checks are sometimes necessary to recover from data corruption, which can take weeks. Despite being widely recommended and the leading product in the space, the product has a number of major design flaws that mean that it literally cannot work.</p> <p>This isn't so different from Mongo or other products that had fundamental design flaws that caused severe data loss, with <a href="why-benchmark/">the main difference being that, in most areas, there isn't a Kyle Kingsbury who spends years publishing tests on various products in the field, patiently responding to bogus claims about correctness until the PR backlash caused companies in the field to start taking correctness seriously</a>.
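<p>(To make the data sync example concrete: the design assumes it can re-read old changes from the source, but Postgres does not guarantee to keep them around. What follows is a hypothetical sketch, not the vendor's actual mechanism, using Postgres's logical decoding SQL functions; the slot name and connection string are made up, and it assumes a local Postgres with <code>wal_level=logical</code> and sufficient privileges.)</p> <pre><code># Hypothetical sketch (not the vendor's code): why "seek backwards on the
# changelog" fails against Postgres logical decoding. Once changes are read
# from a replication slot, Postgres is free to recycle the WAL behind it,
# so there is nothing to seek back to.
import psycopg2  # assumed available; connection details are made up

conn = psycopg2.connect("dbname=example")
conn.autocommit = True
cur = conn.cursor()

cur.execute(
    "SELECT pg_create_logical_replication_slot('sync_demo', 'test_decoding')"
)

# ... the application writes some rows here ...

# First read consumes the pending changes and advances the slot.
cur.execute("SELECT * FROM pg_logical_slot_get_changes('sync_demo', NULL, NULL)")
print(len(cur.fetchall()))  # the changes written above

# A second read from the "same position" returns nothing: the consumed
# changes cannot be re-read.
cur.execute("SELECT * FROM pg_logical_slot_get_changes('sync_demo', NULL, NULL)")
print(len(cur.fetchall()))  # 0

cur.execute("SELECT pg_drop_replication_slot('sync_demo')")
</code></pre> <p>A sync design that depends on re-reading old changes needs the source to retain them, which Postgres, by design, does not promise once a consumer has confirmed them; this is exactly the kind of flaw that Jepsen-style testing surfaces, and almost no one applies that kind of pressure outside of databases.</p>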
Without that pressure, most software products basically don't work, hence the Twitter threads from Ben, above, where he notes that the &quot;buy&quot; solutions you might want to choose mostly don't work<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">2</a></sup>. Of course, at our scale, there are many things we're not going to build any time soon, like CPUs, but, for many things where the received wisdom is to &quot;buy&quot;, &quot;build&quot; seems like a reasonable option. This is even true for larger companies and building CPUs. Fifteen years ago, high-performance (as in, non-embedded level of performance) CPUs were a canonical example of something it would be considered bonkers to build in-house, absurd for even the largest software companies, but Apple and Amazon have been able to produce best-in-class CPUs on the dimensions they're optimizing for, for predictable reasons<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">3</a></sup>.</p> <p>This isn't just an issue that impacts tech companies; we see this across many different industries. For example, any company that wants to mail items to customers has to either implement shipping themselves or deal with the fallout of having unreliable shipping. As a user, whether or not packages get shipped to you depends a lot on where you live and what kind of building you live in.</p> <p>When I've lived in a house, packages have usually arrived regardless of the shipper (although they've often arrived late). But, since moving into apartment buildings, I've found that some buildings just don't get deliveries from certain delivery services. Once, I lived in a building where the postal service didn't deliver mail properly and I didn't get a lot of mail (although I frequently got mail addressed to other people in the building as well as people elsewhere). More commonly, UPS and Fedex usually won't attempt to deliver and will just put a bunch of notices up on the building door for all the packages they didn't deliver, where the notice falsely indicates that the person wasn't home and correctly indicates that the person has to go to some pick-up location to get the package.</p> <p>For a while, I lived in a city where Amazon used 3rd-party commercial courier services to do last-mile shipping for same-day delivery. The services they used were famous for marking things as delivered without delivering the item for days, making &quot;same day&quot; shipping slower than next day or even two day shipping. Once, I naively contacted Amazon support because my package had been marked as delivered but wasn't delivered. Support, using a standard script supplied to them by Amazon, told me that I should contact them again three days after the package was marked as delivered because couriers often mark packages as delivered without delivering them, but they often deliver the package within a few days.
Amazon knew that the courier service they were using didn't really even try to deliver packages<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">4</a></sup> promptly and the only short-term mitigation available to them was to tell support to tell people that they shouldn't expect that packages have arrived when they've been marked as delivered.</p> <p>Amazon eventually solved this problem by having their own delivery people or using, by commercial shipping standards, an extremely expensive service (as Apple has done for same-day delivery)<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">5</a></sup>. At scale, there's no commercial service you can pay for that will reliably attempt to deliver packages. If you want a service that actually works, you're generally on the hook for building it yourself, just like in the software world. My local grocery store tried to outsource this to DoorDash. I've tried delivery 3 times from my grocery store and my groceries have shown up 2 out of 3 times, which is well below what most people would consider an acceptable hit rate for grocery delivery. Having to build instead of buy to get reliability is a huge drag on productivity, especially for smaller companies (e.g., it's not possible for small shops that want to compete with Amazon and mail products to customers to have reliable delivery since they can't build out their own delivery service).</p> <p>The amount of waste generated by the inability to farm out services is staggering and I've seen it everywhere I've worked. An example from another industry: when I worked at a small chip startup, we had in-house capability to do end-to-end chip processing (with the exception of having our own fabs), which is unusual for a small chip startup. When the first wafer of a new design came off of a fab, we'd have the wafer flown to us on a flight, at which point someone would use a wafer saw to cut the wafer into individual chips so we could start testing ASAP. This was often considered absurd in the same way that it would be considered absurd for a small software startup to manage its own on-prem hardware. After all, the wafer saw and the expertise necessary to go from a wafer to a working chip will be idle over 99% of the time. Having full-time equipment and expertise that you use less than 1% of the time is a classic example of the kind of thing you should outsource, but if you price out having people competent to do this plus having the equipment available to do it, even at fairly low volumes, it's cheaper to do it in-house even if the equipment and expertise for it are idle 99% of the time. More importantly, you'll get much better service (faster turnaround) in house, letting you ship at a higher cadence.
I've both worked at companies that have tried to contract this kind of thing out and talked with many people who've done that, and you get slower, less reliable, service at a higher cost.</p> <p>Likewise with chip software tooling; despite it being standard to outsource tooling to <a href="https://en.wikipedia.org/wiki/Electronic_design_automation">large EDA vendors</a>, we got a lot of mileage out of using our own custom tools, generally created or maintained by one person, e.g., while I was there, most simulator cycles were run on a custom simulator that was maintained by one person, which saved millions a year in simulator costs (standard pricing for a simulator at the time was a few thousand dollars per license per year and we had a farm of about a thousand simulation machines). You might think that, if a single person can create or maintain a tool that's worth millions of dollars a year to the company, our competitors would do the same thing, just like you might think that if you can ship faster and at a lower cost by hiring a person who knows how to crack a wafer open, our competitors would do that, but they mostly didn't.</p> <p><a href="https://twitter.com/danluu/status/823296808919019520">Joel Spolsky has an old post where he says</a>:</p> <blockquote> <p>“Find the dependencies — and eliminate them.” When you're working on a really, really good team with great programmers, everybody else's code, frankly, is bug-infested garbage, and nobody else knows how to ship on time.</p> </blockquote> <p>We had a similar attitude, although I'd say that we were a bit more humble. We didn't think that everyone else was producing garbage, but we also didn't assume that we couldn't produce something comparable to what we could buy for a tenth of the cost. From talking to folks at some competitors, there was a <a href="culture/">pretty big cultural difference</a> between how we operated and how they operated. It simply didn't occur to them that they didn't have to buy into the standard American business logic that you should focus on your core competencies, that you can think through whether or not it makes sense to do something in-house on the merits of the particular thing instead of outsourcing your thinking to a pithy saying.</p> <p>I once watched, from the inside, a company undergo this cultural shift. A few people in leadership decided that the company should focus on its core competencies, which meant abandoning custom software for infrastructure. This resulted in quite a few large migrations from custom internal software to SaaS solutions and open source software. If you watched the discussions on &quot;why&quot; various projects should or shouldn't migrate, there were a few unusually unreasonable people who tried to reason through particular cases on the merits of each case (in a post on pushing back against orders from the top, <a href="https://yosefk.com/blog/people-can-read-their-managers-mind.html">Yossi Kreinin calls these people insane employees</a>; I'm going to refer to the same concept in this post, but instead call people who do this unusually unreasonable). But, for the most part, people bought the party line and pushed for a migration regardless of the specifics.</p> <p>The thing that I thought was interesting was that leadership didn't tell particular teams they had to migrate and there weren't really negative consequences for teams where an &quot;unusually unreasonable person&quot; pushed back in order to keep running an existing system for reasonable reasons.
Instead, people mostly bought into the idea and tried to justify migrations for vaguely plausible-sounding reasons that weren't connected to reality, resulting in funny outcomes like moving to an open source system &quot;to save money&quot; when the new system was quite obviously less efficient<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">6</a></sup> and, predictably, required much higher capex and opex. The cost savings were supposed to come from shrinking the team, but the increase in operational cost dominated the change in the cost of the team, and the complexity of operating the system meant that the team size increased instead of decreasing. There were a number of cases where it really did make sense to migrate, but the stated reasons for migration tended to be unrelated or weakly related to the reasons it actually made sense to migrate. Once people absorbed the idea that the company should focus on core competencies, the migrations were driven by the cultural idea and not any technical reasons.</p> <p>The pervasiveness of <a href="bad-decisions/">decisions like the above, technical decisions made without serious technical consideration, is a major reason that the selection pressure on companies to make good products is so weak</a>. There is some pressure, but it's noisy enough that successful companies often route around making a product that works, like in the Mongo example from above, where Mongo's decision to loudly repeat <a href="https://aphyr.com/tags/mongodb">demonstrably bogus performance claims and make demonstrably false correctness claims</a> was, from a business standpoint, superior to focusing on actual correctness and performance; by focusing their resources where it mattered for the business, they managed to outcompete companies that made the mistake of devoting serious resources to performance and correctness.</p> <p>Yossi's post about how an unusually unreasonable person can have outsized impact in a dimension they value at their firm also applies to impact outside of a firm. Kyle Kingsbury, mentioned above, is an example of this. At the rates that I've heard <a href="https://jepsen.io/">Jepsen</a> is charging now, Kyle can bring in what a senior developer at BigCo does (actually senior, not someone with the title &quot;senior&quot;), but that was after years of working long hours at below market rates on an uncertain endeavour, refuting <a href="https://en.wikipedia.org/wiki/Fear,_uncertainty,_and_doubt">FUD</a> from his critics (if you read the replies to the linked posts or, worse yet, the actual tickets where he's involved in discussions with developers, the replies to Kyle were a constant stream of nonsense for many years, including people working for vendors <a href="https://lobste.rs/s/jgfyep/addio_redis_i_m_leaving_redis_labs#c_i6kcci">feeling like he has it out for them in particular, casting aspersions on his character</a><sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">7</a></sup>, and <a href="https://lobste.rs/s/jgfyep/addio_redis_i_m_leaving_redis_labs#c_sae3wo">generally trashing him</a>). I have a deep respect for people who are willing to push on issues like this despite the system being aligned against them, but, my respect notwithstanding, basically no one is going to do that.
A system that requires someone like Kyle to take a stand before successful firms will put effort into correctness instead of correctness marketing is going to produce a lot of products that are good at marketing correctness without really having decent correctness properties (such as the data sync product mentioned in this post, whose website repeatedly mentions how reliable and safe the syncing product is despite having a design that is fundamentally broken).</p> <p>It's also true at the firm level that it often takes an unusually unreasonable firm to produce a really great product instead of just one that's marketed as great, e.g., <a href="car-safety/">Volvo, the one car manufacturer that seemed to try to produce a level of structural safety beyond what could be demonstrated by IIHS tests</a> fared so poorly as a business that it's been forced to move upmarket and became a niche, luxury, automaker since safety isn't something consumers are really interested in despite car accidents being a leading cause of death and a significant source of life expectancy loss. And it's not clear that Volvo will be able to persist in being an unreasonable firm since they weren't able to survive as an independent automaker. When Ford acquired Volvo, <a href="car-safety/#quality-may-change-over-time">Ford started moving Volvos to the shared Ford C1 platform, which didn't fare particularly well in crash tests</a>. Since Geely has acquired Volvo, <a href="https://twitter.com/danluu/status/1458814317948661768">it's too early to tell for sure if they'll maintain Volvo's commitment to designing for real-world crash data and not just crash data that gets reported in benchmarks</a>. If Geely declines to continue Volvo's commitment to structural safety, it may not be possible to buy a modern car that's designed to be safe.</p> <p>Most markets are like this, except that there was never an unreasonable firm like Volvo in the first place. On unreasonable employees, Yossi says</p> <blockquote> <p>Who can, and sometimes does, un-rot the fish from the bottom? An insane employee. Someone who finds the forks, crashes, etc. a personal offence, and will repeatedly risk annoying management by fighting to stop these things. Especially someone who spends their own political capital, hard earned doing things management truly values, on doing work they don't truly value – such a person can keep fighting for a long time. Some people manage to make a career out of it by persisting until management truly changes their mind and rewards them. Whatever the odds of that, the average person cannot comprehend the motivation of someone attempting such a feat.</p> </blockquote> <p>It's rare that people are willing to expend a significant amount of personal capital to do the right thing, whatever that means to someone, but it's even rarer that the leadership of a firm will make that choice and spend down the firm's capital to do the right thing.</p> <p>Economists have a term for cases where information asymmetry means that buyers can't tell the difference between good products and &quot;lemons&quot;, &quot;a market for lemons&quot;, like the car market (where the term lemons comes from), or <a href="hiring-lemons/">both sides of the hiring market</a>. 
In economic discourse, there's a debate over whether cars are a market for lemons at all for a variety of reasons (lemon laws, which allow people to return bad cars, don't appear to have changed how the market operates; very few modern cars are lemons when that's defined as a vehicle with serious reliability problems; etc.). But looking at whether or not people occasionally buy a defective car is missing the forest for the trees. There's maybe one car manufacturer that really seriously tries to make a structurally safe car beyond what standards bodies test (and word on the street is that they skimp on the increasingly important software testing side of things) because consumers can't tell the difference between a more or less safe car beyond the level a few standards bodies test to. That's a market for lemons, as is nearly every other consumer and B2B market.</p> <h3 id="appendix-culture">Appendix: culture</h3> <p>Something I find interesting about American society is how many people think that someone who gets the raw end of a deal because they failed to protect themselves against every contingency &quot;deserves&quot; what happened (orgs that want to be highly effective often avoid this by having a &quot;blameless&quot; culture, but very few people have exposure to such a culture).</p> <p>Some places I've seen this recently:</p> <ul> <li>Person had a laptop stolen in a cafe; blamed for not keeping their eye on the laptop the entire time since no reasonable person would ever take their eyes off of any belongings for 10 seconds as they turned their head to briefly chat with someone</li> <li>Person posted a PSA that they were caught out by a change in the terms of service of a company and other people should be aware of the same thing, and people said that the person caught out was dumb for not reading every word of every terms of service update they're sent</li> <li>(many times, on r/idiotsincars): person gets in an accident that would've been difficult or impossible to reasonably avoid and people tell the person they're a terrible driver for not having avoided the accident <ul> <li>At least once, the person did a frame-by-frame analysis that showed that they reacted, to within one frame of latency, as fast as humanly possible, and was still told they should've avoided the accident</li> <li>Often, people will say things like &quot;I would never get into that situation in the first place&quot;, which, in the circumstance where someone is driving past a parked car, results in absurd statements like &quot;I would never pass a vehicle at more than 10mph&quot;, as if the person making the comment slows down to 10mph on every street that has parked or stopped cars on it.</li> </ul></li> <li>Person griped on flyertalk forum that Google maps instructions are unclear if you're not a robot (e.g., &quot;turn right in 500 meters&quot;, which could be one of multiple intersections) and people responded with things like &quot;I never go anywhere without being completely familiar with the route&quot; and that you should map out all of your driving beforehand, just like you would for a road trip with a paper map in 1992 (this was used as a justification for the reasonableness of mapping out all travel beforehand – I did it back then and anyone who isn't dumb would do it now) <ul> <li>People with those kinds of negative responses were highly upvoted; <a href="https://twitter.com/danluu/status/1520012665954865152">no one suggested switching to Apple Maps, which gives clear, landmark-based directions like
&quot;go through the light and then take the next right&quot;</a></li> </ul></li> </ul> <p>If you read these kinds of discussions, you'll often see people claiming &quot;that's just how the world is&quot; and going further and saying that there is no other way the world could be, so anyone who isn't prepared for that is an idiot.</p> <p>Going back to the laptop theft example, anyone who's traveled, or even read about other cultures, can observe that the things that North Americans think are basically immutable consequences of a large-scale society are arbitrary. For example, if you leave your bag and laptop on a table at a cafe in Korea and come back hours later, the bag and laptop are overwhelmingly likely to be there (<a href="https://news.ycombinator.com/item?id=30625890">I've heard this is true in Japan as well</a>). While it's rude to take up a table like that, you're not likely to have your bag and laptop stolen.</p> <p>And, in fact, if you tweak the context slightly, this is basically true in America. It's not much harder to walk into an empty house and steal things out of the house (it's fairly <a href="https://www.youtube.com/c/lockpickinglawyer">easy to learn how to pick locks</a> and even easier to just break a window) than it is to steal things out of a cafe. And yet, in most neighbourhoods in America, people are rarely burglarized, and when someone posts about being burglarized, they're not excoriated for being a moron for not having kept an eye on their house. Instead, people are mostly sympathetic. It's considered normal to have unattended property stolen in public spaces and not in private spaces, but that's more of a cultural distinction than a technical distinction.</p> <p>There's a related set of stories Avery Pennarun tells about the culture shock of being an American in Korea. One of them is about some online ordering service you can use that's sort of like Amazon. With Amazon, when you order something, you get a box with multiple bar/QR/other codes on it and, when you open it up, there's another box inside that has at least one other code on it. Of course the outer box needs the barcode because it's being shipped through some facility at-scale where no one knows what the box is or where it needs to go, and the inner box also had to go through some other kind of process and it also needs to be able to be scanned by a checkout machine if the item is sold at a retailer. Inside the inner box is the item. If you need to return the item, you put the item back into its barcoded box and then put that box into the shipping box and then slap another barcode onto the shipping box and then mail it out.</p> <p>So, in Korea, there's some service like Amazon where you can order an item and, an hour or two later, you'll hear a knock at your door. When you get to the door, you'll see an unlabeled box or bag and the item is in the unlabeled container. If you want to return the item, you &quot;tell&quot; the app that you want to return the item, put it back into its container, put it in front of your door, and they'll take it back. After seeing this shipping setup, which is wildly different from what you see in the U.S., he asked someone &quot;how is it possible that they don't lose track of which box is which?&quot;. The answer he got was, &quot;why would they lose track of which box is which?&quot;.
His other stories have a similar feel, where he describes something quite alien and asks a local how things can work in this alien way, and the local, who can't imagine things working any other way, responds with &quot;why would X not work?&quot;</p> <p>As with the laptop in cafe example, a lot of Avery's stories come down to how there are completely different shared cultural expectations around how people and organizations can work.</p> <p>Another example of this is with covid. Many of my friends have spent most of the last couple of years in Asian countries like Vietnam or Taiwan, which have had much lower covid rates, so much so that they were barely locked down at all. My friends in those countries were basically able to live normal lives, as if covid didn't exist at all (at least until the latest variants, at which point they were vaccinated and at relatively low risk for the most serious outcomes), while taking basically zero risk of getting covid.</p> <p>In most western countries, initial public opinion among many people was that locking down was pointless and there was nothing we could do to prevent an explosion of covid. Multiple engineers I know, who understand exponential growth and knew what the implications were, continued normal activities before lockdown and got and (probably) spread covid. When lockdowns were implemented, there was tremendous pressure to lift them as early as possible, resulting <a href="http://vihart.com/how-we-reopen/">in something resembling the &quot;adaptive response&quot; diagram from this post</a>. Since then, many people (I have a project tallying up public opinion on this that I'm not sure I'll ever prioritize enough to complete) have changed their opinion to &quot;having ever locked down was stupid, we were always going to end up with endemic covid, all of this economic damage was pointless&quot;. If we look at in-person retail sales data or restaurant data, we can easily see that many people were voluntarily limiting their activities before and after lockdowns in the first year or so of the pandemic when the virus was in broad circulation.</p> <p>Meanwhile, in some Asian countries, like Taiwan and Vietnam, people mostly complied with lockdowns when they were instituted, which means that they were able to squash covid in the country when outbreaks happened until relatively recently, when covid mutated into forms that spread much more easily and people's tolerance for covid risk went way up due to vaccinations. Of course, covid kept getting reintroduced into countries that were able to squash it because other countries were not, in large part due to the self-fulfilling belief that it would be impossible to squash covid.</p> <p>Coming back to when it makes sense to bring something in-house, even in cases where it superficially sounds like it shouldn't, because the expertise is 99% idle or a single person would have to be able to build software that a single firm would pay millions of dollars a year for, much of this comes down to whether or not you're in a culture where you can trust another firm's promise. If you operate in a society where it's expected that other firms will push you to the letter of the law with respect to whatever contract you've negotiated, it's frequently not worth the effort to negotiate a contract that would give you service even one half as good as you'd get from someone in house.
If you look at how these contracts end up being worded, companies <a href="https://www.patreon.com/posts/21164582">often try to sneak in terms that make the contract meaningless</a>, and even when you manage to stamp out all of that, legally enforcing the contract is expensive and, in the cases I know of where companies regularly violated their agreement for their support SLA (just for example), the resolution was to terminate the contract rather than pursue legal action because the cost of legal action wouldn't be worth anything that could be gained.</p> <p>If you can't trust other firms, you frequently don't have a choice with respect to bringing things in house if you want them to work.</p> <p>Although this is really a topic for another post, I'll note that lack of trust that exists across companies can also hamstring companies when it exists internally. <a href="culture/">As we discussed previously, a lot of larger scale brokenness also comes out of the cultural expectations within organizations</a>. A specific example of this that leads to pervasive organizational problems is lack of trust within the organization. For example, a while back, I was griping to a director that a VP broke a promise and that we were losing a lot of people for similar reasons. The director's response was &quot;there's no way the VP made a promise&quot;. When I asked for clarification, the clarification was &quot;unless you get it in a contract, it wasn't a promise&quot;, i.e., the rate at which VPs at the company lie is high enough that a verbal commitment from a VP is worthless; only a legally binding commitment that allows you to take them to court has any meaning.</p> <p>Of course, that's absurd, in that no one could operate at a BigCo while going around and asking for contracts for all their promises since they'd immediately be considered some kind of hyperbureaucratic weirdo. But, let's take the spirit of the comment seriously, which is that you should only trust people close to you. That's good advice in the company I worked for, but, unfortunately for the company, the implications are similar to the inter-firm example, where we noted that a norm where you need to litigate the letter of the law is expensive enough that firms often bring expertise in house to avoid having to deal with the details. In the intra-firm case, you'll often see teams and orgs &quot;empire build&quot; because they know that, <a href="https://amzn.to/3pDnYUr">at least at the management level, they can't trust anyone outside their fiefdom</a>.</p> <p>While this intra-firm lack of trust tends to be less costly than the inter-firm lack of trust since there are better levers to get action on an organization that's the cause of a major blocker, <a href="https://amzn.to/3pDnYUr">it's still fairly costly</a>. Virtually all of the VPs and BigCo tech execs I've talked to are so steeped in the culture they're embedded in that they can't conceive of an alternative, but there isn't an inherent reason that organizations have to work like that. I've worked at two companies where people actually trust leadership and leadership does generally follow through on commitments even when you can't take them to court, including my current employer, Wave.
But, at the other companies, the shared expectation that leadership cannot and should not be trusted &quot;causes&quot; the people who end up in leadership roles to be untrustworthy, which results in the inefficiencies we've just discussed.</p> <p>People often think that having a high degree of internal distrust is inevitable as a company scales, but people I've talked to who were in upper management or fairly close to the top of Intel and Google said that the companies had an extended time period where leadership enforced trustworthiness and that stamping out dishonesty and &quot;bad politics&quot; was a major reason the company was so successful, under Andy Grove and Eric Schmidt, respectively. When the person at the top changed and a new person who didn't enforce honesty came in, the standard cultural norms that you see at the upper levels of most big companies seeped in, but that wasn't inevitable.</p> <p>When I talk to people who haven't been exposed to BigCo leadership culture and haven't seen how decisions are actually made, they often find the decision-making processes to be unbelievable in much the same way that people who are steeped in BigCo leadership culture find the idea that a large company could operate any other way to be unbelievable.</p> <p>It's often difficult to see how absurd a system is from the inside. Another perspective on this is that Americans often find Japanese universities and <a href="https://twitter.com/danluu/status/1428848459071819779">the work practices of Japanese engineering firms</a> absurd, though often not as absurd as the <a href="https://english.hani.co.kr/arti/english_edition/e_business/823025.html">promotion policies in Korean chaebols, which are famously nepotistic</a>, e.g., Chung Mong-yong is the CEO of Hyundai Sungwoo because he's the son of Chung Soon-yung, who was the head of Hyundai Sungwoo because he was the younger brother of Chung Ju-yung, the founder of Hyundai Group (essentially the top-level Hyundai corporation), etc. But Japanese and Korean engineering firms are not, in general, less efficient than American engineering firms outside of the software industry despite practices that seem absurdly inefficient to American eyes. <a href="https://twitter.com/altluu/status/1439511588529065987">American firms didn't lose their dominance in multiple industries</a> while being more efficient; if anything, market inefficiencies allowed them to hang on to marketshare much longer than you would naively expect if you just looked at the technical merit of their products.</p> <p>There are offsetting inefficiencies in American firms that are just as absurd as effectively having familial succession of company leadership in Korean chaebols. It's just that the inefficiencies that come out of American cultural practices seem to be immutable facts about the world to people inside the system. But when you look at firms that have completely different cultures, it becomes clear that cultural norms aren't a law of nature.</p> <h3 id="appendix-downsides-of-build">Appendix: downsides of build</h3> <p>Of course, building instead of buying isn't a panacea. I've frequently seen internal designs that are just as broken as the data sync product described in this post. In general, when you see a design like that, <a href="https://twitter.com/danluu/status/1470890510998851588">a decent number of people explained why the design can never work during the design phase</a> and were ignored.
Although &quot;build&quot; gives you a lot more control than &quot;buy&quot; and gives you better odds of a product that works because you can influence the design, a dysfunctional team in a dysfunctional org can quite easily make products that don't work.</p> <p>There's a Steve Jobs quote about companies that also applies to teams:</p> <blockquote> <p>It turns out the same thing can happen in technology companies that get monopolies, like IBM or Xerox. If you were a product person at IBM or Xerox, so you make a better copier or computer. So what? When you have monopoly market share, the company's not any more successful.</p> <p>So the people that can make the company more successful are sales and marketing people, and they end up running the companies. And the product people get driven out of the decision making forums, and the companies forget what it means to make great products. The product sensibility and the product genius that brought them to that monopolistic position gets rotted out by people running these companies that have no conception of a good product versus a bad product.</p> <p>They have no conception of the craftsmanship that's required to take a good idea and turn it into a good product. And they really have no feeling in their hearts, usually, about wanting to really help the customers.</p> </blockquote> <p>For &quot;efficiency&quot; reasons, some large companies try to avoid duplicate effort and kill projects if they seem too similar to another project, giving the team that owns the canonical version of a product a monopoly. If the company doesn't have a culture of trying to do the right thing, this has the same problems that Steve Jobs discusses, but at the team and org level instead of the company level.</p> <p>The workaround a team I was on used was to basically re-implement a parallel stack of things we relied on that didn't work. But this was only possible because leadership didn't enforce basically anything. Ironically, this was despite their best efforts — <a href="https://twitter.com/altluu/status/1494696655928508418">leadership made a number of major attempts to impose top-down control</a>, but they didn't understand how to influence an organization, so the attempts failed. Had leadership been successful, the company would've been significantly worse off. 
<a href="https://twitter.com/danluu/status/1487239027002527744">There are upsides to effective top-down direction when leadership has good plans</a>, but that wasn't really on the table, so it's actually better that leadership didn't know how to execute.</p> <p><i>Thanks to Fabian Giesen, Yossi Kreinen, Peter Bhat Harkins, Ben Kuhn, Laurie Tratt, John Hergenroeder, Tao L., @softminus, Justin Blank, @deadalnix, Dan Lew, @ollyrobot, Sophia Wisdom, Elizabeth Van Nostrand, Kevin Downey, and @PapuaHardyNet for comments/corrections/discussion.</i></p> <p><link rel="prefetch" href="tech-discrimination/"> <link rel="prefetch" href="talent/"> <link rel="prefetch" href="cocktail-ideas/"> <link rel="prefetch" href="why-benchmark/"> <link rel="prefetch" href="android-updates/"> <link rel="prefetch" href="web-bloat/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">To some, that position is so absurd that it's not believable that anyone would hold that position (in response to my first post that featured the Andreessen quote, above, a number of people told me that it was an exaggerated straw man, which is impossible for a quote, let alone one that sums up a position I've heard quite a few times), but to others, it's an immutable fact about the world. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:V"><p>On the flip side, if we think about things from the vendor side of things, there's little incentive to produce working products since the combination of the fog of war plus making false claims about a product working seems to be roughly as good as making a working product (at least until someone like Kyle Kingsbury comes along, which never happens in most industries), and it's much cheaper.</p> <p>And, as Fabian Giesen points out, when vendors actually want to produce good or working products, the fog of war also makes that difficult:</p> <blockquote> <p>But producers have a dual problem, which is that all the signal you get from consumers is sporadic, infrequent and highly selected direct communication, as well as a continuous signal of how sales look over time, which is in general very hard to map back to <em>why</em> sales went up or down.</p> <p>You hear directly from people who are either very unhappy or very happy, and you might hear second-hand info from your salespeople, but often that's pure noise. E.g. with RAD products over the years a few times we had a prospective customer say, &quot;well we would license it but we really need X&quot; and we didn't have X. And if we heard that 2 or 3 times from different customers, we'd implement X and get back to them a few months later. More often than not, they'd then ask for Y next, and it would become clear over time that they just didn't want to license for some other reason and saying &quot;we need X, it's a deal-breaker for us&quot; for a couple choices of X was just how they chose to get out of the eval without sounding rude or whatever.</p> <p>In my experience that's a pretty thorny problem in general, once you spin something out or buy something you're crossing org boundaries and lose most of the ways you otherwise have to cut through the BS and figure out what's actually going on. 
And whatever communication does happen is often forced to go through a very noisy, low-bandwidth, low-fidelity, high-latency channel.</p> </blockquote> <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> <li id="fn:Q"><p>Note that even though it was somewhat predictable that a CPU design team at Apple or Amazon that was well funded had a good chance of being able to produce a best-in-class CPU (e.g., see <a href="hardware-unforgiving/">this 2013 comment about the effectiveness of Apple's team</a> and <a href="intel-cat/#spec">this 2015 comment about other mobile vendors</a>) that would be a major advantage for their firm, this doesn't mean that the same team should've been expected to succeed if they tried to make a standalone business. In fact, Apple was able to buy their core team cheaply because the team, <a href="hardware-unforgiving/">after many years at DEC and then successfully founding SiByte, founded PA Semi</a>, which basically failed as a business. Similarly, Amazon's big silicon initial hires were from Annapurna (also a failed business that was up for sale because it couldn't survive independently) and Smooth Stone (a startup that failed so badly that it didn't even need to be acquired and people could be picked up individually). Even when there's an obvious market opportunity, factors like network effects, high fixed costs, up front capital expenditures, the ability of incumbent players to use market power to suppress new competitors, etc., can and often do prevent anyone from taking the opportunity. Even though we can now clearly see that there were large opportunities available for the taking, there's every reason to believe that, based on the fates of many other CPU startups to date, an independent startup that attempted to implement the same ideas wouldn't have been nearly as successful and would most likely have gone bankrupt or taken a low offer relative to the company's value due to the company's poor business prospects.</p> <p>Also, before Amazon started shipping ARM server chips, <a href="https://www.patreon.com/posts/20571244">the most promising ARM server chip, which had pre-orders from at least one major tech company, was killed because it was on the wrong side of an internal political battle</a>.</p> <p><a href="talent/">The chip situation isn't so different from the motivating example we looked at in our last post, baseball scouting</a>, where many people observed that baseball teams were ignoring simple statistics they could use to their advantage. But none of the people observing that were in a position to run a baseball team, so the market opportunity persisted for decades.</p> <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:P"><p>Something that amuses me is how some package delivery services appear to apply relatively little effort to making sure that someone even made an attempt to deliver the package. When packages are marked delivered, there's generally a note about how it was delivered, which is frequently quite obviously wrong for the building, e.g., &quot;left with receptionist&quot; for a building with no receptionist or &quot;left on porch&quot; for an office building with no porch and a receptionist who was there during the alleged delivery time. 
You could imagine services would, like Amazon, request a photo along with &quot;proof of delivery&quot; or perhaps use GPS to check that the driver was plausibly at least in the same neighborhood as the building at the time of delivery, but they generally don't seem to do that?</p> <p>I'd guess that a lot of the fake deliveries come from having some kind of quota, one that's difficult or impossible to achieve, combined with weak attempts at verifying that a delivery was done or even attempted.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:A">When I say they solved it, I mean that Amazon delivery drivers actually try to deliver the package maybe 95% of the time to the apartment buildings I've lived in, vs. about 25% for UPS and Fedex and much lower for USPS and Canada Post, if we're talking about big packages and not letters. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:M"><p>Very fittingly for this post, I saw an external discussion on this exact thing where someone commented that it must've been quite expensive for the company to switch to the new system due to its known inefficiencies.</p> <p>In true cocktail party efficient markets hypothesis form, an internet commenter replied that the company wouldn't have done it if it was inefficient and therefore it must not have been as inefficient as the first commenter thought.</p> <p>I suspect I spent more time looking at software <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> than anyone else at the company and the system under discussion was notable for having one of the largest increases in cost of any system at the company without a concomitant increase in load. Unfortunately, the assumption that competition results in good <a href="https://twitter.com/danluu/status/1122617702080581632">internal decisions</a> is just as false as the assumption that competition results in good external decisions.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:J">Note that if you click the link but don't click through to the main article, the person defending Kyle made the original quote seem more benign than it really is out of politeness because he elided the bit where the former Redis developer advocate (now <a href="https://twitter.com/sunshowers6/status/1568660532999372801">&quot;VP of community&quot;</a> <a href="https://news.ycombinator.com/item?id=31479020">for Zig</a>) said that Jepsen is &quot;ultimately not that different from other tech companies, and thus well deserving of boogers and cum&quot;. <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> </ol> </div> Misidentifying talent talent/ Mon, 21 Feb 2022 00:00:00 +0000 talent/ <p><details open> <summary>[Click to collapse / expand section on sports]</summary> Here are some notes from talent scouts:</p> <ul> <li>Recruit A: <ul> <li>... will be a real specimen with chance to have a Dave Parker body. Facially looks like Leon Wagner. Good body flexibility. Very large hands.</li> </ul></li> <li>Recruit B: <ul> <li>Outstanding physical specimen – big athletic frame with broad shoulders and long, solid arms and leg. Good bounce in his step and above avg body control. 
Good strong face.</li> </ul></li> <li>Recruit C: <ul> <li>Hi butt, longish arms &amp; legs, leanish torso, young colt</li> <li>[different scout]: Wiry loose good agility with good face</li> <li>[another scout]: Athletic looking body, loose, rangy, slightly bow legged.</li> </ul></li> </ul> <p>Out of context, you might think they were scouting actors or models, but these are baseball players (&quot;A&quot; is Lloyd Moseby, &quot;B&quot; is Jim Abbott, and &quot;C&quot; is Derek Jeter), ones that were quite good (Lloyd Moseby was arguably only a very good player for perhaps four years, but that makes him extraordinary compared to most players who are scouted). If you read other baseball scouting reports, you'll see a lot of comments about how someone has a &quot;good face&quot;, who they look like, what their butt looks like, etc.</p> <p>Basically everyone wants to hire talented folks. But even in baseball, where returns to hiring talent are obvious and high and which is the most easily quantified major U.S. sport, people made fairly obvious blunders for a century due to relying on incorrectly honed gut feelings that relied heavily on unconscious as well as conscious biases. Later, we'll look at what baseball hiring means for other fields, but first, let's look at how players who didn't really pan out ended up with similar scouting reports (programmers who don't care about sports can think of this as equivalent to interview feedback) as future superstars, such as the following comments on <a href="https://www.baseball-reference.com/players/e/eatonad01.shtml">Adam Eaton</a>, who was a poor player by pro standards despite being considered one of the hottest prospects (potential hires) of his generation:</p> <ul> <li>Scout 1: Medium frame/compact/firm. A very good athlete / shows quick &quot;cat-like&quot; reactions. Excellent overall body strength. Medium hands / medium length arms / w strong forearms ... Player is a tough competitor. This guy has some old fashioned bull-dog in his make-up.</li> <li>Scout 2: Good body with frame to develop. Long arms and big hands. Narrow face. Has sideburns and wears hat military style. Slope shoulders. Strong inlegs ... Also played basketball. Good athlete .... Attitude is excellent. Can't see him breaking down. One of the top HS pitchers in the country</li> <li>Scout 3: 6'1&quot;-6'2&quot; 180 solid upper and lower half. Room to pack another 15 without hurting</li> </ul> <p>On the flip side, scouts would also pan players who would later turn out to be great based on their physical appearance, such as these scouts who were concerned about Albert Pujols's weight:</p> <ul> <li>Scout 1: Heavy, bulky body. Extra (weight) on lower half. Future (weight) problem. Aggressive hitter with mistake HR power. Tends to be a hacker.</li> <li>Scout 2: Good bat (speed) with very strong hands. Competes well and battles at the plate. Contact seems fair. Swing gets a little long at times. Will over pull. He did not hit the ball hard to CF or RF. Weight will become an issue in time.</li> </ul> <p>Pujols ended up becoming one of the best baseball players of all time (<a href="https://web.archive.org/web/20220209005630/https://www.baseball-reference.com/leaders/WAR_career.shtml">currently ranked 32nd by WAR</a>). His weight wasn't a problem, but if you read scouting reports on other great players who were heavy or short, they were frequently underrated. 
Of course, baseball scouting reports didn't only look at people's appearances, but scouts were generally highly biased by what they thought an athlete should look like.</p> <p>Because using stats in baseball has &quot;won&quot; (top teams all employ stables of statisticians nowadays) and &quot;old school&quot; folks don't want to admit this, we often see people saying that using stats doesn't really result in different outcomes than we used to get. But this is so untrue that the examples people give are generally <a href="https://www.patreon.com/posts/62223485">self-refuting</a>. For example, here's what Sports Illustrated had to say on the matter:</p> <blockquote> <p>Media and Internet draft prognosticators love to play up the “scrappy little battler” aspect with Madrigal, claiming that modern sabermetrics helps scouts include smaller players that were earlier overlooked. Of course, that is hogwash. A players [sic] abilities dictate his appeal to scouts—not height or bulk—and smaller, shorter players have always been a staple of baseball-from Mel Ott to Joe Morgan to Kirby Puckett to Jose Altuve.</p> </blockquote> <p>These are curious examples to use in support of scouting since Kirby Puckett was famously overlooked by scouts despite putting up statistically dominant performances and was only able to become a baseball player through random happenstance, when the assistant director of the Twins <a href="https://en.wikipedia.org/wiki/Farm_team">farm system</a> went to watch his own son play in a baseball game and saw Kirby Puckett in the same game, which led to the Twins drafting Kirby Puckett, who carried the franchise for a decade.</p> <p>Joe Morgan was also famously overlooked and only managed to become a professional baseball player through random happenstance. Morgan put up statistically dominant numbers in high school, but was ignored due to his height. Because he wasn't drafted by a pro team, he went to Oakland City College, where he once again put up great numbers that were ignored. The reason a team noticed him was a combination of two coincidences. First, a new baseball team was created and that new team needed to fill a roster and the associated farm system, which meant that they needed a lot of players. Second, that new baseball team needed to hire scouts and hired Bill Wight (who wasn't previously working as a scout) as a scout. Wight became known for not having the same appearance bias as nearly every other scout and was made fun of for signing &quot;funny looking&quot; baseball players. Bill convinced the new baseball team to &quot;hire&quot; quite a few overlooked players, including Joe Morgan.</p> <p>Mel Ott was also famously overlooked and only managed to become a professional baseball player through happenstance. He was so dominant in high school that he played for adult semi-pro teams in his spare time. However, when he graduated, pro baseball teams didn't want him because he was too small, so he took a job at a lumber company and played for the company team. The owner of the lumber company was impressed by his baseball skills and, luckily for Ott, was business partners and friends with the owner of a baseball team and effectively got Ott a position on a pro baseball team, resulting in the <a href="https://web.archive.org/web/20220208221047/https://www.baseball-reference.com/players/o/ottme01.shtml">20th best baseball career of all time as ranked by WAR</a><sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">1</a></sup>. 
Most short baseball players probably didn't get a random lucky break; for every one who did, there are likely many who didn't. If we look at how many nearly-ignored-but-lucky players put up numbers that made them all-time greats, it seems likely that the vast majority of the potentially greatest players of all time who played amateur or semi-pro baseball were ignored and did not play professional baseball (if this seems implausible, when reading the upcoming sections on chess, go, and shogi, consider what would happen if you removed all of the players who don't look like they should be great based on what people think makes someone cognitively skilled at major tech companies, and then look at what fraction of all-time-greats remain).</details></p> <p>Deciding who to &quot;hire&quot; for a baseball team was a high stakes decision with many millions of dollars (in 2022 dollars) on the line, but rather than attempt to seriously quantify productivity, teams decided who to draft (hire) based on all sorts of irrelevant factors. As in any major sport, baseball productivity is much easier to quantify than productivity in most real-world endeavors since the game is much simpler than &quot;real&quot; problems are. And, among major U.S. sports, baseball is the easiest sport to quantify, but <a href="tech-discrimination/">this didn't stop baseball teams from spending a century overindexing on visually obvious criteria such as height and race</a>.</p> <p>I was reminded of this the other day when I saw a thread on Twitter where a very successful person talks about how they got started, saying that they were able to talk their way into an elite institution despite being unqualified, and uses this story to conclude that elite gatekeepers are basically just scouting for talent and that you just need to show people that you have talent:</p> <blockquote> <p>One college related example from my life is that I managed to get into CMU with awful grades and awful SAT scores (I had the flu when I took the test :/)</p> <p>I spent a month learning everything about CMU's CS department, then drove there and talked to professors directly when I first showed up at the campus the entrance office asked my GPA and SAT, then asked me to leave. But I managed to talk to one professor, who sent me to their boss, recursively till I was talking to the vice president of the school he asked me why I'm good enough to go to CMU and I said &quot;I'm not sure I am. All these other kids are really smart. I can leave now&quot; and he interrupted me and reminded me how much agency it took to get into that room.</p> <p>He gave me a handwritten acceptance letter on the spot ... I think one secret, at least when it comes to gatekeepers, is that they're usually just looking for high agency and talent.</p> </blockquote> <p>I've heard this kind of story from other successful people, who tend to come to bimodal conclusions on what it all means. Some conclude that the world correctly recognized their talent and that this is how the world works; talent gets recognized and rewarded. 
Others conclude that the world is fairly random with respect to talent being rewarded and that they got lucky to get rewarded for their talent when many other people with similar talents who used similar strategies were passed over<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>.</p> <p>Another time I was reminded of old baseball scouting reports was when I heard about how a friend of mine who's now an engineering professor at a top Canadian university got there. Let's call her Jane. When Jane was an undergrad at the university she's now a professor at, she was sometimes helpfully asked &quot;are you lost?&quot; when she was on campus. Sometimes this was because, as a woman, she didn't look like she was in the right place when she was in an engineering building. Other times, it was because she looked like and talked like someone from rural Canada. Once, a security guard thought she was a homeless person who had wandered onto campus. After a few years, she picked up the right clothes and mannerisms to pass as &quot;the right kind of person&quot;, with help from her college friends, who explained to her how one is supposed to talk and dress, but when she was younger, people's first impression was that she was an admin assistant, and now their first impression is that she's a professor's wife because they don't expect a woman to be a professor in her department. She's been fairly successful, but it's taken a lot more work than it would've for someone who looked the part.</p> <p>On whether or not, in her case, her gatekeepers were just looking for agency and talent, she once failed a civil engineering exam because she'd never heard of a &quot;corn dog&quot; and also barely passed an intro programming class she took where the professor announced that anyone who didn't already know how to program was going to fail.</p> <p>The corn dog exam failure was because there was a question on a civil engineering exam where students were supposed to design a corn dog dispenser. My friend had never heard of a corn dog and asked the professor what a corn dog was. The professor didn't believe that she didn't know what a corn dog was and berated her in front of the entire class for asking a question that clearly couldn't be serious. Not knowing what a corn dog was, she designed something that put corn inside a hot dog and dispensed a hot dog with corn inside, which failed because that's not what a corn dog is.</p> <p>It turns out the gatekeepers for civil engineering and programming were not, in fact, just looking for agency and were instead looking for someone who came from the right background. 
I suspect this is not so different from the CMU professor who admitted a promising student on the spot; it just happens that a lot of people pattern match &quot;smart teenage boy with a story about why their grades and SAT scores are bad&quot; to &quot;promising potential prodigy&quot; and &quot;girl from rural Canada with the top grade in her high school class who hasn't really used a computer before and dresses like a poor person from rural Canada because she's paying for college while raising her younger brother because their parents basically abandoned both of them&quot; to &quot;homeless person who doesn't belong in engineering&quot;.</p> <p>Another thing that reminded me of how funny baseball scouting reports are is a conversation I had with Ben Kuhn a while back.</p> <p><strong>Me</strong>: it's weird how tall so many of the men at my level (senior staff engineer) are at big tech companies. In recent memory, I think I've only been in a meeting with one man who's shorter than me at that level or above. I'm only 1&quot; shorter than U.S. average! And the guy who's shorter than me has worked remotely for at least a decade, so I don't know if people really register his height. And people seem to be even taller on the management track. If I look at the VPs I've been in meetings with, they must all be at least 6' tall.<br> <strong>Ben</strong>: Maybe I could be a VP at a big tech company. I'm 6' tall!<br> <strong>Me</strong>: Oh, I guess I didn't know how tall 6' tall is. The VPs I'm in meetings with are noticeably taller than you. They're probably at least 6'2&quot;?<br> <strong>Ben</strong>: Wow, that's really tall for a minimum. 6'2&quot; is 96%-ile for U.S. adult male<br></p> <p>When I've discussed this with successful people who work in big companies of various sorts (tech companies, consulting companies, etc.), men who would be considered tall by normal standards, 6' or 6'1&quot;, tell me that they're frequently the shortest man in the room during important meetings. 6'1&quot; is just below the median height of a baseball player. There's something a bit odd about height seeming more correlated to success as a consultant or a programmer than in baseball, where height directly conveys an advantage. One possible explanation would be a <a href="https://en.wikipedia.org/wiki/Halo_effect">halo effect</a>, where positive associations about tall or authoritative seeming people contribute to their success.</p> <p>When I've seen this discussed online, someone will point out that this is because height and cognitive performance are correlated. But if we look at the literature on IQ, the correlation isn't strong enough to explain something like this. We can also observe this if we look at fields where people's mental acuity is directly tested by something other than an IQ test, such as in chess, where most top players are around average height, with some outliers in both directions. Even without looking at the data in detail, this should be expected because the correlation between height and IQ is weak, with much of the correlation due to the relationship at the low end<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">3</a></sup>, and the correlation between IQ and performance in various mental tasks is also weak (some people will say that it's strong by social science standards, <a href="percentile-latency/">but that's very weak in terms of actual explanatory power even when looking at the population level</a> and it's even weaker at the individual level). 
<a href="https://artscimedia.case.edu/wp-content/uploads/sites/141/2016/12/22143817/Burgoyne-Sala-Gobet-Macnamara-Campitelli-Hambrick-2016.pdf">And then if we look at chess in particular, we can see that the correlation is weak, as expected</a>.</p> <p>Since the correlation is weak, and there are many more people around average height than not, we should expect that most top chess players are around average height. <a href="https://www.chess.com/forum/view/chess-players/do-chess-players-tend-to-be-short">If we look at the most dominant chess players in recent history, Carlsen, Anand, and Kasparov</a>, they're 5'8&quot;, 5'8&quot;, and 5'9&quot;, respectively (if you look at different sources, they'll claim heights of plus or minus a couple inches, but still with a pretty normal range; people often exaggerate heights; if you look at people who try to do real comparisons either via photos or in person, measured heights are often lower than what people claim their own height is<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">4</a></sup>).</p> <p>It's a bit more difficult to find heights of go and shogi players, but it seems like the absolute <a href="https://gameofgo.app/learn/top-ten-go-players">top modern players from this list</a> I could find heights for (Lee Sedol, Yoshiharu Habu) are roughly in the normal range, with there being some outliers in both directions among elite players who aren't among the best of all time, as with chess.</p> <p>If it were the case that height or other factors in appearance were very strongly correlated with mental performance, we would expect to see a much stronger correlation between height and performance in activities that relatively directly measure mental performance, like chess, than we do between height and career success, but it's the other way around, which seems to indicate that the halo effect from height is stronger than any underlying benefits that are correlated with height.</p> <p>If we look at activities where there's a fair amount of gatekeeping before people are allowed to really show their skills but where performance can be measured fairly accurately and where hiring better employees has an immediate, measurable, direct impact on company performance, such as <a href="tech-discrimination/">baseball and hockey, we can see that people went with their gut instinct over data for decades after there were public discussions about how data-driven approaches found large holes in people's intuition</a>.</p> <p>If we then look at programming, where it's somewhere between extremely difficult and impossible to accurately measure individual performance and the impact of individual performance on company success is much less direct than in sports, what should our estimate of how accurate talent assessment be?</p> <p>The pessimistic view is that it seems implausible that we should expect that talent assessment is better than in sports, where it took decades of there being fairly accurate and rigorous public write-ups of performance assessments for companies to take talent assessment seriously. With programming, talent assessment isn't even far enough along that anyone can write up accurate evaluations of people across the industry, so we haven't even started the decades long process of companies fighting to keep evaluating people based on personal opinions instead of accurate measurements.</p> <p>Jobs have something equivalent to old school baseball scouting reports at multiple levels. 
At the hiring stage, there are multiple levels of filters that encode people's biases. A classic study on this is <a href="https://www.nber.org/system/files/working_papers/w9873/w9873.pdf">Bertrand and Mullainathan</a>'s paper, which found that &quot;white sounding&quot; names on resumes got more callbacks for interviews than &quot;black sounding&quot; names and that having a &quot;white sounding&quot; name on the resume increased the returns to having better credentials on the resume. Since then, many variants of this study have been done, e.g., <a href="https://www.utpjournals.press/doi/pdf/10.3138/cpp.2017-033">resumes with white sounding names do better than resumes with Asian sounding names</a>, <a href="https://www.researchgate.net/profile/Juan-Madera-2/publication/266589756_The_Problem_for_Black_Professors_Judged_Before_Met/links/54381fea0cf24a6ddb922d57/The-Problem-for-Black-Professors-Judged-Before-Met.pdf">professors with white sounding names are evaluated, based on their CVs, as having better interpersonal skills than professors with black and Asian sounding names</a>, etc.</p> <p>The literature on promotions and leveling is much weaker, but I and other folks who are in highly selected environments that effectively require multiple rounds of screening, each against more and more highly selected folks, such as VPs, senior (as in &quot;senior staff&quot;+) ICs, professors at elite universities, etc., have observed that filtering on height is as severe as or more severe than in baseball but less severe than in basketball.</p> <p>That's curious when, in mental endeavors where the &quot;promotion&quot; criteria are directly determined by performance, such as in chess, height appears to only be very weakly correlated with success. A major issue in the literature on this is that, in general, social scientists look at averages. In a lot of the studies, they simply produce a correlation coefficient. If you're lucky, they may produce a graph where, for each height, they produce an average of something or other. That's the simplest thing to do but this only provides a very coarse understanding of what's going on.</p> <p>Because I like knowing what makes things tick, including organizations and people's opinions, I've (informally, verbally) polled a lot of engineers about what they thought about other engineers. What I found was that there was a lot of clustering of opinions, resulting in clusters of folks that had rough agreement about who did excellent work. Within each cluster, people would often disagree about the ranking of engineers, but they would generally agree on who was &quot;good to excellent&quot;.</p> <p>One cluster was (in my opinion; this could, of course, also just be my own biases) people who were looking at the output people produced and were judging people based on that. Another cluster was of people who were looking at some combination of height and confidence and were judging people based on that. This one was a mystery to me for a long time (I've been asking people questions like this and collating the data out of habit, long before I had the idea to write this post and, until I recognized the pattern, I found it odd that so many people who have good technical judgment, as evidenced by their ability to do good work and make comments showing good technical judgment, rated so many people highly who so frequently said blatantly incorrect things and produced poorly working or even non-working systems). 
Another cluster was around credentials, such as what school someone went to or what the person was leveled at or what prestigious companies they'd worked for. People could have judgment from multiple clusters, e.g., some folks would praise both people who did excellent technical work and people who are tall and confident. At higher levels, where it becomes more difficult to judge people's work, relatively fewer people based their judgment on people's output.</p> <p>When I did this evaluation collating exercise at the startup I worked at, there was basically only one cluster and it was based on people's output, with fairly broad consensus about who the top engineers were, but I haven't seen that at any of the large companies I've worked for. I'm not going to say that means evaluation at that startup was fair (perhaps all of us were falling prey to the same biases), but at least we weren't falling prey to the most obvious biases.</p> <p>Back to big companies: if we look at what it would take to reform the promotion system, it seems difficult to do because many individual engineers are biased. Some companies have committees handle promotions in order to reduce bias, but the major inputs to the system still have strong biases. The committee uses, as input, recommendations from people, many of whom let those biases have more weight than their technical judgment. Even if we, hypothetically, introduced a system that identified whose judgments were highly correlated with factors that aren't directly relevant to performance and gave those recommendations no weight, people's opinions often limit the work that someone can do. A complaint I've heard from some folks who are junior is that they can't get promoted because their work doesn't fulfill promo criteria. When they ask to be allowed to do work that could get them promoted, they're told they're too junior to do that kind of work. They're generally stuck at their level until they find a manager who believes in their potential enough to give them work that could possibly result in a promo if they did a good job. Another factor that interacts with this is that it's easier to transfer to a team where high-impact work is available if you're doing well and/or have high &quot;promo velocity&quot;, i.e., are getting promoted frequently, and harder if you're doing poorly or even just have low promo velocity and aren't doing particularly poorly. At higher levels, it's uncommon to not be able to do high-impact work, but it's also very difficult to separate out the impact of individual performance and biases because a lot of performance is about who you can influence, which is going to involve trying to influence people who are biased if you need to do it at scale, which you generally do to get promoted at higher levels. The nested, multi-level impact of bias makes it difficult to change the system in a way that would remove the impact of bias.</p> <p>Although it's easy to be pessimistic when looking at the system as a whole, it's also easy to be optimistic when looking at what one can do as an individual. It's pretty easy to do what Bill Wight (the scout known for recommending &quot;funny looking&quot; baseball players) did and ignore what other people incorrectly think is important<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">5</a></sup>. <a href="culture/#appendix-centaur-s-hiring-process">I worked for a company that did this which had, by far, the best engineering team of any company I've ever worked for</a>. 
They did this by ignoring the criteria other companies cared about, e.g., <a href="https://twitter.com/danluu/status/1425514112642080773">hiring people from non-elite schools instead of focusing on pedigree</a>, <a href="algorithms-interviews/">not ruling people out for not having practiced solving abstract problems on a whiteboard that people don't solve in practice at work</a>, <a href="https://twitter.com/danluu/status/1447322487583244289">not having cultural fit criteria that weren't related to job performance</a> (they did care that people were self-directed and would function effectively when given a high degree of independence), etc.<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">6</a></sup></p> <p>Thanks to <font size=+1><b><a rel="sponsored" href ="https://www.reforge.com/all-programs?utm_source=danluu&utm_medium=referral&utm_campaign=spring22_newsletter_test&utm_term=&utm_content=engineering">Reforge - Engineering Programs</a></b></font> and <font size=+1><b><a rel="sponsored" href ="https://flatironsdevelopment.com/">Flatirons Development</a></b></font> for helping to make this post possible by <a href="https://patreon.com/danluu">sponsoring me at the Major Sponsor tier</a>.</p> <p><i>Also, thanks to Peter Bhat Harkins, Yossi Kreinin, Pam Wolf, Laurie Tratt, Leah Hanson, Kate Meyer, Heath Borders, Leo T M, Valentin Hartmann, Sam El-Borai, Vaibhav Sagar, Nat Welch, Michael Malis, Ori Berstein, Sophia Wisdom, and Malte Skarupke for comments/corrections/discussion.</i></p> <h3 id="appendix-other-factors">Appendix: other factors</h3> <p>This post used height as a running example because it's something that's both easily observed to be correlated with success in men and studied across a number of fields. I would guess that social class markers / <a href="https://twitter.com/danluu/status/1480616193761288194">mannerisms</a>, as in the Jane example from this post, have at least as much impact. For example, a number of people have pointed out to me that the tall, successful people they're surrounded by say things with very high confidence (often incorrect things, but said confidently) and also have mannerisms that convey confidence and authority.</p> <p>Other physical factors also seem to have a large impact. There's a fairly large literature on how much the halo effect causes people who are generally attractive to be rated more highly on a variety of dimensions, e.g., <a href="https://www.gwern.net/docs/sociology/2021-klebl.pdf">morality</a>. There's a famous ask metafilter (reddit before there was reddit) answer to a question that's something like &quot;how can you tell someone is bad?&quot; and the most favorited answer (I hope for ironic reasons, although the answerer seemed genuine) is that they have bad teeth. Of course, <a href="https://twitter.com/danluu/status/1477001184904814593">in the U.S., having bad teeth is a marker of childhood financial poverty</a>, not impoverished moral character. And, of course, gender is another dimension that people appear to filter on for reasons unrelated to talent or competence.</p> <p>Another is just random luck. To go back to the baseball example, one of the few negative scouting reports on Chipper Jones came from a scout who said</p> <blockquote> <p>Was not aggressive w/bat. Did not drive ball from either side. Displayed non-chalant attitude at all times. He was a disappointment to me. In the 8 games he managed to collect only 1 hit and hit very few balls well. Showed slap-type swing from L.side . . . 2 av. 
tools</p> </blockquote> <p>Another scout, who saw him on more typical days, correctly noted</p> <blockquote> <p>Definite ML prospect . . . ML tools or better in all areas . . . due to outstanding instincts, ability, and knowledge of game. Superstar potential.</p> </blockquote> <p>Another similarly noted:</p> <blockquote> <p>This boy has all the tools. Has good power and good basic approach at the plate with bat speed. Excellent make up and work-habits. Best prospect in Florida in the past 7 years I have been scouting . . . This boy must be considered for our [1st round draft] pick. Does everything well and with ease.</p> </blockquote> <p>There's a lot of variance in performance. If you judge performance by watching someone for a short period of time, you're going to get wildly different judgments depending on when you watch them.</p> <h3 id="appendix-related-discussions">Appendix: related discussions</h3> <p>If you read the blind orchestra audition study that everybody cites, the study itself seems low quality and unconvincing, but it also seems true that blind auditions were concomitant with an increase in orchestras hiring people who didn't look like what people expected musicians to look like. Blind auditions, where possible, seem like something good to try.</p> <p>As noted previously, a professor remarked that doing hiring over zoom accidentally made height much less noticeable than normal and resulted in at least one university department hiring a number of professors who are markedly less tall than professors who were previously hired.</p> <p>Me on <a href="algorithms-interviews/">how tech interviews don't even act as an effective filter for the main thing they nominally filter for</a>.</p> <p>Me on <a href="programmer-moneyball/">how prestige-focused tech hiring is</a>.</p> <p>@ArtiKel on <a href="https://nintil.com/talent-review">Cowen and Gross's book on talent</a> and on <a href="https://nintil.com/categories/fund-people-not-projects/">funding people over projects</a>. A question I've had for a long time is whether the less-mainstream programs that convey prestige via some kind of talent selection process (Thiel Fellowship, grants from folks like Tyler Cowen, Patrick Collison, Scott Alexander, etc.) are less biased than traditional selection processes or just differently biased. The book doesn't appear to really answer this question, but it's food for thought. And BTW, I view these alternative processes as highly valuable even if they're not better and, actually, even if they're somewhat worse, because their existence gives the world a wider portfolio of options for talent spotting. 
But, even so, I would like to know if the alternative processes are better than traditional processes.</p> <p>Alexey Guzey on <a href="https://guzey.com/where-does-talent-come-from/">where talent comes from</a>.</p> <p><a href="https://marginalrevolution.com/marginalrevolution/2020/02/an-anonymous-reader-on-talent-misallocation-and-bureaucratization.html">An anonymous person on talent misallocation</a>.</p> <p><a href="https://sockpuppet.org/blog/2015/03/06/the-hiring-post/">Thomas Ptacek on actually attempting to look at relevant signals when hiring in tech</a>.</p> <p><a href="https://www.patreon.com/posts/62933244">Me on the use of sleight of hand in an analogy meant to explain the importance of IQ and talent, where the sleight of hand is designed to make it seem like IQ is more important than it actually is</a>.</p> <p><a href="https://web.archive.org/web/20140831182249/https://newrepublic.com/article/119239/transgender-people-can-explain-why-women-dont-advance-work">Jessica Nordell on trans experiences demonstrating differences between how men and women are treated</a>.</p> <p><a href="https://amzn.to/3I9Sqg1">The Moneyball book</a>, of course. Although, for the real nerdy details, I'd recommend reading the old <a href="https://baseballthinkfactory.org/">baseballthinkfactory archives</a> from back when the site was called &quot;baseball primer&quot;. Fans were, in real time, calling out who would be successful, generally with greater success than the baseball teams of the era. The site died off as baseball teams started taking stats seriously, leaving fan analysis in the dust since teams have access to both much better fine-grained data and more time to spend on serious analysis than hobbyists, but it was interesting to watch hobbyists completely dominate the profession using basic data analysis techniques.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:J">Jose Altuve comes from the modern era of statistics-driven decision making and therefore cannot be a counterexample. <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> <li id="fn:B">There's a similar bimodal split when I see discussions among people who are on the other side of the table and choose who gets to join an elite institution vs. not. Some people are utterly convinced that their judgment is basically perfect (&quot;I just know&quot;, etc.), and some people think that making judgment calls on people is a noisy process and you, at best, get weak signal. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:Q"><p>Estimates range from 0 to 0.3, with Teasdale et al. finding that the correlation decreased over time (speculated to be due to better nutrition) and Teasdale et al. finding that the correlation was significantly stronger than on average in the bottom tail (bottom 2% of height) and significantly weaker than on average at the top tail (top 2% of height), indicating that much of the overall correlation comes from factors that cause both reduced height and reduced IQ.</p> <p>In general, for a correlation coefficient of <code>x</code>, it will explain <code>x^2</code> of the variance. 
So even if the correlation were not weaker at the high end and we had a correlation coefficient of <code>0.3</code>, that would only explain <code>0.3^2 = 0.09</code> of the variance, i.e., <code>1 - 0.09 = 0.91</code> of the variance would be explained by other factors.</p> <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:O">When I did online dating, I frequently had people tell me that I must be taller than I am because they're so used to other people lying about their heights on dating profiles that they associated my height with a larger number than the real number. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:F"><p>On the other side of the table, what one can do when being assessed, I've noticed that, at work, unless people are familiar with my work, they generally ignore me in group interactions, like meetings. Historically, things that have worked for me and gotten people to stop ignoring me were doing an unreasonably large amount of high-impact work in a short period of time (<a href="productivity-velocity/">while not working long hours</a>), often solving a problem that people thought was impossible to solve in the timeframe, which made it very difficult for people to not notice my work; another was having a person who appears more authoritative than me get the attention of the room and ask people to listen to me; and also finding groups (teams or orgs) that care more about the idea than the source of the idea. More recently, some things that have worked are writing this blog and using mediums where a lot of the cues that people use as proxies for competence aren't there (slack, and to a lesser extent, video calls).</p> <p>In some cases, the pandemic has accidentally caused this to happen in some dimensions. For example, a friend of mine mentioned to me that their university department did video interviews during the pandemic and, for the first time, hired a number of professors who weren't strikingly tall.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:C"><p>When at a company that has biases in hiring and promo, it's still possible to go scouting for talent in a way that's independent of the company's normal criteria. One method that's worked well for me is to hire interns, since the hiring criteria for interns tends to be less strict. Once someone is hired as an intern, if their work is great and you know how to sell it, it's easy to get them hired full-time.</p> <p>For example, at Twitter, I hired two interns to my team. One, as an intern, <a href="cgroup-throttling/">wrote the kernel patch that solved the container throttling problem</a> (at the margin, worth hundreds of millions of dollars a year) and has gone on to do great, high-impact, work as a full-time employee. The other, as an intern, built out across-the-fleet profiling, a problem many full-time staff+ engineers had wanted to solve but that no one had solved, and is joining Twitter as a full-time employee this fall. In both cases, the person was overlooked by other companies for silly reasons. In the former case, there was a funny combination of reasons other companies weren't interested in hiring them for a job that utilized their skillset, including location / time zone (Australia). From talking to them, they clearly had deep knowledge about computer performance that would be very rare even in an engineer with a decade of &quot;systems&quot; experience. 
There were jobs available to them in Australia, but teams doing performance work at the other big tech companies weren't really interested in taking on an intern in Australia. For the kind of expertise this person had, I was happy to shift my schedule a bit later for a while until they ramped up, and it turned out that they were highly independent and didn't really need guidance to ramp up (we talked a bit about problems they could work on, including the aforementioned container throttling problem, and then they came back with some proposed approaches to solve the problem and then solved the problem). In the latter case, they were a student who was very early in their university studies. The most desirable employers often want students who have more classwork under their belt, so we were able to hire them without much competition. Waiting until a student has a lot of classes under their belt might be a good strategy on average, but this particular intern candidate had written some code that was good for someone with that level of experience and they'd shown a lot of initiative (they reverse engineered the server protocol for a dying game in order to reimplement a server so that they could fix issues that were killing the game), which is a much stronger positive signal than you'll get out of interviewing almost any 3rd year student who's looking for an internship.</p> <p>Of course, you can't always get signal on a valuable skill, but if you're actively scouting for people, you don't need to always get signal. If you occasionally get a reliable signal and can hire people who you have good signal on who are underrated, that's still valuable! For Twitter, in three intern seasons, I hired two interns, the first of whom already made &quot;staff&quot; and the second of whom should get there very quickly based on their skills as well as the impact of their work. In terms of ROI, spending maybe 30 hours a year on the lookout for folks who had very obvious signals indicating they were likely to be highly effective was one of the most valuable things I did for the company. The ROI would go way down if the industry as a whole ever started using effective signals when hiring but, for the reasons discussed in the body of this post, I expect progress to be slow enough that we don't really see the amount of change that would make this kind of work low ROI in my lifetime.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> </ol> </div> A decade of major cache incidents at Twitter cache-incidents/ Wed, 02 Feb 2022 00:00:00 +0000 cache-incidents/ <p><i>This was co-authored with Yao Yue</i></p> <p>This is a collection of information on severe (<code>SEV-0</code> or <code>SEV-1</code>, the most severe incident classifications) incidents at Twitter that were at least partially attributed to cache from the time Twitter started using its current incident tracking JIRA (2012) to date (2022), with one bonus incident from before 2012. Not including the bonus incident, there were 6 <code>SEV-0</code>s and 6 <code>SEV-1</code>s that were at least partially attributed to cache in the incident tracker, along with 38 less severe incidents that aren't discussed in this post.</p> <p>There are a couple reasons we want to write this down. First, historical knowledge about what happens at tech companies is lost at a fairly high rate and we think it's nice to preserve some of it. 
Second, we think it can be useful to look at incidents and reliability from a specific angle, putting all of the information into one place, because that can sometimes make some patterns very obvious.</p> <p>On knowledge loss, when we've seen viral Twitter threads or other viral stories about what happened at some tech company and looked into what happened, the most widely spread stories are usually quite wrong, generally for banal reasons. One reason is that outrageously exaggerated stories are more likely to go viral, so those are the ones that tend to be remembered. Another is that <a href="https://twitter.com/copyconstruct/status/1353487786163097601">there's a cottage industry of former directors / VPs who tell self-aggrandizing stories about all the great things they did that, to put it mildly, frequently distort the truth</a> (although there's nothing stopping ICs from doing this, the most spread false stories we see tend to come from people on the management track). In both cases, there's a kind of <a href="https://en.wikipedia.org/wiki/Gresham%27s_law">Gresham's law of stories in play</a>, where incorrect stories tend to win out over correct stories.</p> <p>And even when making a genuine attempt to try to understand what happened, it turns out that knowledge is lost fairly quickly. For this and other incident analysis projects we've done, links to documents and tickets from the past few years tend to work (90%+ chance), but older links are less likely to work, with the rate getting pretty close to 0% by the time we're looking at things from 2012. Sometimes, people have things squirreled away in locked down documents, emails, etc. but those will often link to things that are now completely dead, and figuring out what happened requires talking to a bunch of people who will, <a href="https://www.pnas.org/content/114/30/7758">due to the nature of human memory, give you inconsistent stories that you need to piece together</a><sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup>.</p> <p>On looking at things from a specific angle, while <a href="postmortem-lessons/">looking at failures broadly and classifying and collating all failures is useful</a>, it's also useful to drill down into certain classes of failures. For example, Rebecca Isaacs and Dan Luu did an (internal, non-public) analysis of Twitter failover tests (from 2018 to 2020), which found a number of things that led to operational changes. In some sense, there was no new information in the analysis since the information we got all came from various documents that already existed, but putting it all into one place made a number of patterns obvious that weren't obvious when looking at incidents one at a time across multiple years.</p> <p>This document shouldn't cause any changes at Twitter since looking at what patterns exist in cache incidents over time and what should be done about that has already been done, but collecting these into one place may still be useful to people outside of Twitter.</p> <p>As for why we might want to look at cache failures (as opposed to failures in other systems), cache is relatively commonly implicated in major failures, as illustrated by this comment Yao made during an internal Twitter War Stories session (referring to the dark ages of Twitter, in operational terms):</p> <blockquote> <p>Every single incident so far has at least mentioned cache. 
In fact, for a long time, cache was probably the #1 source of bringing the site down for a while.</p> <p>In my first six months, every time I restarted a cache server, it was a <code>SEV-0</code> by today's standards. On a good day, you might have 95% Success Rate (SR) [for external requests to the site] if I restarted one cache ...</p> </blockquote> <p>Also, the vast majority of Twitter cache is (a fork of) memcached<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">2</a></sup>, which is widely used elsewhere, making the knowledge more generally applicable than if we discussed a fully custom Twitter system.</p> <p>More generally, caches are a nice source of relatively clean real-world examples of common distributed systems failure modes because of how simple caches are. Conceptually, a cache server is a high-throughput, low-latency RPC server plus a library that manages data, such as memory and/or disk and key-value indices. For in-memory caches, the data management side should be able to easily outpace the RPC side (a naive in-memory key-value library should be able to hit millions of QPS per core, whereas a naive RPC server that doesn't use userspace networking, batching and/or pipelining, etc. will have problems getting to 1/10th that level of performance). Because of the simplicity of everything outside of the RPC stack, cache can be thought of as an approximation of nearly pure RPC workloads, which are frequently important in heavily service-oriented architectures.</p> <p>When scale and performance are concerns, cache will frequently use sharded clusters, which then subject cache to the constraints and pitfalls of distributed systems (but with less emphasis on synchronization issues than with some other workloads, such as strongly consistent distributed databases, due to the emphasis on performance). Also, by the nature of distributed systems, users of cache will be exposed to these failure modes and be vulnerable to or possibly implicated in failures caused by the cascading impact of some kinds of distributed systems failures.</p> <p>Cache failure modes are also interesting because, when cache is used to serve a significant fraction of requests or fraction of data, cache outages or even degradation can easily cause a total outage because an architecture designed with cache performance in mind will not (and should not) have backing DB store performance that's sufficient to keep the site up.</p> <p>Compared to most workloads, cache is more sensitive to performance anomalies below it in the stack (e.g., kernel, firmware, hardware, etc.) because it tends to have relatively high-volume and low-latency SLOs (because the point of cache is that it's fast) and it spends (barring things like userspace networking) a lot of time in the kernel (~80% as a ballpark for Twitter memcached running normal kernel networking). Also, because cache servers often run a small number of threads, cache is relatively sensitive to being starved by other workloads sharing the same underlying resources (CPU, memory, disk, etc.).
The high volume and low latency SLOs worsen positive feedback loops that lead to a &quot;death spiral&quot;, a classic distributed systems failure mode.</p> <p>When we look at the incidents below, we'll see that most aren't really due to errors in the logic of cache, but rather, some kind of anomaly that causes an insufficiently mitigated positive feedback loop that becomes a runaway feedback loop.</p> <p>So, when reading the incidents below, it may be helpful to read them with an eye towards how cache interacts with things above cache in the stack that call caches and things below cache in the stack that cache interacts with. Something else to look for is how frequently a major incident occurred due to an incompletely applied fix for an earlier incident or because something that was considered a serious operational issue by an engineer wasn't prioritized. These were both common themes in the analysis Rebecca Isaacs and Dan Luu did on causes of failover test failures as well.</p> <h3 id="2011-08-sev-0">2011-08 (SEV-0)</h3> <p>For a few months, a significant fraction of user-initiated changes (such as username, screen name, and password) would get reverted. There was continued risk of this for a couple more years.</p> <h4 id="background">Background</h4> <p>At the time, the Rails app had single-threaded workers, managed by a single master that did health checks, redeploys, etc. If a worker got stuck for 30 seconds, the master would kill the worker and restart it.</p> <p>Teams were running on bare metal, without the benefit of a cluster manager like mesos or kubernetes. Teams had full ownership of the hardware and were responsible for kernel upgrades, etc.</p> <p><a href="https://www.metabrew.com/article/libketama-consistent-hashing-algo-memcached-clients">The algorithm for deciding which shard a key would land on involved a hash. If a node went away, the keys that previously hashed to that node would end up getting hashed to other nodes</a>. Each worker had a client that made its own independent routing decisions to figure out which cache shard to talk to, which meant that each worker made independent decisions as to which cache nodes were live and where keys should live. If a client thinks that a host isn't &quot;good&quot; anymore, that host is said to be ejected.</p> <h4 id="incident">Incident</h4> <p>On Nov 8, a user changed their name from [old name] to [new name]. One week later, their username reverted to [old name].</p> <p>Between Nov 8th and early December, tens of these tickets were filed by support agents. Twitter didn't have the instrumentation to tell where things were going wrong, so the first two weeks of investigation were mostly spent getting metrics into the rails app to understand where the issue was coming from. Each change needed to be coordinated with the deploy team, which would take at least two hours. After the rails app was sufficiently instrumented, all signs pointed to cache as the source of the problem.
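</p> <p>To make the background above concrete, here's a minimal sketch (in Python, with made-up names; this is not the actual client, which used a ketama-style consistent hashing scheme per the link above) of hash-based routing where each worker's client independently decides which hosts are live. Once two clients disagree about liveness, the same key can be cached on two different shards, which is exactly the kind of inconsistency being investigated here:</p> <pre><code># Illustrative only: per-client key routing with independent host ejection.
# The real client used ketama-style consistent hashing, but the failure mode
# is the same: two clients that disagree about which hosts are live will
# route the same key to different shards.
import hashlib

HOSTS = ['cache01', 'cache02', 'cache03', 'cache04']

class CacheClient:
    def __init__(self, hosts):
        self.live_hosts = list(hosts)

    def eject(self, host):
        # A client that sees errors from a host stops routing keys to it.
        if host in self.live_hosts:
            self.live_hosts.remove(host)

    def route(self, key):
        # Hash the key and pick a live host; when the live set changes,
        # keys that used to map to the ejected host move elsewhere.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.live_hosts[h % len(self.live_hosts)]

worker_a = CacheClient(HOSTS)
worker_b = CacheClient(HOSTS)
worker_b.eject('cache02')  # only worker B saw transient errors from cache02

for key in ['user:%d' % i for i in range(10)]:
    a, b = worker_a.route(key), worker_b.route(key)
    if a != b:
        # The same user record is now cached in two places; an update
        # through one worker leaves a stale copy behind on the other shard.
        print(key, 'maps to', a, 'for worker A but', b, 'for worker B')
</code></pre> <p>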
The full set of changes needed to really determine if cache was at fault took another week or two, which included adding metrics to track cache inconsistency, cache exception paths, and host ejection.</p> <p>After adding instrumentation, an engineer made the following comment on a JIRA ticket in early December:</p> <blockquote> <p>I turned on code today to allow us to see the extent to which users in cache are out of sync with users in the database, at the point where we write the user in cache back to the database. The number is roughly 0.2% ... Checked 150 popular users on Twitter to see how many caches they were in (should be at most one). Most of them were on at least two, with some on as many as six.</p> </blockquote> <p>The first fix was to avoid writing stale data back to the DB. However, that didn't address the issue of having multiple copies of the same data in different cache shards. The second fix, intended to reduce the number of times keys appeared in multiple locations, was to retry multiple times before ejecting a host. The idea is that, if a host is really permanently down, that will trigger an alert, but alerts for dead hosts weren't firing, so the errors that were causing host ejections should be transient and therefore, if a client keeps retrying, it should be able to find a key &quot;where it's supposed to be&quot;. And then, to prevent flapping keys from hosts having many transient errors, the time that ejected hosts were kept ejected was increased.</p> <p>This change was tested on one cache and then rolled out to other caches. Rolling out the change to all caches immediately caused the site to go down because ejections still occurred and the longer ejection time caused the backend to get stressed. At the time, the backend was MySQL, which, as configured, could take an arbitrarily long amount of time to return a request under high load. This caused workers to take an arbitrarily long time to return results, which caused the master to kill workers, which took down the site when this happened at scale since not enough workers were available to serve requests.</p> <p>After rolling back the second fix, users could still see stale data since, even though stale data wasn't being written back to the DB, cache updates could happen to a key in one location and then a client could read a stale, cached, copy of that key in another location. Another mitigation that was deployed was to move the user data cache from a high utilization cluster to a low utilization cluster.</p> <p>After debugging further, it was determined that retrying could address ejections occurring due to &quot;random&quot; causes of tail latency, but there was still a high rate of ejections coming from some kind of non-random cause. From looking at metrics, it was observed that there was sometimes a high rate of packet loss and that this was correlated with incoming packet rate but not bandwidth usage. Looking at the host during times of high packet rate and packet loss showed that CPU0 was spending 65% to 70% of its time handling soft IRQs, indicating that the packet loss was likely coming from CPU0 not being able to keep up with the packet arrival rate.</p> <p>The fix for this was to set <a href="https://www.kernel.org/doc/Documentation/IRQ-affinity.txt">IRQ affinity</a> to spread incoming packet processing across all of the physical cores on the box.
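</p> <p>For reference, this kind of fix is usually just a matter of writing CPU lists to <code>/proc/irq/N/smp_affinity_list</code> (or masks to <code>smp_affinity</code>), either by hand, via <code>irqbalance</code>, or with a small script run from host configuration. Here's a rough sketch of the idea; the device name and the simple round-robin policy are illustrative rather than the exact change that was deployed:</p> <pre><code># Rough sketch: spread a NIC's interrupts across CPUs by writing to
# /proc/irq/N/smp_affinity_list. Needs root. The device name ('eth0') and
# the round-robin assignment are assumptions for illustration.
import os
import re

def nic_irqs(device='eth0'):
    # Find IRQ numbers whose /proc/interrupts line mentions the device.
    irqs = []
    with open('/proc/interrupts') as f:
        for line in f:
            if device in line:
                m = re.match(r'\s*(\d+):', line)
                if m:
                    irqs.append(int(m.group(1)))
    return irqs

def spread_irqs(irqs, cpus=None):
    # Round-robin the IRQs across CPUs instead of letting them all land on CPU0.
    if cpus is None:
        cpus = list(range(os.cpu_count()))
    for i, irq in enumerate(irqs):
        cpu = cpus[i % len(cpus)]
        with open('/proc/irq/%d/smp_affinity_list' % irq, 'w') as f:
            f.write(str(cpu))

if __name__ == '__main__':
    spread_irqs(nic_irqs('eth0'))
</code></pre> <p>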
After deploying the fix, packet loss and cache inconsistency was observed on the new cluster that user data was moved to but not the old cluster.</p> <p>At this point, it's late December. Looking at other clusters, it was observed that some other clusters also had packet loss. Looking more closely, the packet loss was happening every 20 hours and 40 minutes on some specific machines. All machines that had this issue were a particular <a href="https://en.wikipedia.org/wiki/Stock_keeping_unit">hardware SKU</a> with a particular BIOS version (the latest version; machines from that SKU with earlier BIOS versions were fine). It turned out that hosts with this BIOS version were triggering the BMC to run a very expensive health check every 20 hours and 40 minutes which interrupted the kernel for the duration, preventing any packets from being processed, causing packet drops.</p> <p>It turned out that someone from the kernel team had noticed this exact issue about six months earlier and had tried to push a kernel config change that would fix the issue (increasing the packet ring buffer size so that transient issues wouldn't cause the packet drops when the buffer overflowed). Although that ticket was marked resolved, the fix was never widely rolled out for reasons that are unclear.</p> <p>A quick mitigation that was deployed was to stagger host reboot times so that clusters didn't have coordinated packet drops across the entire cluster at the same time.</p> <p>Because the BMC version needs to match the BIOS version and the BMC couldn't be rolled back, it wasn't possible to fix the issue by rolling back the BIOS. In order to roll the BMC and BIOS forward, the <code>HWENG</code> team had to do emergency testing/qualification of those, which was done as quickly as possible, at which point the BIOS fix was rolled out and the packet loss went away.</p> <p>The total time for everything combined was about two months.</p> <p>However, this wasn't a complete fix since the host ejection behavior was still unchanged and any random issue that caused one or more clients but not all clients to eject a cache shard would still result in inconsistency. Fixing that required changing cache architectures, which couldn't be quickly done (that took about two years).</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Add visibility</li> <li>Set IRQ affinity to avoid overloading CPU0</li> <li>Fix firmware issue causing hosts to drop packets periodically</li> <li>Fix cache architecture to one that can tolerate partitions without becoming inconsistent</li> </ul> <p><b>Lessons learned</b>:</p> <ul> <li>Need visibility</li> <li>Need low-level systems understanding to operate cache</li> <li>Make isolated changes (one thing that confused the issue was migrating to new cluster at the same time as pushing IRQ affinity fix, which confusingly fixed one packet loss problem and introduced another one at the same time).</li> </ul> <h3 id="2012-07-sev-1">2012-07 (SEV-1)</h3> <p>Non-personalized trends didn't show up for ~10% of users for about 10 hours, who got an empty trends box.</p> <p>An update to the rails app was deployed, after which the trends cache stopped returning results. This only impacted non-personalized trends because those were served directly from rails (personalized trends were served from a separate service).</p> <p>Two hours in, it was determined that this was due to segfaults in the daemon that refreshes the trends cache, which was due to running out of memory. 
The reason this happened was that the deployed change added a Thrift field to the Trend object, which increased the trends cache refresh daemon memory usage beyond the limit.</p> <p>There was an alert on the trends cache daemon failing, but it only checked for the daemon starting a run successfully, not for it finishing a run successfully.</p> <p>Mitigations / fixes:</p> <ul> <li>Increase ulimit</li> <li>Alert changed to use job success as a criteria, not job startup</li> <li>Add global 404 rate to global dashboard</li> </ul> <p>Lessons learned</p> <ul> <li>Alerts should use job success as a criteria, not job startup</li> </ul> <h3 id="2012-07-sev-0">2012-07 (SEV-0)</h3> <p>This was one of the more externally well-known Twitter incidents because this one resulted in the public error page showing, with no images or CSS:</p> <blockquote> <p>Twitter is currently down for &lt;% = reason %&gt;</p> <p>We expect to be back in &lt;% = deadline %&gt;</p> </blockquote> <p>The site was significantly impacted for about four hours.</p> <p>The information on this one is a bit sketchy since records from this time are highly incomplete (the JIRA ticket for this notes, &quot;This incident was heavily Post-Mortemed and reviewed. Closing incident ticket.&quot;, but written documentation on the incident has mostly been lost).</p> <p>The trigger for this incident was power loss in two rows of racks. In terms of the impact on cache, 48 hosts lost power and were restarted when power came back up, one hour later. 37 of those hosts had their caches fail to come back up because a directory that a script expected to exist wasn't mounted on those hosts. &quot;Manually&quot; fixing the layouts on those hosts took 30 minutes and caches came back up shortly afterwards.</p> <p>The directory wasn't actually necessary for running a cache server, at least as they were run at Twitter at the time. However, there was a script that checked for the existence of the directory on startup that was not concurrently updated when the directory was removed from the layout setup script a month earlier.</p> <p>Something else that increased debugging time was that <code>/proc</code> wasn't mounted properly on hosts when they came back up. Although that wasn't the issue, it was unusual and it took some time to determine that it wasn't part of the incident and was an independent non-urgent issue to be fixed.</p> <p>If the rest of the site were operating perfectly, the cache issue above wouldn't have caused such a severe incident, but a number of other issues in combination caused a total site outage that lasted for an extended period of time.</p> <p>Some other issues were:</p> <ul> <li>Slow requests that should've timed out at 5 seconds didn't. 
Instead, they would continue for 30 seconds until the entire worker process that was working on the slow request was killed and restarted <ul> <li>The code that was supposed to cause the 5 second timeout was being run, but it wasn't using the right timestamp to determine duration and therefore didn't trigger a timeout</li> </ul></li> <li>User data service took a long time to recover <ul> <li>Logging during failures used a large amount of resources and caused very high GC pressure</li> </ul></li> <li>A number of non-cache hosts failed to come back up when rebooted, with issues including hanging at <code>fsck</code> or in a <code>PXE</code> boot loop</li> <li>Although the site and error message were static, the outage page used Ruby wildcards, resulting in template messages being displayed to users <ul> <li>This came from Twitter having recently migrated from having the rails app act as a front end to having a C++ front end; assets for errors were directly copied over and still had <a href="https://docs.ruby-lang.org/en/2.3.0/ERB.html">ERB templates</a></li> </ul></li> <li>CSS didn't load because the part of the site CSS would've loaded from was down</li> <li>Front end got overloaded and failed to restart properly when health checks found that shards were unhealthy</li> </ul> <p>Cache mitigations / fixes:</p> <ul> <li>Fix software that configures layouts to avoid issue in future</li> <li>Audit existing hosts to fix issue on any hosts that were then-currently impacted</li> <li>Make sure <code>/proc</code> is mounted on kernel upgrade</li> <li>Create process for updating/upgrading software that configures layouts to reduce probability of introducing future bug</li> <li>Make sure cache hosts (as well as other hosts) are spread more evenly across failure domains</li> </ul> <p>Other mitigations / fixes (highly incomplete):</p> <ul> <li>Set up disk / RAID health &amp; maintenance on observability boxes</li> <li>Send broken / unhealthy hosts to SiteOps for repair</li> <li>Remove Ruby wildcards from outage page</li> <li>Bundle CSS into outage page so that site CSS still works when other things are down but the outage page is up</li> <li>Add load shedding to front end to drop traffic when overloaded</li> <li>Change logging library for user data service to much cheaper logging library to prevent GC pressure from killing the service when error rate is high</li> <li>Fix 5 second timeout to look at correct header</li> <li>Add an independent timeout at a different level of the stack that should also fire if requests are completely failing to make progress</li> <li>Change front-end health check and restart to forcibly kill nodes instead of trying to gracefully shut them down</li> <li>Ensure only one version of the health check script is running on one node at any given time</li> </ul> <p>Lessons learned:</p> <ul> <li>Failure modes need to be actively tested, including failure modes that would cause a host reboot or a timeout</li> <li>Need to have rack diversity requirements, so losing a couple racks won't disproportionately impact a small number of services</li> </ul> <h3 id="2013-01-sev-0">2013-01 (SEV-0)</h3> <p>Site outage for 3h30m</p> <p>An increase in load (AFAIK, normal for the day, not an outlier load spike) caused a tail latency increase on cache.
The tail latency increase on cache was caused by IRQ affinities not being set on new cache hosts, which caused elevated queue lengths and therefore elevated latency.</p> <p>Increased cache latency, along with the design of the tweet service's cache usage, caused shards of the tweet service to enter a GC death spiral (more latency -&gt; more outstanding requests -&gt; more GC pressure -&gt; more load on the shard -&gt; more latency), which then caused increased load on the remaining shards.</p> <p>At the time, the tweet service cache and user data cache were colocated onto the same boxes, with 1 shard of tweet service cache and 2 shards of user data cache per box. Tweet service cache added the new hosts without incident. User data cache then gradually added the new hosts over the course of an evening, also initially without incident. But when morning peak traffic arrived (peak traffic is in the morning because that's close to both Asian and U.S. peak usage times, with Asian countries generally seeing peak usage outside of &quot;9-5&quot; work hours and U.S. peak usage during work hours), that triggered the IRQ affinity issue. Tweet service was much more impacted by the IRQ affinity issue than the user data service.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>IRQ affinity needs to be set for cache hosts, per the 2011-08 incident <ul> <li>Make this the default for boxes instead of having cache hosts do this as one-off changes</li> </ul></li> <li>Change tweet service settings <ul> <li>Reduce max number of connections</li> <li>Increase timeout</li> <li>No GC config changes made because, at the time, GC stats weren't exported as metrics and the GC logs weren't logging sufficient information to understand if bad GC settings were a contributing factor</li> </ul></li> <li>Change settings for all services that use cache <ul> <li>Adjust connection limits to ~2x steady state</li> </ul></li> </ul> <h3 id="2013-09-sev-1">2013-09 (SEV-1)</h3> <p>Overall site success rate dropped to 92% in one datacenter. Users were impacted for about 15 minutes.</p> <p>The timeline service lost access to about 75% of one of the caches it uses. The cache team made a serverset change for that cache and the timeline service wasn't using the recommended mechanism to consume the cache serverset path and didn't &quot;know&quot; which servers were cache servers.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Have timeline service use recommended mechanism for finding serverset path</li> <li>Audit all code that consumes serverset paths to ensure no service is using a non-recommended mechanism for serverset paths</li> </ul> <h3 id="2014-01-sev-0">2014-01 (SEV-0)</h3> <p>The site went down in one datacenter, impacting users whose requests went to that datacenter for 20 minutes.</p> <p>The tweet service started sending elevated load to caches. A then-recent change removed the cap on the number of connections that could be made to caches. At the time, when caches hit around ~160k connections, they would fail to accept new connections. This caused the monitoring service to be unable to connect to cache shards, which caused the monitoring service to restart cache shards, causing an outage.</p> <p>In the months before the outage, there were five tickets describing various ingredients for the outage.</p> <p>In one ticket, a follow-up to a less serious incident caused by a combination of bad C-state configs and SMIs, it was noted that caches stopped accepting connections at ~160k connections.
An engineer debugged the issue in detail, figured out what was going on, and suggested a number of possible paths to mitigating the issue.</p> <p>One ingredient is that, especially when cache is highly loaded, cache can not have <code>accept</code>ed the connection even though the kernel will have established the TCP connection.</p> <p>The client doesn't &quot;know&quot; that the connection isn't really open to the cache and will send a request and wait for a response. Finagle may open multiple connections if it &quot;thinks&quot; that more concurrency is needed. After 150ms, the request will time out. If the queue is long on the cache side, this is likely to be before the cache has even attempted to do anything about the request.</p> <p>After the timeout, Finagle will try again and open another connection, causing the cache shard to become more overloaded each time this happens.</p> <p>On the client side, each of these requests causes a lot of allocations, causing a lot of GC pressure.</p> <p>At the time, settings allowed for 5 requests before marking a node as unavailable for 30 seconds, with 16 connection parallelism and each client attempting to connect to 3 servers. When all those numbers were multiplied out by the number of shards, that allowed the tweet service to hit the limits of what cache can handle before connections stop being accepted.</p> <p>On the cache side, there was one dispatcher thread and N worker threads. The dispatcher thread would call <code>listen</code> and <code>accept</code> and then put work onto queues for worker threads. By default, the backlog length was 1024. When <code>accept</code> failed due to an fd limit, the dispatcher thread set backlog to 0 in <code>listen</code> and ignored all events coming to listening fds. Backlog got reset to normal and connections were accepted again when a connection was closed, freeing up an fd.</p> <p>Before the major incident, it was observed that after the number of connections gets &quot;too high&quot;, connections start getting rejected. After a period of time, the backpressure caused by rejected connections would allow caches to recover.</p> <p>Another ingredient to the issue was that, on one hardware SKU, there were OOMs when the system ran out of <code>32kB</code> pages under high cache load, which would increase load to caches that didn't OOM. This was fixed by a Twitter kernel engineer in</p> <pre><code>commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 Author: Eric W. Biederman &lt;ebiederm@xmission.com&gt; Date: Mon Feb 10 14:25:41 2014 -0800 fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order &gt; 0 allocations can fail when there are copious amounts of free memory available. 
Using the pigeon hole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocate will succeed. A server churning memory with a lot of small requests and replies like memcached is a common case that if anything can will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code and by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes signification). We have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Signed-off-by: &quot;Eric W. Biederman&quot; &lt;ebiederm@xmission.com&gt; Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt; Acked-by: David Rientjes &lt;rientjes@google.com&gt; Cc: Cong Wang &lt;cwang@twopensource.com&gt; Cc: &lt;stable@vger.kernel.org&gt; Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt; Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt; </code></pre> <p>and is another example of why <a href="in-house/">companies the size of Twitter get value out of having a kernel team</a>.</p> <p>Another ticket noted the importance of having standardized settings for cache hosts for things like IRQ affinity, C-states, turbo boost, NIC bonding, and firmware version, which was a follow up to another ticket noting that the tweet service sometimes saw elevated latency on some hosts, which was ultimately determined to be due to increased SMIs after a kernel upgrade impacting one hardware SKU type due to some interactions between the kernel and the firmware version.</p> <p><b>Cache Mitigations / fixes</b>:</p> <ul> <li>Reduce backlog from 1024 to 128 to apply back pressure more quickly when dispatcher is overloaded</li> <li>Lower fd limit to avoid some shards running out of memory</li> <li>Use a fixed hash table size in cache to avoid large load of allocations and memory/CPU load during hash table migration</li> <li>Use CPU affinity on low latency memcache hosts</li> </ul> <p>Tests with these mitigations indicated that, even without fixes to clients to prevent clients from &quot;trying to&quot; overwhelm caches, these prevented cache from falling over under conditions similar to the incident.</p> <p><b>Tweet service Mitigations / fixes</b>:</p> <ul> <li>Change timeout, retry, and concurrent client connection settings to avoid overloading caches</li> </ul> <p><b>Lessons learned</b>:</p> <ul> <li>Consistent hardware settings are important</li> <li>Allowing high queue depth before applying backpressure can be dangerous</li> <li>Clients should &quot;do the math&quot; when setting retry policies to avoid using retry policies that can completely overwhelm cache servers when 100% of responses fail and maximal backpressure is being applied</li> </ul> <h3 id="2014-03-sev-0">2014-03 (SEV-0)</h3> <p><a href="https://twitter.com/TheEllenShow/status/440322224407314432">A tweet from Ellen</a> was retweeted very frequently during the Oscars, which resulted in search going down for about 25 
minutes as well as a site outage that prevented many users from being able to use the site.</p> <p>This incident had a lot of moving parts. From a cache standpoint, this was another example of caches becoming overloaded due to badly behaved clients.</p> <p>It's similar to the 2014-01 incident we looked at, except that the cache-side mitigations put in place for that incident weren't sufficient because the &quot;attacking&quot; clients picked more aggressive values than were used by the tweet service during the 2014-01 incident and, by this time, some caches were running in containerized environments on shared mesos, which made them vulnerable to <a href="cgroup-throttling/">throttling death spirals</a>.</p> <p>The major fix to this direct problem was to add pipelining to the Finagle memcached client, allowing most clients to get adequate throughput with only 1 or 2 connections, reducing the probability of clients hammering caches until they fall over.</p> <p>For other services, there were close to 50 fixes put into place across many services. Some major themes for the fixes were:</p> <ul> <li>Add backpressure where appropriate <ul> <li>Avoid retrying when backpressure is being applied</li> </ul></li> <li>Make sure data (mostly) flows to the same DC to avoid expensive and slow cross-DC traffic</li> <li>Create appropriate thread pools to prevent critical work from being starved</li> <li>Add in-process caching for hot items</li> <li>Return incomplete results for queries when under high load / don't fail requests if results are incomplete</li> <li>Create guide for how to configure cache clients to avoid DDoSing cache</li> </ul> <h3 id="2016-01-sev-0">2016-01 (SEV-0)</h3> <p>SMAP, a former Japanese boy band that became a popular adult J-pop group as well as the hosts of a variety show that was frequently the #1 watched show in Japan, held a conference to falsely deny rumors they were going to break up. This resulted in an outage in one datacenter that impacted users routed to that datacenter for ~20 minutes, until that DC was failed away from. It took about six hours for services in the impacted DC to recover.</p> <p>The tweet service in one DC had a load spike, which caused 39 cache shard hosts to OOM kill processes on those hosts. The cluster manager didn't automatically remove the dead nodes from the server set because there were too many dead nodes (it will automatically remove nodes if a few fail, but if too many fail, this change is not automated due to the possibility of exacerbating some kind of catastrophic failure with an automated action since removing nodes from a cache server set can cause traffic spikes to persistent storage). When cache oncalls manually cleaned up the dead nodes, the service that should have restarted them failed to do so because a puppet change had accidentally removed cache-related configs for the service that would normally restart the nodes.
Once the bad puppet commit was reverted, the cache shards came back up, but these initially came back too slowly and then later came back too quickly, causing recovery of tweet service success rate to take an extended period of time.</p> <p>The cache shard hosts were OOM killed because too much kernel socket buffer memory was allocated.</p> <p>The initial fix for this was to limit TCP buffer size on hosts to 4 GB, but this failed a stress test and it was determined that memory fragmentation on hosts with high uptime (2 years) was the reason for the failure and the mitigation was to reboot hosts more frequently to clean up fragmentation.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Reboot hosts more than once every two years</li> <li>Add puppet alerts to cache boxes to detect breaking puppet changes</li> <li>Change cluster manager to handle large changes better (change already in progress due to a previous, smaller, incident)</li> </ul> <h3 id="2016-02-sev-1">2016-02 (SEV-1)</h3> <p>This was the failed stress test from the 2016-01 <code>SEV-0</code> mentioned above. This mildly degraded site success rate for a few minutes until the stress test was terminated.</p> <h3 id="2016-07-sev-1">2016-07 (SEV-1)</h3> <p>A planned migration of user data cache from dedicated hosts to Mesos led to significant service degradation in one datacenter and then minor degradation in another datacenter. Some existing users were impacted and basically all new user signups failed for about half an hour.</p> <p>115 new cache instances were added to a serverset as quickly as the cluster manager could add them, reducing cache hit rates. The cache cluster manager was expected to add 1 shard every 20 minutes, but the configuration change accidentally changed the minimum cache cluster size, which &quot;forced&quot; the cluster manager to add the nodes as quickly as it could.</p> <p>Adding so many nodes at once reduced user data cache hit rate from the normal 99.8% to 84%. In order to stop this from getting worse, operators killed the cluster manager to prevent it from adding more nodes to the serverset and then redeployed the cluster manager in its previous state to restore the old configuration, which immediately improved user data cache hit rate.</p> <p>During the time period when the cache hit rate was degraded, the backing DB saw a traffic spike that caused long GC pauses. This caused user data service requests that missed cache to have a 0% success rate when querying the backing DB.</p> <p>Although there was rate limiting in place to prevent overloading the backing DB, the thresholds were too high to trigger. In order to recover the backing DB, operators did a rolling restart and deployed strict rate limits. Since one datacenter was failed away from due to the above, the strict rate limit was hit in another datacenter because the failing away from one datacenter caused elevated traffic in another datacenter.
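</p> <p>As an aside, the reason a hit rate drop that sounds modest is so dangerous for the backing store is that the DB only sees the misses, so the relevant number is the change in miss rate, not the change in hit rate. Plugging in the hit rates quoted above (the roughly 80x figure below is just arithmetic on those numbers, not something taken from the incident report):</p> <pre><code># Miss traffic seen by the backing DB is proportional to (1 - hit_rate).
normal_hit_rate = 0.998
degraded_hit_rate = 0.84

amplification = (1 - degraded_hit_rate) / (1 - normal_hit_rate)
print('%.0fx more cache-miss traffic to the backing DB' % amplification)  # roughly 80x
</code></pre> <p>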
This caused a mildly reduced success rate in the user data service because requests were getting rejected by the strict rate limit, which is why this incident also impacted a datacenter that wasn't impacted by the original cache outage.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Add a deploy hook that warns operators who are adding or removing a large number of nodes from a cache cluster</li> <li>Add detailed information in runbooks about how to do deploys, cluster creation, expansion, shrinkage, etc.</li> <li>Add a checklist for all &quot;tier 0&quot; (critical) cache deploys</li> </ul> <h3 id="2018-04-sev-0">2018-04 (SEV-0)</h3> <p>A planned test datacenter failover caused a partial site outage for about 1 hour. Degraded success rate was noticed 1 minute into the failover. The failover test was immediately reverted, but it took most of an hour for the site to fully recover.</p> <p>The initial site degradation came from increased error rates in the user data service, which was caused by cache hot keys. There was a mechanism intended to cache hot keys, which sampled 1% of events (with sampling being used in order to reduce overhead, the idea being that if a key is hot, it should be noticed even with sampling) and put sampled keys into a FIFO queue with a hash map to count how often each key appears in the queue.</p> <p>Although this worked for previous high load events, there were some instances where this didn't work as well as intended (but weren't a root cause in an incident) when the values were large because the 1% sampling rate wouldn't allow the cache to &quot;notice&quot; a hot key quickly enough in the case where there were large (and therefore expensive) values. The original hot key detection logic was designed for tweet service cache, where the largest keys were about 5KB. This same logic was then used for other caches, where keys can be much larger. User data cache wasn't a design consideration for hot keys because, at the time hot key promotion was designed, the user data cache wasn't having hot key issues because, at the time, the items that would've been the hottest keys were served from an in-process cache.</p> <p>The large key issue was exacerbated by the use of <code>FNV1-32</code> for key hashing, which ignores the least significant byte. The data set that was causing a problem had a lot of its variance inside the last byte, so the use of <code>FNV1-32</code> caused all of the keys with large values to be stored on a small number of cache shards.
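</p> <p>A minimal sketch of a detector along the lines described above (sample a small fraction of lookups, keep the sampled keys in a bounded FIFO, and count them with a hash map); the class name, window size, and threshold here are made up, but it shows why a small sample can be slow to flag a key that's hot in bytes rather than in request count:</p> <pre><code># Sketch of a sampling-based hot key detector: sample ~1% of lookups, keep
# sampled keys in a bounded FIFO, count occurrences in a hash map, and
# promote a key once its count in the window crosses a threshold.
# The window size and threshold are made-up numbers.
import random
from collections import Counter, deque

class HotKeyDetector:
    def __init__(self, sample_rate=0.01, window=10000, threshold=50):
        self.sample_rate = sample_rate
        self.window = deque(maxlen=window)  # FIFO of sampled keys
        self.counts = Counter()             # sampled key occurrence counts
        self.threshold = threshold

    def observe(self, key):
        # Returns True if the key should be promoted (treated as hot).
        if random.random() &gt;= self.sample_rate:
            return False
        if len(self.window) == self.window.maxlen:
            evicted = self.window.popleft()
            self.counts[evicted] -= 1
            if self.counts[evicted] == 0:
                del self.counts[evicted]
        self.window.append(key)
        self.counts[key] += 1
        return self.counts[key] &gt;= self.threshold
</code></pre> <p>With numbers like these, a key has to show up in a lot of requests before enough samples accumulate to cross the threshold, which is fine when a key is hot by request count but too slow when a relatively small number of requests for very large values is enough to saturate a shard.</p> <p>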
There were suggestions to migrate off of <code>FNV1-32</code> at least as far back as 2014 for this exact reason and a more modern hash function was added to a utility library, but some cache owners chose not to migrate.</p> <p>Because the hot key promotion logic didn't trigger, traffic saturated NIC bandwidth on the shards that had hot keys and were using 1Gb NICs (Twitter hardware is generally heterogeneous unless someone ensures that clusters only have specific characteristics; although many cache hosts had 10Gb NICs, many also had 1Gb NICs).</p> <p>Fixes / mitigations:</p> <ul> <li>Tune user data cache hot key detection</li> <li>Upgrade all hardware in the relevant cache clusters to hosts with 10Gb NICs</li> <li>Switch some caches from <code>FNV</code> to <code>murmur3</code></li> </ul> <h3 id="2018-06-sev-1">2018-06 (SEV-1)</h3> <p>During a test data center failover, success rate for some kinds of actions dropped to ~50% until the test failover was aborted, about four minutes later.</p> <p>From a cache standpoint, the issue was that tweet service cache shards were able to handle much less traffic than expected (about 50% as much traffic) based on load tests that weren't representative of real traffic, resulting in the tweet service cache being under-provisioned. Among the things that made the load test setup unrealistic were:</p> <ul> <li><a href="https://twitter.com/danluu/status/1360029773011984385">The arrival distribution was highly non-independent, with large spikes due to correlated arrivals when under load</a>. It's common to assume either a constant or Poisson arrival distribution, but as <a href="latency-pitfalls/#minutely-resolution">we saw when looking at metrics data, the commonly used load generation assumption that arrivals are either constant or Poisson is false</a> in a way that can result in unbounded difference between achievable throughput under actual load vs. a load generator that makes naive assumptions</li> <li>The number of connections used in the load test was significantly smaller than the number of connections when under high load in practice</li> </ul> <p>Also, a reason for degraded cache performance was that, once a minute, container-based performance counter collection was run for ten seconds, which was fairly expensive because many more counters were being collected than there were hardware counters, requiring the kernel to do expensive operations to switch out which counters are being collected.</p> <p>The degraded performance increased latency enough during the window when performance counters were collected that cache shards were unable to complete their work before hitting <a href="cgroup-throttling/">container throttling limits</a>, degrading latency to the point that tweet service requests would time out. As configured, after 12 consecutive failures to a single cache node, tweet service clients would mark the node as dead for 30 seconds and stop issuing requests to it, causing the node to get no traffic for 30 seconds as clients independently made the decision to mark the node as dead.
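</p> <p>The client behavior described in that last sentence is roughly a consecutive-failure ejection policy. A minimal sketch of the idea (made-up names, not Finagle's actual implementation) is below; the important property is that every client runs the same logic independently against the same latency spike, so they all eject the node at about the same time:</p> <pre><code># Sketch of per-client ejection: after N consecutive failures to a node,
# stop sending it traffic for a fixed period. Names and structure are
# made up for illustration; this is not the actual client code.
import time

class NodeHealth:
    def __init__(self, max_consecutive_failures=12, eject_seconds=30):
        self.max_failures = max_consecutive_failures
        self.eject_seconds = eject_seconds
        self.consecutive_failures = 0
        self.ejected_until = 0.0

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now &gt;= self.ejected_until

    def record(self, success, now=None):
        now = time.monotonic() if now is None else now
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures &gt;= self.max_failures:
            # Every client hits this at roughly the same time during a
            # shard-wide latency spike, so the shard loses all of its
            # traffic at once and the misses land on the backing store.
            self.ejected_until = now + self.eject_seconds
            self.consecutive_failures = 0
</code></pre> <p>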
This caused request rates to the backing DB to increase past the request rate quota, causing requests to get rejected at the DB, increasing the failure rate of the tweet service.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Reduced number of connections from tweet service client to cache from 4 to 2, which reduced latency <ul> <li>As noted in a previous incident, adding pipelining allowed caches to operate efficiently with only 1 client connection, but some engineers were worried that 1 might not be enough because the number of connections was previously much higher, so 4 was chosen &quot;just in case&quot;, but, with standard Linux kernel networking, <a href="https://twitter.github.io/pelikan/2020/benchmark-adq.html">having more connections increases tail latency</a>, so this degraded performance</li> </ul></li> <li>Add more cache nodes to reduce load on individual cache shards</li> <li>Improve cache hot key promotion algorithm <ul> <li>This wasn't specific to this incident, but an engineer did an analysis and found that the hot key promotion algorithm introduced a year ago had a cache hit rate of approximately 0.3% due to a combination of issues for one cache cluster. Switching to a better algorithm improved cache hit rate and performance significantly</li> </ul></li> <li>Change cache qualification process so that the cache performance used to determine capacity (number of nodes) more accurately reflects real-world cache performance</li> <li>Do a detailed analysis of the cost of multiplexed performance counter collection</li> </ul> <p><i>Thanks to <font size=+1><b><a rel="sponsored" href ="https://www.reforge.com/all-programs?utm_source=danluu&utm_medium=referral&utm_campaign=spring22_newsletter_test&utm_term=&utm_content=engineering">Reforge - Engineering Programs</a></b></font> and <font size=+1><b><a rel="sponsored" href ="https://flatironsdevelopment.com/">Flatirons Development</a></b></font> for helping to make this post possible by <a href="https://patreon.com/danluu">sponsoring me at the Major Sponsor tier</a>.</p> <p>Also, thanks to Michael Leinartas, Tao L., Michael Motherwell, Jonathan Riechhold, Stephan Zuercher, Justin Blank, Jamie Brandon, John Hergenroeder, and Ben Kuhn for comments/corrections/discussion.</i></p> <h3 id="appendix-pelikan-cache">Appendix: Pelikan cache</h3> <p><a href="https://twitter.github.io/pelikan/">Pelikan</a> was created to address issues we saw when operating memcached and Redis at scale. <a href="https://twitter.github.io/pelikan/2019/why-pelikan.html">This document</a> explains some of the motivations for Pelikan. The modularity / ease of modification has allowed us to discover novel cache innovations, such as <a href="https://twitter.com/danluu/status/1381687511362138113">a new eviction algorithm that addresses the problems we ran into with existing eviction algorithms</a>.</p> <p>With respect to the kinds of things discussed in this post, Pelikan has had more predictable performance, better median performance, and better performance in the tail than our existing caches when we've tested it in production, which means we get better reliability and more capacity at a lower cost.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:L"><p>That knowledge decays at a high rate isn't unique to Twitter. In fact, of all the companies I've worked at as a full-time employee, I think Twitter is the best at preserving knowledge.
The chip company I worked at, Centaur, basically didn't believe in written documentation other than having comprehensive bug reports, so many kinds of knowledge became lost very quickly. Microsoft was almost as bad since, by default, documents were locked down and fairly need-to-know, so basically nobody other than perhaps a few folks with extremely broad permissions would even be able to dig through old docs to understand how things had come about.</p> <p>In its early days, Google was a lot like Twitter is now, but as the company grew and fears about legal actions grew, especially after multiple embarrassing incidents when <a href="https://twitter.com/danluu/status/1172676081628766208">execs stated their intention to take unethical and illegal actions</a>, things became more locked down, like Microsoft.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:R">There's also some use of a Redis fork, but the average case performance is significantly worse and the performance in the tail is relatively worse than the average case performance. Also, it has higher operational burden at scale directly due to its design, which limits its use for us. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> Cocktail party ideas cocktail-ideas/ Wed, 02 Feb 2022 00:00:00 +0000 cocktail-ideas/ <p>You don't have to be at a party to see this phenomenon in action, but there's a curious thing I regularly see at parties in social circles where people value intelligence and cleverness without similarly valuing on-the-ground knowledge or intellectual rigor. People often discuss the standard trendy topics (some recent ones I've observed at multiple parties are how to build a competitor to Google search and how to solve the problem of high transit construction costs) and explain why people working in the field today are doing it wrong and then explain how they would do it instead. I occasionally have good conversations that fit that pattern (with people with very deep expertise in the field who've been working on changing the field for years), but the more common pattern is that someone with cocktail-party level knowledge of a field will give their ideas on how the field can be fixed.</p> <p>Asking people why they think their solutions would solve valuable problems in the field has become a hobby of mine when I'm at parties where this kind of superficial pseudo-technical discussion dominates the party. What I've found when I've asked for details is that, in areas where I have some knowledge, people <a href="sounds-easy/">generally don't know what sub-problems need to be solved to solve the problem they're trying to address, making their solution hopeless</a>. After having done this many times, my opinion is that the root cause of this is generally that many people who have a superficial understanding of a topic assume that the topic is as complex as their understanding of the topic instead of realizing that only knowing a bit about a topic means that they're missing an understanding of the full complexity of a topic.</p> <p>Since I often attend parties with programmers, this means I often hear <a href="https://twitter.com/nick_r_cameron/status/1346174149044043776">programmers retelling</a> their <a href="https://twitter.com/danluu/status/1347269793578053632">cocktail-party level understanding of another field</a> (the search engine example above notwithstanding).
If you want a sample of similar comments online, you can often see these when programmers discuss &quot;trad&quot; engineering fields. An example I enjoyed was <a href="https://twitter.com/danluu/status/1162469763374673920">this Twitter thread where Hillel Wayne discussed how programmers without knowledge of trad engineering often have incorrect ideas about what trad engineering is like</a>, where many of the responses are from programmers with little to no knowledge of trad engineering who then reply to Hillel with their misconceptions. When Hillel completed his <a href="https://www.hillelwayne.com/tags/crossover-project/">crossover project</a>, where he interviewed people who've worked in a trad engineering field as well as in software, <a href="https://twitter.com/danluu/status/1484268111687663620">he got even more such comments</a>. Even when people are warned that naive conceptions of a field are likely to be incorrect, many can't help themselves and they'll immediately reply with their opinions about a field they know basically nothing about.</p> <p>Anyway, in the crossover project, Hillel compared the perceptions of people who'd actually worked in multiple fields to pop-programmer perceptions of trad engineering. One of the many examples of this that Hillel gives is when people talk about bridge building, where he notes that programmers say things like</p> <blockquote> <p>The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly. If we did not quickly adapt to the unforeseen, the only foreseeable event would be our own destruction.</p> </blockquote> <p>and</p> <blockquote> <p>No one thinks about moving the starting or ending point of the bridge midway through construction.</p> </blockquote> <p>But Hillel interviewed a civil engineer who said that they had to move a bridge! Of course, civil engineers don't move bridges as frequently as programmers deal with changes in software but, if you talk to actual, working, civil engineers, many civil engineers frequently deal with changing requirements after a job has started that's not fundamentally different from what programmers have to deal with at their jobs. People who've worked in both fields or at least talk to people in the other field tend to think the concerns faced by engineers in both fields are complex, but people with a cocktail-party level of understanding of the field often claim that the field they're not in is simple, unlike their field.</p> <p>A line I often hear from programmers is that programming is like &quot;having to build a plane while it's flying&quot;, implicitly making the case that programming is harder than designing and building a plane since people who design and build planes can do so before the plane is flying<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">1</a></sup>. But, of course, someone who designs airplanes could just as easily say &quot;gosh, my job would be very easy if I could build planes with 4 9s of uptime and my plane were allowed to crash and kill all of the passengers for 1 minute every week&quot;. Of course, the constraints on different types of projects and different fields make different things hard, but people often seem to have a hard time seeing constraints other fields have that their field doesn't. 
One might think that understanding that their own field is more complex than an outsider might naively think would help people understand that other fields may also have hidden complexity, but that doesn't generally seem to be the case.</p> <p>If we look at the rest of the statement Hillel was quoting (which is from the top &amp; accepted answer to a stack exchange question), the author goes on to say:</p> <blockquote> <p>It's much easier to make accurate projections when you know in advance exactly what you're being asked to project rather than making guesses and dealing with constant changes.</p> <p>The vast majority of bridges are using extremely tried and true materials, architectures, and techniques. A Roman engineer could be transported two thousand years into the future and generally recognize what was going on at a modern construction site. There would be differences, of course, but you're still building arches for load balancing, you're still using many of the same materials, etc. Most software that is being built, on the other hand . . .</p> </blockquote> <p>This is typical of the kind of error people make when they're discussing cocktail-party ideas. Programmers legitimately gripe when clueless execs who haven't been programmers for a decade request unreasonable changes to a project that's in progress, but this is not so different and actually more likely to be reasonable than when politicians who've never been civil engineers require project changes on large scale civil engineering projects. It's plausible that, on average, programming projects have more frequent or larger changes to the project than civil engineering projects, but I'd guess that the intra-field variance is at least as large as the inter-field variance.</p> <p>And, of course, only someone who hasn't done serious engineering work in the physical world could say something like &quot;The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly&quot;, thinking that the (relative) fixity of physical laws means that physical work is predictable. When I worked as a hardware engineer, a large fraction of the effort and complexity of my projects went into dealing with physical uncertainty and civil engineering is no different (if anything, the tools civil engineers have to deal with physical uncertainty on large scale projects are much worse, resulting in a larger degree of uncertainty and a reduced ability to prevent delays due to uncertainty).</p> <p>If we look at how Roman engineering or even engineering from 300 years ago differs from modern engineering, a major source of differences is our much better understanding of uncertainty that comes from the physical world. It didn't used to be shocking when a structure failed not too long after being built without any kind of unusual conditions or stimulus (e.g., building collapse, or train accident due to incorrectly constructed rail). This is now rare enough that it's major news if it happens in the U.S.
or Canada and this understanding also lets us build gigantic structures in areas where it would have been previously considered difficult or impossible to build moderate-sized structures.</p> <p>For example, if you look at a large-scale construction project in the Vancouver area that's sitting on the delta (Delta, Richmond, much of the land going out towards Hope), it's only relatively recently that we discovered the knowledge necessary to build some large scale structures (e.g., tall-ish buildings) reliably on that kind of ground, which is one of the many parts of modern civil engineering a Roman engineer wouldn't understand. A lot of this comes from a field called geotechnical engineering, a sub-field of civil engineering (alternately, arguably its own field and also arguably a subfield of geological engineering) that involves the ground, i.e., soil mechanics, rock mechanics, geology, hydrology, and so on and so forth. One fundamental piece of geotechnical engineering is the idea that you can apply <a href="https://en.wikipedia.org/wiki/Mechanics">mechanics</a> to reason about soil. The first known application of mechanics to soils, a fundamental part of geotechnical engineering, was in 1773 and geotechnical engineering as it's thought of today is generally said to have started in 1925. While Roman engineers did a lot of impressive work, <a href="https://www.patreon.com/posts/61946482">the mental models they were operating with precluded understanding much of modern civil engineering</a>.</p> <p>Naturally, for this knowledge to have been able to change what we can build, it must change how we build. If we look at what a construction site on compressible Vancouver delta soils that uses this modern knowledge looks like, by wall clock time, it mostly looks like someone put a pile of sand on the construction site (preload). While a Roman engineer would know what a pile of sand is, they wouldn't know how someone figured out how much sand was needed and how long it needed to be there (in some cases, Romans would use piles or rafts where we would use preload today, but in many cases, they had no answer to the problems preload solves today).</p> <p>Geotechnical engineering and the resultant pile of sand (preload) is one of tens of sub-fields where you'd need expertise when doing a modern, large scale, civil engineering project that a Roman engineer would need a fair amount of education to really understand.</p> <p>Coming back to cocktail party solutions I hear, one common set of solutions is how to fix high construction costs and slow construction. There's a set of trendy ideas that people throw around about why things are so expensive, why projects took longer than projected, etc. Sometimes, these comments are similar to what I hear from practicing engineers that are involved in the projects but, more often than not, the reasons are pretty different. 
When the reasons are the same, it seems that <a href="https://twitter.com/danluu/status/1420866014493822980">they must be correct by coincidence since they don't seem to understand the body of knowledge necessary to reason through the engineering tradeoffs</a><sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup>.</p> <p>Of course, like cocktail party theorists, <a href="https://twitter.com/danluu/status/1483162978224463872">civil engineers with expertise in the field also think that modern construction is wasteful</a>, but the reasons they come up with are often quite different from what I hear at parties<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">3</a></sup>. It's easy to come up with cocktail party solutions to problems by not understanding the problem, assuming the problem is artificially simple, and then coming up with a solution to the imagined problem. It's harder to understand the tradeoffs in play among the tens of interacting engineering sub-fields required to do large scale construction projects and have an actually relevant discussion of what the tradeoffs should be and how one might motivate engineers and policy makers to shift where the tradeoffs land.</p> <p>A widely cited study on the general phenomena of people having <a href="https://twitter.com/danluu/status/1356056202203947008">wildly oversimplified and incorrect models of how things work</a> is <a href="https://link.springer.com/content/pdf/10.3758/BF03195929.pdf">this study by Rebecca Lawson on people's understanding of how bicycles work</a>, which notes:</p> <blockquote> <p>Recent research has suggested that people often overestimate their ability to explain how things function. Rozenblit and Keil (2002) found that people overrated their understanding of complicated phenomena. This illusion of explanatory depth was not merely due to general overconfidence; it was specific to the understanding of causally complex systems, such as artifacts (crossbows, sewing machines, microchips) and natural phenomena (tides, rainbows), relative to other knowledge domains, such as facts (names of capital cities), procedures (baking cakes), or narratives (movie plots).</p> </blockquote> <p>And</p> <blockquote> <p>It would be unsurprising if nonexperts had failed to explain the intricacies of how gears work or why the angle of the front forks of a bicycle is critical. Indeed, even physicists disagree about seemingly simple issues, such as why bicycles are stable (Jones, 1970; Kirshner, 1980) and how they steer (Fajans, 2000). What is striking about the present results is that so many people have virtually no knowledge of how bicycles function.​​</p> </blockquote> <p>In &quot;experiment 2&quot; in the study, people were asked to draw a working bicycle and focus on the mechanisms that make the bicycle work (as opposed to making the drawing look nice) and 60 of the 94 participants had at least one gross error that caused the drawing to not even resemble a working bicycle. 
If we look at a large-scale real-world civil engineering project, a single relevant subfield, like geotechnical engineering, contains many orders of magnitude more complexity than a bicycle and it's pretty safe to guess that, to the nearest percent, zero percent of lay people (or Roman engineers) could roughly sketch out what the relevant moving parts are.</p> <p>For a non-civil engineering example, Jamie Brandon quotes this excerpt from <a href="https://amzn.to/3HrzkSc">Jim Manzi's Uncontrolled</a>, which is a refutation of a &quot;clever&quot; nugget that I've frequently heard trotted out at parties:</p> <blockquote> <p>The paradox of choice is a widely told folktale about a single experiment in which putting more kinds of jam on a supermarket display resulted in less purchases. The given explanation is that choice is stressful and so some people, facing too many possible jams, will just bounce out entirely and go home without jam. This experiment is constantly cited in news and media, usually with descriptions like &quot;scientists have discovered that choice is bad for you&quot;. But if you go to a large supermarket you will see approximately 12 million varieties of jam. Have they not heard of the jam experiment? Jim Manzi relates in <a href="https://amzn.to/3HrzkSc">Uncontrolled</a>:</p> <blockquote> <p>First, note that all of the inference is built on the purchase of a grand total of thirty-five jars of jam. Second, note that if the results of the jam experiment were valid and applicable with the kind of generality required to be relevant as the basis for economic or social policy, it would imply that many stores could eliminate 75 percent of their products and cause sales to increase by 900 percent. That would be a fairly astounding result and indicates that there may be a problem with the measurement.</p> <p>... the researchers in the original experiment themselves were careful about their explicit claims of generalizability, and significant effort has been devoted to the exact question of finding conditions under which choice overload occurs consistently, but popularizers telescoped the conclusions derived from one coupon-plus-display promotion in one store on two Saturdays, up through assertions about the impact of product selection for jam for this store, to the impact of product selection for jam for all grocery stores in America, to claims about the impact of product selection for all retail products of any kind in every store, ultimately to fairly grandiose claims about the benefits of choice to society. But as we saw, testing this kind of claim in fifty experiments in different situations throws a lot of cold water on the assertion.</p> <p>As a practical business example, even a simplification of the causal mechanism that comprises a useful forward prediction rule is unlikely to be much like 'Renaming QwikMart stores to FastMart will cause sales to rise,' but will instead tend to be more like 'Renaming QwikMart stores to FastMart in high-income neighborhoods on high-traffic roads will cause sales to rise, as long as the store is closed for painting for no more than two days.' It is extremely unlikely that we would know all of the possible hidden conditionals before beginning testing, and be able to design and execute one test that discovers such a condition-laden rule.</p> <p>Further, these causal relationships themselves can frequently change. 
For example, we discover that a specific sales promotion drives a net gain in profit versus no promotion in a test, but next year when a huge number of changes occurs - our competitors have innovated with new promotions, the overall economy has deteriorated, consumer traffic has shifted somewhat from malls to strip centers, and so on - this rule no longer holds true. To extend the prior metaphor, we are finding our way through our dark room by bumping our shins into furniture, while unobserved gremlins keep moving the furniture around on us. For these reasons, it is not enough to run an experiment, find a causal relationship, and assume that it is widely applicable. We must run tests and then measure the actual predictiveness of the rules developed from these tests in actual implementation.</p> </blockquote> </blockquote> <p>So far, we've discussed examples of people with no background in a field explaining how a field works or should work, but the error of taking a high-level view and incorrectly assuming that things are simple also happens when people step back and have a high-level view of their own field that's disconnected from the details. For example, back when I worked at Centaur and we'd not yet shipped a dual core chip, a nearly graduated PhD student in computer architecture from a top school asked me, &quot;why don't you just staple two cores together to make a dual core chip like Intel and AMD? That's an easy win&quot;.</p> <p>At that time, we'd already been working on going from single core to multi core for more than one year. Making a single core chip multi-core or even multi-processor capable with decent performance requires significant additional complexity to the cache and memory hierarchy, the most logically complex part of the chip. As a rough estimate, I would guess that taking a chip designed for single-core use and making it multi-processor capable at least doubles the amount of testing/verification effort required to produce a working chip (and the majority of the design effort that goes into a chip is on testing/verification). More generally, a computer architect is only as good as their understanding of the tradeoffs their decisions impact. Great ones have a strong understanding of the underlying fields they must interact with. A common reason that a computer architect will make a bad decision is that they have a cocktail party level understanding of the fields that are one or two levels below computer architecture. An example of a bad decision that's occurred multiple times in industry is when a working computer architect decides to add <a href="https://en.wikipedia.org/wiki/Simultaneous_multithreading">SMT</a> to a chip because it's basically a free win. You pay a few percent extra area and get perhaps 20% better performance. I know of multple attempts to do this that completely failed for predictable reasons because the architect failed to account for the complexity and verification cost of adding SMT. Adding SMT adds much more complexity than adding a second core because the logic has to be plumbed through everything and it causes an explosion in the complexity of verifying the chip for the same reason. Intel famously added SMT to the P4 and did not enable in the first generation it was shipped in because it was too complex to verify in a single generation and had critical, showstopping, bugs. 
With the years of time they had to shake the bugs out on one generation of architecture, they fixed their SMT implementation and shipped it in the next generation of chips. This happened again when they migrated to the Core architecture and added SMT to that. A working computer architect should know that this happened twice to Intel, implying that verifying an SMT implementation is hard, and yet there have been multiple instances where someone had a cocktail party level of understanding of the complexity of SMT and suggested adding it to a design that did not have the verification budget to ever ship a working chip with SMT.</p> <p><a href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">And, of course, this isn't really unique to computer architecture</a>. I used the dual core example because it's one that happens to currently be top-of-mind for me, but I can think of tens of similar <a href="https://twitter.com/danluu/status/814167684954738688">examples</a> off the top of my head and I'm pretty sure I could write up a few hundred examples if I spent a few days thinking about similar examples. <a href="https://twitter.com/danluu/status/1594227349662609408">People working in a field still have to be very careful to avoid having an incorrect, too abstract, view of the world that elides details</a> and <a href="https://twitter.com/altluu/status/1484589911873261568">draws comically wrong inferences or conclusions as a result</a>. When people outside a field explain how things should work, their explanations are generally even worse than someone in the field who missed a critical consideration and <a href="https://www.patreon.com/posts/54329188">they generally present</a> <a href="https://yosefk.com/blog/the-high-level-cpu-challenge.html">crank ideas</a>.</p> <p>Bringing together the Roman engineering example and the CPU example, going from 1 core to 2 (and, in general, going from 1 to 2, as in 1 datacenter to 2 datacenters or a monolith to a distributed system) is something every practitioner should understand is hard, even if some don't. Somewhat relatedly, if someone showed off a 4 THz processor that had 1000x the performance of a 4 GHz processor, that's something any practitioner should recognize as alien technology that they definitely do not understand. Only a lay person with no knowledge of the field could reasonably think to themselves, &quot;it's just a processor running at 1000x the clock speed; an engineer who can make a 4 GHz processor would basically understand how a 4 THz processor with 1000x the performance works&quot;. We are so far from being able to scale up performance by 1000x by running chips 1000x faster that doing so would require many fundamental breakthroughs in technology and, most likely, the creation of entirely new fields that contain more engineering knowledge than exists in the world today. Similarly, only a lay person could look at Roman engineering and modern civil engineering and think &quot;Romans built things and we build things that are just bigger and more varied; a Roman engineer should be able to understand how we build things today because the things are just bigger&quot;. 
Geotechnical engineering alone contains more engineering knowledge than existed in all engineering fields combined in the Roman era and it's only one of the <a href="https://www.patreon.com/posts/61946482">new fields that had to be invented to allow building structures like we can build today</a>.</p> <p>Of course, I don't expect random programmers to understand geotechnical engineering, but I would hope that someone who's making a comparison between programming and civil engineering would at least have some knowledge of civil engineering and not just assume that the amount of knowledge that exists in the field is roughly equal to their knowledge of the field when they know basically nothing about the field.</p> <p>Although <a href="https://www.patreon.com/posts/60185075">I seem to try a lot harder than most folks to avoid falling into the trap of thinking something is simple because I don't understand it</a>, I still fall prey to this all the time and the best things I've come up with to prevent this, while better than nothing, are not reliable.</p> <p>One part of this is that I've tried to cultivate noticing &quot;the feeling of glossing over something without really understanding it&quot;. I think of this is analogous to (and perhaps it's actually the same thing as) something that's become trendy over the past twenty years, paying attention to how emotions feel in your body and understanding your emotional state by noticing feelings in your body, e.g., a certain flavor of tight feeling in a specific muscle is a sure sign that I'm angry.</p> <p>There's a specific feeling I get in my body when I have a fuzzy, high-level, view of something and am mentally glossing over it. I can easily miss it if I'm not paying attention and I suspect I can also miss it when I gloss over something in a way where the non-conscious part of the brain that generates the feeling doesn't even know that I'm glossing over something. Although noticing this feeling is inherently unreliable, I think that everything else I might do that's self contained to check my own reasoning fundamentally relies on the same mechanism (e.g., if I have a checklist to try to determine if I haven't glossed over something when I'm reasoning about a topic, some part of that process will still rely on feeling or intuition). I do try to postmortem cases where I missed the feeling to figure out happened, and that's basically how I figured out that I have a feeling associated with this error in the first place (I thought about what led up to this class of mistake in the past and noticed that I have a feeling that's generally associated with it), but that's never going to perfect or even <a href="p95-skill/">very good</a>.</p> <p>Another component is doing what I think of as &quot;checking inputs into my head&quot;. When I was in high school, I noticed that a pretty large fraction of the &quot;obviously wrong&quot; things I said came from letting incorrect information into my head. 
I didn't and still don't have a good, cheap, way to tag a piece of information with how reliable it is, so I find it much easier to either fact-check or discard information on consumption.</p> <p>Another thing I try to do is <a href="writing-non-advice/#appendix-getting-feedback">get feedback</a>, which is unreliable and also intractable in the general case since the speed of getting feedback is so much slower than the speed of thought that slowing down general thought to the speed of feedback would result in having relatively few thoughts<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">4</a></sup>.</p> <p>Although, <a href="teach-debugging/">unlike in some areas, there's no mechanical, systematic, set of steps</a> that can be taught that will solve the problem, I do think this is something that can be practiced and improved and there are some fields where similar skills are taught (often implicitly). For example, when discussing the prerequisites for an advanced or graduate level textbook, it's not uncommon to see a book say something like &quot;Self contained. No prerequisites other than mathematical maturity&quot;. This is a shorthand way of saying &quot;This book doesn't require you to know any particular mathematical knowledge that a high school student wouldn't have picked up, but you do need to have ironed out a kind of fuzzy thinking that almost every untrained person has when it comes to interpreting and understanding mathematical statements&quot;. Someone with a math degree will have a bunch of explicit knowledge in their head about things like <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz inequality</a> and the <a href="https://en.wikipedia.org/wiki/Bolzano%E2%80%93Weierstrass_theorem">Bolzano-Weierstrass theorem</a>, but the important stuff for being able to understand the book isn't the explicit knowledge, but the general way one thinks about math.</p> <p>Although there isn't really a term for the equivalent of mathematical maturity in other fields, e.g., people don't generally refer to &quot;systems designs maturity&quot; as something people look for in <a href="https://twitter.com/danluu/status/1470890504833228801">systems design interviews</a>, the analogous skill exists even though it doesn't have a name. And likewise for just thinking about topics where one isn't a trained expert, like a non-civil engineer thinking about why a construction project cost what it did and took as long as it did, a sort of general maturity of thought<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">5</a></sup>.</p> <p>Thanks to <font size=+1><b><a rel="sponsored" href ="https://www.reforge.com/all-programs?utm_source=danluu&utm_medium=referral&utm_campaign=spring22_newsletter_test&utm_term=&utm_content=engineering">Reforge - Engineering Programs</a></b></font> and <font size=+1><b><a rel="sponsored" href ="https://flatironsdevelopment.com/">Flatirons Development</a></b></font> for helping to make this post possible by <a href="https://patreon.com/danluu">sponsoring me at the Major Sponsor tier</a>.</p> <p>Also, thanks to Pam Wolf, Ben Kuhn, Yossi Kreinin, Fabian Giesen, Laurence Tratt, Danny Lynch, Justin Blank, A. 
Cody Schuffelen, Michael Camilleri, and Anonymous for comments/corrections discussion.</p> <h4 id="appendix-related-discussions">Appendix: related discussions</h4> <p>An anonymous blog reader gave this example of their own battle with cocktail party ideas:</p> <blockquote> <p>Your most recent post struck a chord with me (again!), as I have recently learned that I know basically nothing about making things cold, even though I've been a low-temperature physicist for nigh on 10 years, now. Although I knew the broad strokes of cooling, and roughly how a dilution refrigerator works, I didn't appreciate the sheer challenge of keeping things at milliKelvin (mK) temperatures. I am the sole physicist on my team, which otherwise consists of mechanical engineers. We have found that basically every nanowatt of dissipation at the mK level matters, as does every surface-surface contact, every material choice, and so on.</p> <p>Indeed, we can say that the physics of thermal transport at mK temperatures is well understood, and we can write laws governing the heat transfer as a function of temperature in such systems. They are usually written as P = aT^n. We know that different classes of transport have different exponents, n, and those exponents are well known. Of course, as you might expect, the difference between having 'hot' qubits vs qubits at the base temperature of the dilution refrigerator (30 mK) is entirely wrapped up in the details of exactly what value of the pre-factor a happens to be in our specific systems. This parameter can be guessed, usually to within a factor of 10, sometimes to within a factor of 2. But really, to ensure that we're able to keep our qubits cold, we need to measure those pre-factors. Things like type of fastener (4-40 screw vs M4 bolt), number of fasteners, material choice (gold? copper?), and geometry all play a huge role in the actual performance of the system. Oh also, it turns out n changes wildly as you take a metal from its normal state to its superconducting state. Fun!</p> <p>We have spent over a year carefully modeling our cryogenic systems, and in the process have discovered massive misconceptions held by people with 15-20 years of experience doing low-temperature measurements. We've discovered material choices and design decisions that would've been deemed insane had any actual thermal modeling been done to verify these designs.</p> <p>The funny thing is, this was mostly fine if we wanted to reproduce the results of academic labs, which mostly favored simpler experiment design, but just doesn't work as we leave the academic world behind and design towards our own purposes.</p> <p>P.S. Quantum computing also seems to suffer from the idea that controlling 100 qubits (IBM is at 127) is not that different from 1,000 or 1,000,000. I used to think that it was just PR bullshit and the people at these companies responsible for scaling were fully aware of how insanely difficult this would be, but after my own experience and reading you post, I'm a little worried that most of them don't truly appreciate the titanic struggle ahead for us.</p> <p>This is just a long-winded way of saying that I have held cocktail party ideas about a field in which I have a PhD and am ostensibly an expert, so your post was very timely for me. I like to use your writing as a springboard to think about how to be better, which has been very difficult. 
It's hard to define what a good physicist is or does, but I'm sure that trying harder to identify and grapple with the limits of my own knowledge seems like a good thing to do.</p> </blockquote> <p>For a broader and higher-level discussion of clear thinking, see Julia Galef's Scout Mindset:</p> <blockquote> <p>WHEN YOU THINK of someone with excellent judgment, what traits come to mind? Maybe you think of things like intelligence, cleverness, courage, or patience. Those are all admirable virtues, but there’s one trait that belongs at the top of the list that is so overlooked, it doesn’t even have an official name.</p> <p>So I’ve given it one. I call it scout mindset: the motivation to see things as they are, not as you wish they were.</p> <p>Scout mindset is what allows you to recognize when you are wrong, to seek out your blind spots, to test your assumptions and change course. It’s what prompts you to honestly ask yourself questions like “Was I at fault in that argument?” or “Is this risk worth it?” or “How would I react if someone from the other political party did the same thing?” As the late physicist Richard Feynman once said, “The first principle is that you must not fool yourself—and you are the easiest person to fool.”</p> </blockquote> <p>As a tool to improve thought, the book has <a href="https://twitter.com/danluu/status/1477789638387322880">a number of chapters that give concrete checks that one can try</a>, which makes it more (or at least more easily) actionable than this post, which merely suggests that you figure out what it feels like when you're glossing over something. But I don't think that the ideas in the book are a substitute for this post, in that the self-checks the book suggests don't directly attack the problem discussed in this post.</p> <p>In one chapter, Galef suggests leaning into confusion (e.g., if some seemingly contradictory information gives rise to a feeling of confusion), which I agree with. I would add that there are a lot of other feelings that are useful to observe that don't really have a good name. When it comes to evaluating ideas, some that I try to note, beside the already mentioned &quot;the feeling that I'm glossing over important details&quot;, are &quot;the feeling that a certain approach is likely to pay off if pursued&quot;, &quot;the feeling that an approach is really fraught/dangerous&quot;, &quot;the feeling that there's critical missing information&quot;, &quot;the feeling that something is really wrong&quot;, along with similar feelings that don't have great names.</p> <p>For a discussion of how the movie Don't Look Up promotes the idea that the world is simple and we can easily find cocktail party solutions to problems, see <a href="https://astralcodexten.substack.com/p/movie-review-dont-look-up">this post by Scott Alexander</a>.</p> <p>Also, John Salvatier notes that <a href="http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail">reality has a surprising amount of detail</a>.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:N">Another one I commonly hear is that, unlike trad engineers, <a href="https://twitter.com/danluu/status/1162469760900091904">programmers do things that have never been done before</a> <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> <li id="fn:D"><p>Discussions about construction delays similarly ignore geotechnical reasons for delays. 
As with the above, I'm using geotechnical as an example of a sub-field that explains many delays because it's something I happen to be familiar with, not because it's the most important thing, but it is a major cause of delays and, on many kinds of projects, the largest cause of delays.</p> <p>Going back to our example that a Roman engineer might, at best, superficially understand, the reason that we pile dirt onto the ground before building is that much of Vancouver has poor geotechnical conditions for building large structures. The ground is soft and will get unevenly squished down over time if something heavy is built on top of it. The sand is there as a weight, to pre-squish the ground.</p> <p>As described in the paragraph above, this sounds straightforward. Unfortunately, it's anything but. As it happens, I've been spending a lot of time driving around with a geophysics engineer (a field that's related to but quite distinct from geotechnical engineering). When we drive over a funny bump or dip in the road, she can generally point out the geotechnical issue or politically motivated decision to ignore the geotechnical engineer's guidance that caused the bump to come into existence. The thing I find interesting about this is that, even though the level of de-risking done for civil engineering projects is generally much higher than is done for the electrical engineering projects I've worked on, where in turn it's much higher than on any software project I've worked on, enough &quot;bugs&quot; still make it into &quot;production&quot; that you can see tens or hundreds of mistakes in a day if you drive around, are knowledgeable, and pay attention.</p> <p>Fundamentally, the issue is that humanity does not have the technology to understand the ground at anything resembling a reasonable cost for physically large projects, like major highways. One tool that we have is to image the ground with ground penetrating radar, but this results in highly <a href="https://en.wikipedia.org/wiki/Underdetermined_system">underdetermined</a> output. Another tool we have is to use something like a core drill or soil augur, which is basically digging down into the ground to see what's there. This also has inherently underdetermined output because we only get to see what's going on exactly where we drilled and the ground sometimes has large spatial variation in its composition that's not obvious from looking at it from the surface. A common example is when there's an unmapped remnant creek bed, which can easily &quot;dodge&quot; the locations where soil is sampled. Other tools also exist, but they, similarly, leave the engineer with an incomplete and uncertain view of the world when used under practical financial constraints.</p> <p>When I listen to cocktail party discussions of why a construction project took so long and compare it to what civil engineers tell me caused the delay, the cocktail party discussion almost always exclusively discusses reasons that civil engineers tell me are incorrect. There are many reasons for delays and &quot;unexpected geotechnical conditions&quot; are a common one. 
Civil engineers are in a bind here since drilling cores is time consuming and expensive and people get mad when they see that the ground is dug up and no &quot;real work&quot; is happening (and likewise when preload is applied — &quot;why aren't they working on the highway?&quot;), which creates pressure on politicians which indirectly results in timelines that don't allow sufficient time to understand geotechnical conditions. This sometimes results in a geotechnical surprise during a project (typically phrased as &quot;unforseen geotechnical conditions&quot; in technical reports), which can result in major parts of a project having to switch to slower and more expensive techniques or, even worse, can necessitate a part of a project being redone, resulting in cost and schedule overruns.</p> <p>I've never heard a cocktail party discussion that discusses geotechnical reasons for project delays. Instead, people talk about high-level reasons that are plausible sounding to a lay person, but completely fabricated, reasons that are disconnected from reality. But if you want to discuss how things can be built more quickly and cheaply, &quot;<a href="https://www.theatlantic.com/science/archive/2019/07/we-need-new-science-progress/594946/">progress studies</a>&quot;, etc., this cannot be reasonably done without having some understanding of the geotechnical tradeoffs that are in play (as well as the tradeoffs from other civil engineering fields we haven't discussed).</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:C"><p>One thing we could do to keep costs under control is to do less geotechnical work and ignore geotechnical surprises up to some risk bound. Today, some of the &quot;amount of work&quot; done is determined by regulations and much of it is determined by case law, which gives a rough idea of what work needs to be done to avoid legal liability in case of various bad outcomes, such as a building collapse.</p> <p>If, instead of using case law and risk of liability to determine how much geotechnical derisking should be done, we compute this based on <a href="https://en.wikipedia.org/wiki/Quality-adjusted_life_year">QALYs</a> per dollar, at the margin, we seem to spend a very large amount of money geotechnical derisking compared to many other interventions.</p> <p>This is not just true of geotechnical work and is also true of other fields in civil engineering, e.g., builders in places like the U.S. and Canada do much more slump testing than is done in some countries that have a much faster pace of construction, which reduces the risk of a building's untimely demise. It would be both scandalous and a serious liability problem if a building collapsed because the builders of the building didn't do slump testing when they would've in the U.S. or Canada,, but buildings usually don't collapse even when builders don't do as much slump testing as tends to be done in the U.S. and Canada.</p> <p>Countries that don't build to standards roughly as rigorous as U.S. or Canadian standards sometimes have fairly recently built structures collapse in ways that would be considered shocking in the U.S. and Canada, but the number of lives saved per dollar is very small compared to other places the money could be spent. 
Whether or not we should change this with a policy decision is a more relevant discussion to building costs and timelines than the fabricated reasons I hear cocktail party discussions of construction costs, but I've never heard this or other concrete reasons for project cost brought up outside of civil engineering circles.</p> <p>Even if we just confine ourselves to work that's related to civil engineering as opposed to taking a broader, more EA-minded approach, and looking QALYs for all possible interventions, the tradeoff between resources spent on derisking during construction vs. resources spent derisking on an ongoing basis (inspections, maintenance, etc.), the relative resource levels weren't determined by a process that should be expected to produce anywhere near an optimal outcome.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:W">Some people suggest that writing is a good intermediate step that's quicker than getting external feedback while being more reliable than just thinking about something, but <a href="https://mobile.twitter.com/danluu/status/1453454530444619778/photo/1">I find writing too slow to be usable as a way to clarify ideas</a> and, after working on identifying when I'm having fuzzy thoughts, I find that trying to think through an idea to be more reliable as well as faster. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:R"><p>One part of this that I think is underrated by people who have a self-image of &quot;being smart&quot; is where book learning and thinking about something is sufficient vs. where on-the-ground knowledge of the topic is necessary.</p> <p>A fast reader can read the texts one reads for most technical degrees in maybe 40-100 hours. For a slow reader, that could be much slower, but it's still not really that much time. There are some aspects of problems where this is sufficient to understand the problem and come up with good, reasonable, solutions. <a href="hardware-unforgiving/">And there are some aspects of problems where this is woefully inefficient and thousands of hours of applied effort are required to really be able to properly understand what's going on</a>.</p> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> The container throttling problem cgroup-throttling/ Sat, 18 Dec 2021 00:00:00 +0000 cgroup-throttling/ <p><i>This is an excerpt from an internal document <b>David Mackey</b> and I co-authored in April 2019. The document is excerpted since much of the original doc was about comparing possible approaches to increasing efficency at Twitter, which is mostly information that's meaningless outside of Twitter without a large amount of additional explanation/context.</i></p> <p>At Twitter, most CPU bound services start falling over at around 50% reserved container CPU utilization and almost all services start falling over at not much more CPU utilization even though CPU bound services should, theoretically, be able to get higher CPU utilizations. Because load isn't, in general, evenly balanced across shards and the shard-level degradation in performance is so severe when we exceed 50% CPU utilization, this makes the practical limit much lower than 50% even during peak load events.</p> <p>This document will describe potential solutions to this problem. We'll start with describing why we should expect this problem given how services are configured and how the Linux scheduler we're using works. 
We'll then look into case studies on how we can fix this with config tuning for specific services, which can result in a 1.5x to 2x increase in capacity, which can translate into $[redacted]M/yr to $[redacted]M/yr in savings for large services. While this is worth doing and we might get back $[redacted]M/yr to $[redacted]M/yr in <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> by doing this for large services, manually fixing services one at a time isn't really scalable, so we'll also look at how we can make changes that can recapture some of the value for most services.</p> <h3 id="the-problem-in-theory">The problem, in theory</h3> <p>Almost all services at Twitter run on Linux with <a href="https://en.wikipedia.org/wiki/Completely_Fair_Scheduler">the CFS scheduler</a>, using <a href="https://www.kernel.org/doc/Documentation/cgroup-v2.txt">CFS bandwidth control quota</a> for isolation, with default parameters. The intention is to allow different services to be colocated on the same boxes without having one service's runaway CPU usage impact other services and to prevent services on empty boxes from taking all of the CPU on the box, resulting in unpredictable performance, which service owners found difficult to reason about before we enabled quotas. The quota mechanism limits the amortized CPU usage of each container, but it doesn't limit how many cores the job can use at any given moment. Instead, if a job &quot;wants to&quot; use more than that many cores over a quota timeslice, it will use more cores than its quota for a short period of time and then get throttled, i.e., basically get put to sleep, in order to keep its amortized core usage below the quota, which is disastrous for <a href="latency-pitfalls/">tail latency</a><sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Since the vast majority of services at Twitter use thread pools that are much larger than their mesos core reservation, when jobs have heavy load, they end up requesting and then using more cores than their reservation and then throttling. This causes services that are provisioned based on load test numbers or observed latency under load to over provision CPU to avoid violating their <a href="https://en.wikipedia.org/wiki/Service-level_objective">SLO</a>s. They either have to ask for more CPUs per shard than they actually need or they have to increase the number of shards they use.</p> <p>An old example of this problem was the JVM Garbage Collector. Prior to work on the JVM to make the JVM container aware, each JVM would default the GC parallel thread pool size to the number of cores on the machine. During a GC, all these GC threads would run simultaneously, exhausting the cpu quota rapidly causing throttling. The resulting effect would be that a subsecond stop-the-world GC pause could take many seconds of wallclock time to complete. While the GC issue has been fixed, the issue still exists at the application level for virtually all services that run on mesos.</p> <h3 id="the-problem-in-practice-case-study">The problem, in practice [case study]</h3> <p>As a case study, let's look at <code>service-1</code>, the largest and most expensive service at Twitter.</p> <p>Below is the CPU utilization histogram for this service just as it starts failing its load test, i.e., when it's just above the peak load the service can handle before it violates its SLO. 
The x-axis is the number of CPUs used at a given point in time and the y-axis is (relative) time spent at that utilization. The service is provisioned for 20 cores and we can see that the utilization is mostly significantly under that, even when running at nearly peak possible load:</p> <p><img src="images/cgroup-throttling/cpu-histogram-1.png" alt="Histogram for service with 20 CPU quota showing that average utilization is much lower but peak utilization is significantly higher when the service is overloaded and violates its SLO" width="1266" height="658"></p> <p>The problem is the little bars above 20. These spikes caused the job to use up its CPU quota and then get throttled, which caused latency to drastically increase, which is why the SLO was violated even though average utilization is about 8 cores, or 40% of quota. One thing to note is that the sampling period for this graph was 10ms and the quota period is 100ms, so it's technically possible to see an excursion above 20 in this graph without throttling, but on average, if we see a lot of excursions, especially way above 20, we'll likely get throttling.</p> <p>After reducing the thread pool sizes to avoid using too many cores and then throttling, we got the following CPU utilization histogram under a load test:</p> <p><img src="images/cgroup-throttling/cpu-histogram-2.png" alt="Histogram for service with 20 CPU quota showing that average utilization is much lower but peak utilization is significantly higher when the service is overloaded and violates its SLO" width="1216" height="548"></p> <p>This is at 1.6x the load (request rate) of the previous histogram. In that case, the load test harness was unable to increase load enough to determine peak load for <code>service-1</code> because the service was able to handle so much load before failure that the service that's feeding it during the load test couldn't keep it and send more load (although that's fixable, I didn't have the proper permissions to quickly fix it). [later testing showed that the service was able to handle about 2x the capacity after tweaking the thread pool sizes]</p> <p>This case study isn't an isolated example — Andy Wilcox has looked at the same thing for <code>service-2</code> and found similar gains in performance under load for similar reasons.</p> <p>For services that are concerned about latency, we can get significant latency gains if we prefer to get latency gains instead of cost reduction. For <code>service-1</code>, if we leave the provisioned capacity the same instead of cutting by 2x, we see a 20% reduction in latency.</p> <p>The gains for doing this for individual large services are significant (in the case of <code>service-1</code>, it's [mid 7 figures per year] for the service and [low 8 figures per year] including services that are clones of it, but tuning every service by hand isn't scalable. That raises the question: how many services are impacted?</p> <h3 id="thread-usage-across-the-fleet">Thread usage across the fleet</h3> <p>If we look at the number of active threads vs. number of reserved cores for moderate sized services (&gt;= 100 shards), we see that almost all services have many more threads that want to execute than reserved cores. It's not uncommon to see tens of <a href="https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf">runnable threads</a> per reserved core. 
This makes the <code>service-1</code> example, above, look relatively tame, at 1.5 to 2 runnable threads per reserved core under load.</p> <p>If we look at where these threads are coming from, it's common to see that a program has multiple thread pools where each thread pool is sized to either twice the number of reserved cores or twice the number of logical cores on the host machine. Both inside and outside of Twitter, It's common to see advice that thread pool size should be 2x the number of logical cores on the machine. This advice probably comes from a workload like picking how many threads to use for something like a gcc compile, where we don't want to have idle resources when we could have something to do. Since threads will sometimes get blocked and have nothing to do, going to 2x can increase throughput over 1x by decreasing the odds that any core is every idle, and 2x is a nice, round, number.</p> <p>However, there are a few problems with applying this to Twitter applications:</p> <ol> <li>Most applications have multiple, competing, thread pools</li> <li>Exceeding the reserved core limit is extremely bad</li> <li>Having extra threads working on computations can increase latency</li> </ol> <p>The &quot;we should provision 2x the number of logical cores&quot; model assumes that we have only one main thread pool doing all of the work and that there's little to no downside to having threads that could do work sit and do nothing and that we have a throughput oriented workload where we don't care about the deadline of any particular unit of work.</p> <p>With the CFS scheduler, threads that have active work that are above the core reservation won't do nothing, they'll get scheduled and run, but this will cause throttling, which negatively impacts tail latency.</p> <h3 id="potential-solutions">Potential Solutions</h3> <p>Given that we see something similar looking to our case study on many services and that it's difficult to push performance fixes to a lot of services (because service owners aren't really incentivized to take performance improvements), what can we do to address this problem across the fleet and just on a few handpicked large services? We're going to look at a list of potential solutions and then discuss each one in more detail, below.</p> <ul> <li>Better defaults for cross-fleet threadpools (eventbus, netty, etc.)</li> <li>Negotiating ThreadPool sizes via a shared library</li> <li>CFS period tuning</li> <li>CFS bandwidth slice tuning</li> <li>Other scheduler tunings</li> <li>CPU pinning and isolation</li> <li>Overprovision at the mesos scheduler level</li> </ul> <h4 id="better-defaults-for-cross-fleet-threadpools">Better defaults for cross-fleet threadpools</h4> <p><b>Potential impact</b>: some small gains in efficiency<br> <b>Advantages</b>: much less work than any comprehensive solution, can be done in parallel with more comprehensive solutions and will still yield some benefit (due to reduced lock contention and context switches) if other solutions are in place.<br> <b>Downsides</b>: doesn't solve most of the problem.</p> <p>Many defaults are too large. Netty default threadpool size is 2x the reserved cores. 
In some parts of [an org], they use a library that spins up <a href="https://news.ycombinator.com/item?id=26643392">eventbus</a> and allocates a threadpool that's 2x the number of logical cores on the host (resulting in [over 100] eventbus threads) when 1-2 threads is sufficient for most of their eventbus use cases.</p> <p>Adjusting these default sizes won't fix the problem, but it will reduce the impact of the problem and this should be much less work than the solutions below, so this can be done while we work on a more comprehensive solution.</p> <h4 id="negotiating-threadpool-sizes-via-a-shared-library-api">Negotiating ThreadPool sizes via a shared library (API)</h4> <p>[this section was written by <i>Vladimir Kostyukov</i>]</p> <p><b>Potential impact</b>: can mostly mitigate the problem for most services.<br> <b>Advantages</b>: quite straightforward to design and implement; possible to make it first-class in <a href="https://kostyukov.net/posts/finagle-101/">Finagle</a>/Finatra.<br> <b>Downsides</b>: Requires service-owners to opt-in explicitly (adopt a new API for constructing thread-pools).</p> <p>CSL’s util library has a package that bridges in some integration points between an application and a JVM (util-jvm), which could be a good place to host a new API for negotiating the sizes of the thread pools required by the application.</p> <p>The look and feel of such API is effectively dictated by how granular the negotiation is needed to be. Simply contending on a total number of allowed threads allocated per process, while being easy to implement, doesn’t allow distinguishing between application and IO threads. Introducing a notion of QoS for threads in the thread pool (i.e., “IO thread; can not block”, “App thread; can block”), on the other hand, could make the negotiation fine grained.</p> <h4 id="cfs-period-tuning">CFS Period Tuning</h4> <p><b>Potential impact</b>: small reduction tail latencies by shrinking the length of the time period before the process group’s CFS runtime quota is refreshed.<br> <b>Advantages</b>: relatively straightforward change requiring few minimal changes.<br> <b>Downsides</b>: comes at increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota. May result in more total throttling.</p> <p>To limit CPU usage, CFS operates over a time window known as the CFS period. Processes in a scheduling group take time from the CFS quota assigned to the cgroup and this quota is consumed over the cfs_period_us in CFS bandwidth slices. By shrinking the CFS period, the worst case time between quota exhaustion causing throttling and the process group being able to run again is reduced proportionately. 
Taking the default values of a CFS bandwidth slice of 5ms and CFS period of 100ms, in the worst case, a highly parallel application could exhaust all of its quota in the first bandwidth slice leaving 95ms of throttled time before any thread could be scheduled again.</p> <p>It's possible that total throttling would increase because the scheduled time over 100ms might not exceed the threshold even though there are (for example) 5ms bursts that exceed the threshold.</p> <h4 id="cfs-bandwidth-slice-tuning">CFS Bandwidth Slice Tuning</h4> <p><b>Potential impact</b>: small reduction in tail latencies by allowing applications to make better use of the allocated quota.<br> <b>Advantages</b>: relatively straightforward change requiring minimal code changes.<br> <b>Downsides</b>: comes at increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota.</p> <p>When CFS goes to schedule a process it will transfer run-time between a global pool and CPU local pool to reduce global accounting pressure on large systems.The amount transferred each time is called the &quot;slice&quot;. A larger bandwidth slice is more efficient from the scheduler’s perspective but a smaller bandwidth slice allows for more fine grained execution. In debugging issues in [link to internal JIRA ticket] it was determined that if a scheduled process fails to consume its entire bandwidth slice, the default slice size being 5ms, because it has completed execution or blocked on another process, this time is lost to the process group reducing its ability to consume all available resources it has requested.</p> <p>The overhead of tuning this value is expected to be minimal, but should be measured. Additionally, it is likely not a one size fits all tunable, but exposing this to the user as a tunable has been rejected in the past in Mesos. Determining a heuristic for tuning this value and providing a per application way to set it may prove infeasible.</p> <h4 id="other-scheduler-tunings">Other Scheduler Tunings</h4> <p><b>Potential Impact</b>: small reduction in tail latencies and reduced throttling.<br> <b>Advantages</b>: relatively straightforward change requiring minimal code changes.<br> <b>Downsides</b>: comes at potentially increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota.</p> <p>The kernel has numerous auto-scaling and auto-grouping features whose impact to scheduling performance and throttling is currently unknown. <code>kernel.sched_tunable_scaling</code> can adjust <code>kernel.sched_latency_ns</code> underneath our understanding of its value. <code>kernel.sched_min_granularity_ns</code> and <code>kernel.sched_wakeup_granularity_ns</code> can be tuned to allow for preempting sooner, allowing better resource sharing and minimizing delays. <code>kernel.sched_autogroup_enabled</code> may currently not respect <code>kernel.sched_latency_ns</code>leading to more throttling challenges and scheduling inefficiencies. 
These tunables have not been investigated significantly and the impact of tuning them is unknown.</p> <h4 id="cfs-scheduler-improvements">CFS Scheduler Improvements</h4> <p><b>Potential impact</b>: better overall cpu resource utilization and minimized throttling due to CFS inefficiencies.<br> <b>Advantages</b>: improvements are transparent to userspace.<br> <b>Downsides</b>: the CFS scheduler is complex so there is a large risk to the success of the changes and upstream reception to certain types of modifications may be challenging.</p> <p>How the CFS scheduler deals with unused slack time from the CFS bandwidth slice has shown to be ineffective. The kernel team has a patch to ensure that this unused time is returned back to the global pool for other processes to use, <a href="https://lore.kernel.org/patchwork/patch/907450/">https://lore.kernel.org/patchwork/patch/907450/</a> to ensure better overall system resource utilization. There are some additional avenues to explore that could provide further enhancements. Another of many recent discussions in this area that fell out of a k8s throttling issue(<a href="https://github.com/kubernetes/kubernetes/issues/67577">https://github.com/kubernetes/kubernetes/issues/67577</a>) is <a href="https://lkml.org/lkml/2019/3/18/706">https://lkml.org/lkml/2019/3/18/706</a>.</p> <p>Additionally, CFS may lose efficiency due to bugs such as [link to internal JIRA ticket] and <a href="http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf">http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf</a>. However, we haven't spent much time looking at the CFS performance for Twitter’s particular use cases. A closer look at CFS may find ways to improve efficiency.</p> <p>Another change which has more upside and downside potential would be to use a scheduler other than CFS.</p> <h4 id="cpu-pinning-and-isolation">CPU Pinning and Isolation</h4> <p><b>Potential impact</b>: removes the concept of throttling from the system by making the application developer’s mental model of a CPU map to a physical one. <br> <b>Advantages</b>: simplified understanding from application developer’s perspective, scheduler imposed throttling is no longer a concept an application contends with, improved cache efficiency, much less resource interference resulting in more deterministic performance.<br> <b>Disadvantages</b>: greater operational complexity, oversubscription is much more complicated, significant changes to current operating environment</p> <p>The fundamental issue that allows throttling to occur is that a heavily threaded application can have more threads executing in parallel than the “number of CPUs” it requested resulting in an early exhaustion of available runtime. By restricting the number of threads executing simultaneously to the number of CPUs an application requested there is now a 1:1 mapping and an application’s process group is free to consume the logical CPU thread unimpeded by the scheduler. Additionally, by dedicating a CPU thread rather than a bandwidth slice to the application, the application is now able to take full advantage of CPU caching benefits without having to contend with other applications being scheduled on the same CPU thread while it is throttled or context switched away.</p> <p>In Mesos, implementing CPU pinning has proven to be quite difficult. However, in k8s there is existing hope in the form of a project from Intel known as the k8s CPU Manager. 
The CPU Manager was added as an alpha feature to k8s in 1.8 and has been enabled as a beta feature since 1.10. It has somewhat stalled in beta as few people seem to be using it but the core functionality is present. The performance improvements promoted by the CPU Manager project are significant as shown in examples such as <a href="https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/">https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/</a> and <a href="https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf">https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf</a> While these benchmarks should be looked at with some skepticism, it does provide promising hope for exploring this avenue. A cursory inspection of the project highlights a few <a href="https://github.com/kubernetes/kubernetes/issues/70585">areas</a> where work may still be needed but it is already in a usable state for validating the approach. Underneath, the k8s CPU Manager leverages the cpuset cgroup functionality that is present in the kernel.</p> <p>Potentially, this approach does reduce the ability to oversubscribe the machines. However, the efficiency gains from minimized cross-pod interference, CPU throttling, a more deterministic execution profile and more may offset the need to oversubscribe. Currently, the k8s CPU Manager does allow for minor oversubscription in the form of allowing system level containers and the daemonset to be oversubscribed, but on a pod scheduling basis the cpus are reserved for that pod’s use.</p> <p>Experiments by Brian Martin and others have shown significant performance benefits from CPU pinning that are almost as large as our oversubscription factor.</p> <p>Longer term, oversubscription could be possible through a multitiered approach of wherein a primary class of pods is scheduled using CPU pinning but a secondary class of pods that is not as latency sensitive is allowed to float across all cores consuming slack resources from the primary pods. The work on the CPU Manager side would be extensive. However, recently <a href="https://lwn.net/ml/linux-kernel/20190408214539.2705660-1-songliubraving@fb.com/">Facebook has been doing some work</a> on the kernel scheduler side to further enable this concept in a way that minimally impacts the primary pod class that we can expand upon or evolve.</p> <h4 id="oversubscription-at-the-cluster-scheduler-level">Oversubscription at the cluster scheduler level</h4> <p><b>Potential impact</b>: can bring machine utilization up to an arbitrarily high level and overprovisioning &quot;enough&quot;.<br> <b>Advantages</b>: oversubscription at the cluster scheduler level is independent of the problem described in this doc; doing it in a data-driven way can drive machine utilization up without having to try to fix the specific problems described here. This could simultaneously fix the problem in this doc (low CPU utilization due to overprovisioning to avoid throttling) while also fixing [reference to document describing another problem].<br> <b>Disadvantages</b>: we saw in [link to internal doc] that shards of services running on hosts with high load have degraded performance. 
Unless we change the mesos scheduler to schedule based on actual utilization (as opposed to reservation), some hosts would end up too highly loaded, and services with shards that land on those hosts would have poor performance.</p> <h4 id="disable-cfs-quotas">Disable CFS quotas</h4> <p><b>Potential impact</b>: prevents throttling and allows services to use all available cores on a box by relying on the &quot;shares&quot; mechanism instead of quota.<br> <b>Advantages</b>: in some sense, can give us the highest possible utilization.<br> <b>Disadvantages</b>: badly behaved services could severely interfere with other services running on the same box. Also, service owners would have a much more difficult time predicting the performance of their own service since performance variability between the unloaded and loaded state would be much larger.</p> <p>This solution is what was used before we enabled quotas. From a naive hardware utilization standpoint, relying on the shares mechanism seems optimal since this means that, if the box is underutilized, services can take unused cores, but if the box becomes highly utilized, services will fall back to taking their share of cores, proportional to their core reservation. However, when we used this system, most service owners found it too difficult to estimate performance under load for this to be practical. At least one company has tried this solution to fix their throttling problem and has had severe incidents under load because of it. If we switched back to this today, we'd be no better off than we were before we enabled quotas.</p> <p>Given how we allocate capacity, two ingredients that would make this work better than it did before are a more carefully controlled request rate to individual shards and a load testing setup that allows service owners to understand what things would really look like during a load spike. Our current system only allows injection of unrealistic load to individual shards, which has two problems: the request mix isn't the same as it is under a real load spike, and the shard with injected load isn't seeing elevated load from other services running on the same box. Per [another internal document], we know that one of the largest factors impacting shard-level performance is overall load on the box and that the impact on latency is non-linear and difficult to predict, so there's not really a good way to predict performance under actual load from performance under load tests with the load testing framework we have today.</p> <p>Although these missing ingredients are important, high-impact issues, addressing either of them is beyond the scope of this doc; [Team X] owns load testing and is working on it, and it might be worth revisiting this when that problem is solved.</p> <p>An intermediate solution would be to set the scheduler quota to a larger value than the number of reserved cores in mesos, which would bound the impact of having &quot;too much&quot; CPU available causing unpredictable performance while potentially reducing throttling under high load because the scheduler will effectively fall back to the shares mechanism if the box is highly loaded. For example, if the cgroup quota were twice the mesos quota, services that fall over at 50% of reserved mesos CPU usage would then instead fall over at 100% of reserved mesos CPU usage.
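</p> <p>As a rough sketch of what that intermediate option could look like against the cgroup v1 interface (the paths, group names, and the 2x multiplier below are illustrative assumptions; this is a sketch of the idea, not something we actually built):</p> <pre><code># Illustrative sketch only: give a container a CFS quota that is 2x its
# scheduler reservation, while leaving cpu.shares proportional to the
# reservation. Assumes cgroup v1 paths; adjust as needed.
from pathlib import Path

CPU_CGROUP_ROOT = Path("/sys/fs/cgroup/cpu")   # assumption: cgroup v1 hierarchy

def set_relaxed_quota(cgroup_name, reserved_cores, multiplier=2):
    cg = CPU_CGROUP_ROOT / cgroup_name
    period_us = int((cg / "cpu.cfs_period_us").read_text())
    # Quota is CPU time per period, so "2x the reservation" is 2 * cores * period.
    (cg / "cpu.cfs_quota_us").write_text(str(reserved_cores * multiplier * period_us))
    # The shares mechanism (by convention, 1024 shares per reserved core) still
    # limits the container to roughly its proportional share once the whole box
    # is saturated, which is what bounds the blast radius of the extra quota.
    (cg / "cpu.shares").write_text(str(reserved_cores * 1024))

# e.g., a hypothetical service that reserved 8 cores would get a 16-core quota:
# set_relaxed_quota("mesos/some-service", reserved_cores=8)
</code></pre> <p>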
For boxes at high load, the higher overall utilization would reduce throttling because the increased load on the box would mean that a service that has too many runnable threads wouldn't be able to have as many of those threads execute. This has a weaker version of the downside of disabling quotas, in that, from [internal doc], we know that load on a box from other services is one of the largest factors in shard-level performance variance, and this would, if we don't change how many mesos cores are reserved on a box, increase load on boxes. And if we do proportionately decrease the number of mesos reserved cores on a box, that makes the change pointless in that it's equivalent to just doubling every service's CPU reservation, except that having it &quot;secretly&quot; doubled would probably reduce the number of people who ask the question, &quot;Why can't I exceed X% CPU in load testing without the service falling over?&quot;</p> <h3 id="results">Results</h3> <p><i>This section was not in the original document from April 2019; it was written in December 2021 and describes work that happened as a result of the original document.</i></p> <p>The suggestion of changing default thread pool sizes was taken and resulted in minor improvements. More importantly, two major efforts came out of the document. Vladimir Kostyukov (from the <a href="https://finagle.github.io/blog/2021/03/31/quarterly/">CSL team</a>) and Flavio Brasil (from the JVM team) created the <a href="https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/filter/OffloadFilter.scala">Finagle Offload Filter</a> and Xi Yang (my intern<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">2</a></sup> at the time and now a full-time employee for my team) created a kernel patch which eliminates container throttling (the patch is still internal, but will hopefully eventually be upstreamed).</p> <p>Almost all applications that run on mesos at Twitter run on top of <a href="https://kostyukov.net/posts/finagle-101/">Finagle</a>. The Finagle Offload Filter makes it trivial for service owners to put application work onto a different thread pool than IO (which was often not previously happening). In combination with sizing thread pools properly, this resulted in, ceteris paribus, applications having <a href="https://mobile.twitter.com/fbrasisil/status/1163974576511995904">drastically reduced latency</a>, enabling them to reduce their provisioned capacity and therefore their cost while meeting their SLO. Depending on the service, this resulted in a 15% to 60% cost reduction for the service.</p> <p>The kernel patch implements the obvious idea of preventing a container from using more cores than its quota at every moment, instead of allowing a container to use as many cores as are available on the machine and then putting the container to sleep if it uses too many cores in order to bring its amortized core usage down.</p> <p>In experiments on hosts running major services at Twitter, this has the expected impact of eliminating issues related to throttling, giving a roughly 50% cost reduction for a typical service with untuned thread pool sizes. And it turns out the net impact is larger than we realized when we wrote this document due to the reduction in interference caused by preventing services from using &quot;too many&quot; cores and then throttling<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>.
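</p> <p>Since the patch itself is internal, the following is only a toy model (mine, with made-up numbers) of the behavioral difference between CFS bandwidth control and a hard cap on concurrently running threads; it is not the patch's actual implementation:</p> <pre><code># Toy model: compare the instantaneous core usage of CFS bandwidth control
# vs. a hard cap on concurrently running threads, over one 100ms period.
# Numbers are illustrative, not measurements from any real service.

PERIOD_MS = 100
QUOTA_CORES = 4      # the container's reservation
HOST_CORES = 32      # idle cores available on the box
RUNNABLE = 32        # threads that want to run for the whole period

def cfs_bandwidth_profile():
    """Run as wide as the host allows until the budget is gone, then sleep."""
    budget = QUOTA_CORES * PERIOD_MS          # CPU-ms available this period
    profile = []
    for _ in range(PERIOD_MS):                # 1ms ticks
        cores = min(RUNNABLE, HOST_CORES, budget)
        profile.append(cores)
        budget -= cores
    return profile

def hard_cap_profile():
    """Never let more than QUOTA_CORES threads execute at once."""
    return [min(RUNNABLE, QUOTA_CORES) for _ in range(PERIOD_MS)]

for name, profile in (("bandwidth", cfs_bandwidth_profile()), ("hard cap", hard_cap_profile())):
    print(name, "total CPU-ms:", sum(profile), "peak concurrent cores:", max(profile))

# Both policies allow the same total CPU time per period, but the bandwidth
# policy alternates between a 32-core burst and an enforced sleep, while the
# hard cap runs 4 cores steadily: no throttling and far less interference.
</code></pre> <p>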
Also, although this was realized at the time, we didn't note in the document that the throttling issue causes shards to go from &quot;basically totally fine&quot; to a &quot;throttling death spiral&quot; that's analogous to a &quot;GC death spiral&quot; with only a small amount of additional load, which makes it more difficult to operate systems reliably. What happens is that, when a service is under high load, it will throttle. Throttling doesn't prevent requests from coming into the shard that's throttled, so when the shard wakes up from being throttled, it has even more work to do than it had before it throttled, causing it to use even more CPU and throttle more quickly, which causes even more work to pile up. Finagle has a mechanism that can shed load for shards that are in very bad shape (clients that talk to the dead server will mark the server as dead and stop sending requests for a while), but shards tend to get into this bad state when overall load to the service is high, so marking a node as dead just means that more load goes to other shards, which will then &quot;want to&quot; enter a throttling death spiral. Operating in a regime where throttling can cause a death spiral is <a href="https://twitter.com/copyconstruct/status/1399766443596472320">an inherently metastable state</a>. Removing both of these issues arguably has as large an impact as the cost reduction we see from eliminating throttling.</p> <p>Xi Yang has experimented with variations on the naive kernel scheduler change mentioned above, but even the naive change seems to be quite effective compared to no change, even though the naive change does mean that services will often not be able to hit their full CPU allocation when they ask for it, e.g., if a service requests no CPU for the first half of a period and then requests infinite CPU for the second half of the period, under the old system, it would get its allocated amount of CPU for the period, but under the new system, it would only get half. Some of Xi's variant patches address this issue in one way or another, but that has a relatively small impact compared to preventing throttling in the first place.</p> <p>An independent change that Pratik Tandel drove, which reduced the impact of throttling on services by reducing the impact of variance between shards, was to move to fewer, larger shards. The main goal for that change was to reduce overhead due to duplicate work/memory that happens across all shards, but it also happens to have an impact due to larger per-shard quotas reducing the impact of random noise.
Overall, this resulted in 0% to 20% reduced CPU usage and 10% to 40% reduced memory usage for large services at Twitter, depending on the service.</p> <h3 id="appendix-other-container-throttling-related-work">Appendix: other container throttling related work</h3> <ul> <li><a href="https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/">https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/</a></li> <li>Adding burstiness <ul> <li><a href="https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/">https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/</a></li> <li><a href="https://lkml.org/lkml/2019/11/26/196">https://lkml.org/lkml/2019/11/26/196</a></li> <li><a href="https://lwn.net/Articles/840595/">https://lwn.net/Articles/840595/</a></li> <li>A container that exceeds its allocation will still throttle, but the idea of &quot;burst capacity&quot; is added, allowing more margin before throttling while keeping basically the same average core utilization <ul> <li>Allowing burstiness is independent of our fix, which prevents throttling and, in principle, both ideas could be applied at the same time, which would be somewhat like how network isolation works if you enable htb qdisc</li> <li>Given the workloads and configurations that Twitter has, this does not fix the throttling problem for us with respect to either achieving very high per-container CPU utilization or preventing the metastability caused by the threat of a throttling death spiral, although it does allow us to use slightly more average CPU than without enabling burstiness</li> </ul></li> </ul></li> <li>Runtime-level parallelism limiting <ul> <li>Since Go typically uses a single thread pool, Uber was able to work around this issue by limiting the maximum number of running goroutines via <a href="https://github.com/uber-go/automaxprocs">https://github.com/uber-go/automaxprocs</a> <ul> <li>Unfortunately for Twitter, a number of Twitter's largest and most expensive services, including <code>service-1</code>, use multiple language runtimes, so there isn't a simple way to bound the parallelism at the runtime level</li> </ul></li> <li>The .NET runtime has had adaptive thread pool sizes for a decade, <a href="https://twitter.com/danluu/status/1340059907026898944">one of the many ways the .NET stack is more advanced than what we commonly see at trendy tech companies</a></li> </ul></li> </ul> <p><i>Thanks to Xi Yang, Ilya Pronin, Ian Downes, Rebecca Isaacs, Brian Martin, Vladimir Kostyukov, Moses Nakamura, Flavio Brasil, Laurence Tratt, Akshay Shah, Julian Squires, Michael Greenberg @synrotek, and Miguel Angel Corral for comments/corrections/discussion</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">if this box is highly loaded, because there aren't enough cores to go around, then a container may not get all of the cores it requests, but this doesn't change the fundamental problem. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:I">I often joke that <a href="https://twitter.com/danluu/status/1324416895013986305">interns get all of the most interesting work</a>, while us full-time employees are stuck with the stuff interns don't want to do.
<a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:M">In an independent effort, Matt Tejo found that, for a fixed average core utilization, services that throttle cause a much larger negative impact on other services on the same host than services that use a constant number of cores. That's because a service that's highly loaded and throttling toggles between attempting to use all of the cores on the box and then using none of the cores on the box, causing an extremely large amount of interference during the periods where it's attempting to use all of the cores on the box. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> Some thoughts on writing writing-non-advice/ Mon, 13 Dec 2021 00:00:00 +0000 writing-non-advice/ <p>I see a lot of essays framed as writing advice which are actually thinly veiled descriptions of how someone writes that basically say &quot;you should write how I write&quot;, e.g., <a href="https://twitter.com/danluu/status/1437539076324790274">people who write short posts say that you should write short posts</a>. As with technical topics, <a href="learn-what/">I think a lot of different things can work and what's really important is that you find a style that's suitable to you and the context you operate in</a>. <a href="https://twitter.com/danluu/status/1467235582199812097">Copying what's worked for someone else is unlikely to work for you</a>, making &quot;write how I write&quot; bad advice.</p> <p>We'll start by looking at how much variety there's been in what's worked<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup> for people, come back to what makes it so hard to copy someone else's style, and then discuss what I try to do in my writing.</p> <p>If I look at the most read programming blogs in my extended social circles<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">2</a></sup> from 2000 to 2017<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">3</a></sup>, it's been Joel Spolsky, Paul Graham, Steve Yegge, and Julia Evans (if you're not familiar with these writers, <a href="#appendix-some-snippets-of-writing">see the appendix for excerpts that I think are representative of their styles</a>). Everyone on this list has a different style in the following dimensions (as well as others):</p> <ul> <li>Topic selection</li> <li>Prose style</li> <li>Length</li> <li>Type of humor (if any)</li> <li>Level of technical detail</li> <li>Amount of supporting evidence</li> <li>Nuance</li> </ul> <p>To pick a simple one to quantify, length, Julia Evans and I both started blogging in 2013 (she has one post from 2012, but she's told me that she considers her blog to have started in earnest when she was at <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, in September 2013, the same month I started blogging). Over the years, we've compared notes a number of times and, until I paused blogging at the end of 2017, we had a similar word count on our blogs even though she was writing roughly one order of magnitude more posts than I do.</p> <p>To look at a few aspects that are difficult to quantify, consider this passage from Paul Graham, which is typical of his style:</p> <blockquote> <p>What nerds like is the kind of town where people walk around smiling. This excludes LA, where no one walks at all, and also New York, where people walk, but not smiling. 
When I was in grad school in Boston, a friend came to visit from New York. On the subway back from the airport she asked &quot;Why is everyone smiling?&quot; I looked and they weren't smiling. They just looked like they were compared to the facial expressions she was used to.</p> <p>If you've lived in New York, you know where these facial expressions come from. It's the kind of place where your mind may be excited, but your body knows it's having a bad time. People don't so much enjoy living there as endure it for the sake of the excitement. And if you like certain kinds of excitement, New York is incomparable. It's a hub of glamour, a magnet for all the shorter half-life isotopes of style and fame.</p> <p>Nerds don't care about glamour, so to them the appeal of New York is a mystery.</p> </blockquote> <p>It uses multiple aspects of what's sometimes called <a href="https://amzn.to/3dRLMgR">classic style</a>. In this post, when I say &quot;classic style&quot;, I mean as the term is used by <a href="https://amzn.to/3dRLMgR">Thomas &amp; Turner</a>, not a colloquial meaning. What that means is really too long to reasonably describe in this post, but I'll say that one part of it is that the prose is clean, straightforward, and simple; an editor whose slogan is &quot;omit needless words&quot; wouldn't have many comments. Another part is that the cleanness of the style goes past the prose to what information is presented, so much so that supporting evidence isn't really presented. Thomas &amp; Turner say &quot;truth needs no argument but only accurate presentation&quot;. A passage that exemplifies both of these is this one from Rochefoucauld:</p> <blockquote> <p>Madame de Chevreuse had sparkling intelligence, ambition, and beauty in plenty; she was flirtatious, lively, bold, enterprising; she used all her charms to push her projects to success, and she almost always brought disaster to those she encountered on her way.</p> </blockquote> <p>Thomas &amp; Turner said this about Rochefoucauld's passage:</p> <blockquote> <p>This passage displays truth according to an order that has nothing to do with the process by which the writer came to know it. The writer takes the pose of full knowledge. This pose implies that the writer has wide and textured experience; otherwise he would not be able to make such an observation. But none of that personal history, personal experience, or personal psychology enters into the expression. Instead the sentence crystallizes the writer’s experience into a timeless and absolute sequence, as if it were a geometric proof.</p> </blockquote> <p>Much of this applies to the passage by Paul Graham (though not all, since he tells us an anecdote about a time a friend visited Boston from New York and he explicitly says that you would know such and such &quot;if you've lived in New York&quot; instead of just stating what you would know).</p> <p>My style is opposite in many ways. I often have long, meandering, sentences, not for any particular literary purpose, but just because it reflects how I think. <a href="https://amzn.to/3s0Adwd">Strunk &amp; White</a> would have a field day with my writing. To the extent feasible, I try to have a structured argument and, when possible, evidence, with caveats for cases where the evidence isn't applicable.
Although not presenting evidence makes something read cleanly, that's not my choice because I don't like that the reader basically has to take or leave it with respect to bare assertions, such as &quot;what nerds like is the kind of town where people walk around smiling&quot;, and would prefer that readers know why I think something so they can agree or disagree based on the underlying reasons.</p> <p>With length, style, and the other dimensions mentioned, there isn't a right way and a wrong way. A wide variety of things can work decently well. Though, if popularity is the goal, then I've probably made a sub-optimal choice on length compared to Julia and on prose style when compared to Paul. If I look at what causes other people to gain a following, and what causes my RSS feed to get more traffic or me to get more Twitter followers, etc., publishing short posts frequently looks more effective than publishing long posts less frequently.</p> <p>I'm less certain about the impact of style on popularity, but my feeling is that, for the same reason that making a lot of confident statements at a job works (gets people promoted), writing confident, unqualified, statements, works (gets people readers). People like confidence.</p> <p>But, in both of these cases, one can still be plenty popular while making a sub-optimal choice and, for me, I view optimizing for other goals to be more important than optimizing for popularity. On length, I frequently cover topics that can't be covered in brief easily, or perhaps at all. One example of this is <a href="branch-prediction/">my post on branch prediction</a>, which has two goals: give a programmer with no background in branch prediction or even computer architecture a historical survey and teach them enough to be able to read and understand a modern, state-of-the-art paper on branch prediction. That post comes in at 5.8k words. I don't see how to achieve the same goals with a post that comes in at the lengths that people recommend for blog posts, 500 words, 1000 words, 1500 words, etc. The post could probably be cut down a bit, but every predictor discussed, except the <code>agree</code> predictor, is either a necessary building block used to explain later predictors or of historical importance. And even if the <code>agree</code> predictor weren't discussed, it would still be important to discuss at least one interference-reducing scheme, since why interference occurs and what can be done to reduce it is a fundamental concept in branch prediction.</p> <p>There are other versions of the post that could work. One that explains that branch prediction exists at all could probably be written in 1000 words. That post, written well, would have a wider audience and be more popular, but that's not what I want to write.</p> <p>I have an analogous opinion on style because I frequently want to discuss things in a level of detail and with a level of precision that precludes writing cleanly in the classic style. A specific, small, example is that, on a recent post, a draft reader asked me to remove a double negative and I declined because, in that case, the double negative had different connotations from the positive statement that might've replaced it and I had something precise I wanted to convey that isn't what would've been conveyed if I simplified the sentence.</p> <p>A more general thing is that Paul writes about a lot of &quot;big ideas&quot; at a high level. That's something that's amenable to writing in a clean, simple style; what Paul calls an elegant style.
But I'm not interested in writing about <a href="https://scattered-thoughts.net/writing/on-bad-advice/#context-matters">big ideas that are disconnected from low-level details</a> and it's difficult to effectively discuss low-level details without writing in a style Paul would call inelegant.</p> <p>A concrete example of this is <a href="cli-complexity/">my discussion of command line tools and the UNIX philosophy</a>. Should we have tools that &quot;do one thing and do it well&quot; and &quot;write programs to handle text streams, because that is a universal interface&quot; or use commands that have many options and can handle structured data? People have been trading the same high-level rebuttals back and forth for decades. But the moment we look at the details, look at what happens when these ideas get exposed to the real world, we can immediately see that one of these sets of ideas couldn't possibly work as espoused.</p> <p>Coming back to writing style, if you're trying to figure out what stylistic choices are right for you, you should start from your goals and what you're good at and go from there, not listen to somebody who's going to tell you to write like them. Besides being unlikely to work for you even if someone is able to describe what makes their writing tick, most advice is written by people who don't understand how their writing works. This may be difficult to see for writing if you haven't spent a lot of time analyzing writing, but it's easy to see this is true if you've taken a bunch of dance classes or had sports instruction that isn't from a very good coach. If you watch, for example, the median dance instructor and listen to their instructions, you'll see that their instructions are quite different from what they actually do. <a href="https://news.ycombinator.com/item?id=29524840">People who listen and follow instructions instead of attempting to copy what the instructor is doing will end up doing the thing completely wrong</a>. Most writing advice similarly fails to capture what's important.</p> <p>Unfortunately, <a href="https://www.youtube.com/watch?v=2THVvshvq0Q">copying someone else's style isn't easy either; most people copy entirely the wrong thing</a>. For example, Natalie Wynn noted that people who copy her style often copy the superficial bits without understanding what's driving the superficial bits to be the way they are:</p> <blockquote> <p>One thing I notice is when people aren’t saying anything. Like when someone’s trying to do a “left tube video essay” and they shove all this opulent shit onscreen because contrapoints, but it has nothing to do with the topic. What’s the reference? What are you saying??</p> <p>I made a video about shame, and the look is Eve in Eden because Eve was the first person to experience shame. So the visual is connected to the concept and hopefully it resonates more because of that. So I guess that’s my advice, try to say something</p> </blockquote> <p>If you look into what people who excel in their field have to say, you'll often see analogous remarks about other fields. For example, in Practical Shooting, Rob Leatham says:</p> <blockquote> <p>What keeps me busy in my classes is trying to help my students learn how to think. They say, &quot;Rob holds his hands like this...,&quot; and they don't know that the reason I hold my hands like this is not to make myself look that way. 
The end result is not to hold the gun that way; holding the gun that way is the end result of doing something else.</p> </blockquote> <p>And Brian Enos says:</p> <blockquote> <p>When I began ... shooting I had only basic ideas about technique. So I did what I felt was the logical thing. I found the best local shooter (who was also competitive nationally) and asked him how I should shoot. He told me without hesitation: left index finger on the trigger guard, left elbow bent and pulling back, classic boxer stance, etcetera, etcetera. I adopted the system blindly for a year or two before wondering whether there might be a system that better suited my structure and attitude, and one that better suited the shooting. This first style that I adopted didn't seem to fit me because it felt as though I was having to struggle to control the gun; I was never actually flowing with the gun as I feel I do now. My experimentation led me to pull ideas from all types of shooting styles: Isosceles, Modified Weaver, Bullseye, and from people such as Bill Blankenship, shotgunner John Satterwhite, and martial artist Bruce Lee.</p> <p>But ideas coming from your environment only steer you in the right direction. These ideas can limit your thinking by their very nature ... great ideas will arise from a feeling within yourself. This intuitive awareness will allow you to accept anything that works for you and discard anything that doesn't</p> </blockquote> <p>I'm citing those examples because they're written up in a book, but I've heard basically the same comment from instructors in a wide variety of activities, e.g., dance instructors I've talked to complain that people will ask about whether, during a certain motion, the left foot should cross in front of or behind the right foot, which is missing the point since what matters is that the foot placement is reasonable given how the person's center of gravity is moving, which may mean that the foot should cross in front or behind, depending on the precise circumstance.</p> <p>The more general issue is that a person who doesn't understand the thing they're trying to copy will end up copying unimportant superficial aspects of what somebody else is doing and miss the fundamentals that drive the superficial aspects. <a href="hardware-unforgiving/">This even happens when there are very detailed instructions</a>. Although watching what other people do can accelerate learning, especially for beginners who have no idea what to do, there isn't a shortcut to understanding something deeply enough to facilitate doing it well that can be summed up in simple rules, like &quot;omit needless words&quot;<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>.</p> <p>As a result, I view style as something that should fall out of your goals, and goals are ultimately a personal preference.
Personally, some goals that I sometimes have are:</p> <ul> <li>Explain a technical topic that a lot of people don't seem to understand at a level that's accessible to almost any professional programmer <ul> <li>Examples: <a href="branch-prediction/">branch prediction</a>, <a href="malloc-tutorial/">malloc</a>, <a href="intel-cat/">cache partitioning</a></li> </ul></li> <li>Make a case for a minority opinion (or one that was a minority opinion at the time, anyway): <ul> <li>Examples: <a href="deconstruct-files/">files are difficult to use</a>, <a href="startup-tradeoffs/">public tech companies can pay very well</a>, <a href="monorepo/">monorepos aren't stupid</a></li> </ul></li> <li><a href="why-benchmark/">Measure something</a></li> <li>Discuss phenomena I think are interesting: <ul> <li>Examples: <a href="discontinuities/">funny discontinuities</a>, <a href="hardware-unforgiving/">the difficulty of knowledge transfer</a>, <a href="wat/">normalization of deviance</a></li> </ul></li> </ul> <p>When you combine one of those goals with the preference of discussing things <a href="http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail">in detail</a>, you get a style that's different from any of the writers mentioned above, even if you want to use humor as effectively as Steve Yegge, write for as broad an audience as Julia Evans, or write as authoritatively as Paul Graham.</p> <p>When I think about the major components of my writing, the main thing that I view as driving how I write besides style &amp; goals is process. As with style, I view this as something where a wide variety of things can work, where it's up to you to figure out what works for you.</p> <p><a id="friction"></a>For myself, I had the following process goals when I started my blog:</p> <ul> <li>Low up-front investment, with as little friction as possible, maybe increasing investment over time if I continue blogging</li> <li>Improve writing technique/ability with each post without worrying too much about the writing quality of any specific post</li> <li>Only publish when I have something I feel is worth publishing</li> <li>Write a blog that I would want to subscribe to</li> <li>Write on my own platform</li> </ul> <p>The low up-front investment goal is because, when I surveyed blogs I'd seen, one of the most common blog formats was a blog that contained a single post explaining that the person was starting a blog, perhaps with another post explaining how their blog was set up, with no further posts. Another common blog format was blogs that had regular posts for a while, followed by a long dormant period with a post at the end explaining that they were going to start posting again, followed by no more posts (in some cases, there are a few such posts, with more time between each). Given the low rate of people continuing to blog after starting a blog, I figured I shouldn't bother investing in blog infra until I knew I was going to write for a while, so, even though I already owned this domain name, I didn't bother figuring out how to point this domain at github pages and just set up a default install of some popular blogging software and I didn't even bother doing that until I had already written a post. In retrospect, it was a big mistake to use Octopress (Jekyll); I picked it because I was hanging out with a bunch of folks who were doing trendy stuff at the time, but the fact that it was so annoying to set up that people organized little &quot;Octopress setup days&quot; was a bad sign.
And it turns out that, not only was it annoying to set up, it had a fair amount of breakage, used a development model that made it impossible to take upstream updates, and it was extremely slow (it didn't take long before it took a whole minute to build my blog, a ridiculous amount of time to &quot;compile&quot; a handful of blog posts). I should've either just written pure HTML until I had a few posts and then turned that into <a href="https://twitter.com/danluu/status/1244023627613274115">a custom static site generator</a>, or used <a href="https://wordpress.com/refer-a-friend/34nAGAYt06ZlZRYBjx1/">WordPress</a>, which can be spun up in minutes and trivially moved or migrated from. But, part of the low up-front investment involved not doing research into this and trusting that people around me were making reasonable decisions<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">5</a></sup>. Overall, I stand behind the idea of keeping startup costs low, but had I just ignored all of the standard advice and either done something minimal or used the out-of-fashion but straightforward option, I would've saved myself a lot of work.</p> <p>The &quot;improve writing&quot; goal is because I found my writing annoyingly awkward and wanted to fix that. I frequently wrote sentences or paragraphs that seemed clunky to me, like when you misspell a word and it looks wrong no matter how you try re-spelling it. Spellcheckers are now ubiquitous enough that you don't really run into the spelling problem anymore, but we don't yet have automated tools that will improve your writing (some attempts exist, but they tend to create bad writing). I didn't worry about any specific post since I figured I could easily spend years working on my writing and I didn't think that spending years re-editing a single post would be very satisfying.</p> <p><a href="p95-skill/">As we've discussed before, getting feedback can greatly speed up skill acquisition</a>, so I hired a professional editor whose writing I respect with the instruction &quot;My writing is clunky and awkward and I'd like to fix it. I don't really care about spelling and grammar issues. Can you edit my writing with that in mind?&quot;. I got detailed feedback on a lot of my posts. I tried to fix the issues brought up in the feedback but, more importantly, tried to write my next post without it having the same or other previously mentioned issues. I can be a bit of a slow learner, so it sometimes took a few posts to iron out an issue but, over time, my writing improved a lot.</p> <p>The goal of only publishing when I felt I had something worth publishing is because I generally prefer process goals to outcome goals, at least with respect to personal goals. I originally had a goal of spending a certain amount of time per month blogging, but I got rid of that when I realized that I'd tend to spend enough time writing regardless of whether or not I made it an obligation. I think that outcome goals with respect to blogging do work for some people (e.g., &quot;publish one post per week&quot;), but if your goal is to improve writing quality, having outcome goals can be counterproductive (e.g., to hit a &quot;publish one post per week&quot; goal on limited time, someone might focus on getting something out the door and then not think about how to improve quality since, from the standpoint of the outcome goal, improving quality is a waste of time).
There are a bunch of things I don't like in other blogs, so I try to avoid them. Some examples:</p> <ul> <li>Breaking up what could be a single post into a bunch of smaller posts</li> <li>Clickbait titles</li> <li>Repeatedly blogging about the same topic with nothing new to say <ul> <li>A sub-category of this is having some kind of belief and then blogging about it every time a piece of evidence shows up that confirms the belief while not mentioning evidence that shows up that disconfirms the belief</li> </ul></li> <li>Not having an RSS or atom feed</li> </ul> <p>Writing on my own platform is the most minor of these. A major reason for that comes out of what's happened to platforms. At the time I started my blog, a number of platforms had already come and gone. Most recently, Twitter had acquired Posterous and shut it down. For a while, Posterous was the trendiest platform around and Twitter's decision to kill it entirely broke links to many of the all-time top voted HN posts, among others. Blogspot, a previously trendy place to write, had also been acquired by Google and severely degraded the reader experience on many sites afterwards. Avoiding trendy platforms has worked out well. <a id="svbtle">The two trendy platforms</a> people were hopping on when I started blogging were <a href="https://news.ycombinator.com/item?id=4268832">Svbtle</a> and Medium. <a href="https://www.designernews.co/stories/44300-is-svbtle-dead">Svbtle was basically abandoned shortly after I started my blog</a> when it became clear that Medium was going to dominate Svbtle on audience size. And Medium never managed to find a good monetization strategy and severely degraded the user experience for readers in an attempt to generate enough revenue to justify its valuation after raising $160M. You can't trust someone else's platform to not disappear underneath you or radically change in the name of profit.</p> <p>A related thing I wanted to do was write in something that's my own space (as opposed to in internet comments). I used to write a lot of HN comments<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">6</a></sup>, but the half-life of an HN comment is short. With very few exceptions, basically all of the views a comment is going to get will be in the first few days. With a blog, it's the other way around. A post might get a burst of traffic initially but, as long as you keep writing, most traffic will come later (e.g., for my blog, I tend to get roughly twice as many hits as the baseline level when a post is on HN, and of course I don't have a post on HN most days). It isn't really much more work to write a &quot;real blog post&quot; instead of writing an HN comment, so I've tended to favor writing blog posts instead of HN comments. Also, when I write here, most of the value created is split between myself and readers. If I were to write on someone else's platform, most of the value would be split between the platform and readers. If I were doing video, I might not really have a choice outside of YouTube or Twitch but, for text, I have a real choice. Looking at <a href="http://exquora.thoughtstorms.info/">how things worked out for people who made the other choice and decided to write comments for a platform</a>, I think I made the right choice for the right reasons.
I do see <a href="https://twitter.com/foone/status/1066547904532242437">the appeal of the reduced friction that commenting on an existing platform offers</a> but, even so, I'd rather pay the cost of the extra friction and write something that's in my space instead of elsewhere.</p> <p>All of that together is basically it. That's how I write.</p> <p>Unlike other bloggers, I'm not going to try to tell you &quot;how to write usefully&quot; or &quot;how to write well&quot; or anything like that. I agree with Steve Yegge when he says that <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">you should consider writing because it's potentially high value and the value may show up in ways you don't expect</a>, but how you write should really come from your goals and aptitudes.</p> <h4 id="appendix-changes-in-approach-over-time">Appendix: changes in approach over time</h4> <p>When I started the blog, I used to worry that a post wouldn't be interesting enough because it only contained a simple idea, so I'd often wait until I could combine two or more ideas into a single post. In retrospect, I think many of my early posts would've been better off as separate posts. For example, <a href="bimodal-compensation/">this post on compensation</a> from 2016 contains the idea that compensation might be turning bimodal and that programmers are unbelievably well paid given the barriers to entry compared to other fields that are similarly remunerative, such as finance, law, and medicine. I don't think there was much value-add to combining the two ideas into a single post and I think a lot more people would've read the bit about how unusually well paid programmers are if it wasn't bundled into a post about compensation becoming bimodal.</p> <p>Another thing I used to do is avoid writing things that seem too obvious. <a href="https://www.patreon.com/posts/58713950">But, I've come around to the idea that there's a lot of value in writing down obvious things</a> and a number of my most influential posts have been on things I would've previously considered too obvious to write down:</p> <ul> <li><a href="look-stupid/">look-stupid/</a></li> <li><a href="people-matter/">people-matter/</a></li> <li><a href="culture/">culture/</a></li> <li><a href="learn-what/">learn-what/</a></li> <li><a href="productivity-velocity/">productivity-velocity/</a></li> <li><a href="in-house/">in-house/</a></li> </ul> <p>Excluding these recent posts, more people have told me that <a href="look-stupid/">look-stupid/</a> has changed how they operate than all other posts combined (and the only reason it's even close is that a lot of people have told me that my discussions of compensation caused them to realize that they can find a job they enjoy more that also pays hundreds of thousands a year more than they were previously making, which is the set of posts that's drawn the most comments from people telling me that the post was pointless because everybody knows how much you can make in tech).</p> <p>A major, and relatively recent, style change I'm trying out is using more examples. This was prompted by comments from Ben Kuhn, and I like it so far.
Compared to most bloggers, <a href="https://twitter.com/benskuhn/status/1431671165320376327">I wasn't exactly light on examples in my early days</a>, but one thing I've noticed is that adding more examples than I would naturally tend to can really clarify things for readers; having &quot;a lot&quot; of examples reduces the rate at which people take away wildly different ideas than the ones I meant. A specific example of this would be, in a post discussing <a href="p95-skill/">what it takes to get to 95%-ile performance</a>, I only provided a couple examples and <a href="corrections/#p95">many people filled in the blanks and thought that performance that's well above 99.9%-ile is 95%-ile, e.g., that being a chess GM</a> is 95%-ile.</p> <p>Another example of someone who's made this change is Jamie Brandon. If you read his early posts, <a href="https://www.scattered-thoughts.net/writing/imperative-thinking-and-the-making-of-sandwiches/">such as this one</a>, he often has a compelling idea with a nice turn of phrase, e.g., this bit about when he was working on Eve with Chris Granger:</p> <blockquote> <p>People regularly tell me that imperative programming is the natural form of programming because 'people think imperatively'. I can see where they are coming from. Why, just the other day I found myself saying, &quot;Hey Chris, I'm hungry. I need you to walk into the kitchen, open the cupboard, take out a bag of bread, open the bag, remove a slice of bread, place it on a plate...&quot; Unfortunately, I hadn't specified where to find the plate so at this point Chris threw a null pointer exception and died.</p> </blockquote> <p>But, despite having parts that are really compelling, his earlier writing was often somewhat disconnected from the real world in a way that Jamie doesn't love when looking back on his old posts. On adding more details, Jamie says</p> <blockquote> <p>The point of focusing down on specific examples and keeping things as concrete as possible is a) makes me less likely to be wrong, because non-concrete ideas are very hard to falsify and I can trick myself easily b) makes it more likely that the reader absorbs the idea I'm trying to convey rather than some superficially similar idea that also fits the vague text.</p> <p>Examples kind of pin ideas down so they can be examined properly.</p> </blockquote> <p>Another big change, the only one I'm going to discuss here that really qualifies as prose style, is that I try much harder to write things where there's continuity of something that's sometimes called &quot;narrative grammar&quot;. <a href="https://web.archive.org/web/20180611151516/http://www.sterlingediting.com/narrative-grammar-an-exercise/">This post by Nicola Griffith has some examples of this at the sentence level</a>, but I also try to think about this in the larger structure of my writing. I don't think I'm particularly good at this, but thinking about this more has made my writing easier to follow. This change, especially on larger scales, was really driven by working with a professional editor who's good at spotting structural issues that make writing more difficult to understand. But, at the same time, I don't worry too much if there's a reason that something is difficult to follow. 
A specific example of this is, if you read answers to questions on <a href="https://ask.metafilter.com">ask metafilter</a> or reddit, any question that isn't structurally trivial will have a large fraction of answers from people who failed to read the question and answered the wrong question, e.g., if someone asks for something that has two parts connected with an <code>and</code>, many people will only read one half of the <code>and</code> and give an answer that's clearly disqualified by the <code>and</code> condition. If many people aren't going to read a short question closely enough to write up an answer that satisfies both halves of an <code>and</code>, many people aren't going to follow the simplest things anyone might want to write. I don't think it's a good use of a writer's time to try to walk someone who can't be bothered with reading both sides of an <code>and</code> through a structured post, but I do think there's value in trying to avoid &quot;narrative grammar&quot; issues that might make it harder for someone who does actually want to read.</p> <h4 id="appendix-getting-feedback">Appendix: getting feedback</h4> <p><a href="p95-skill/">As we've previously discussed</a>, feedback can greatly facilitate improvement. Unfortunately, the idea from that post, that 95%-ile performance is generally poor, also applies to feedback, making most feedback counterproductive.</p> <p>I've spent a lot of time watching people get feedback in private channels and seeing how they change their writing in response to it and, at least in the channels that I've looked at (programmers and not professional writers or editors commenting), most feedback is ignored. And when feedback is taken, because almost all feedback is bad and people generally aren't perfect or even very good at picking out good feedback, the feedback that's taken is usually bad.</p> <p>Fundamentally, most feedback has the issue mentioned in this post and is a form of &quot;you should write it like I would've written it&quot;, which generally doesn't work unless the author of the feedback is very careful in how they give the feedback, which few people are. The feedback tends to be superficial advice that misses serious structural issues in writing. Furthermore, the feedback also tends to be &quot;lowest common denominator&quot; feedback that turns nice prose into Strunk-and-White-ified mediocre prose. I don't think that I have a particularly nice prose style, but I've seen a number of people who have a naturally beautiful style ask for feedback from programmers, which has turned their writing into boring prose that anyone could've written.</p> <p>The other side of this is that when people get what I think is good, substantive, feedback, the most common response is &quot;nah, it's fine&quot;. I think of this as the flip side of most feedback being &quot;you should write it how I'd write it&quot;. Most people's response to feedback is &quot;I want to write it how I want to write it&quot;.</p> <p>Although this post has focused on how a wide variety of styles can work, it's also true that, given a style and a set of goals, writing can be better or worse. But, most people who are getting feedback <a href="https://twitter.com/danluu/status/1428445465662603272">don't know enough about writing to know what's better and what's worse</a>, so they can't tell the difference between good feedback and bad feedback.</p> <p>One way around this is to get feedback from someone whose judgement you trust.
As mentioned in the post, the way I did this was by hiring a professional editor whose writing (and editing) I respected.</p> <p>Another thing I do, one that's a core aspect of my personality and not really about writing, is that I take feedback relatively seriously and try to avoid having a &quot;nah, it's fine&quot; response to feedback. I wouldn't say that this is optimal since I've sometimes spent far too much time on bad feedback, but a core part of how I think is that I'm aware that most people are overconfident and frequently wrong because of their overconfidence, so I don't trust my own reasoning and spend a relatively large amount of time and effort thinking about feedback in an attempt to reduce my rate of overconfidence.</p> <p>At times, I've spent a comically long amount of time mulling over what is, in retrospect, very bad and &quot;obviously&quot; incorrect feedback that I've been wary of dismissing as incorrect. One thing I've noticed is that, as people gain an audience, some people become more and more confident in themselves and eventually end up becoming highly overconfident. It's easy to see how this happens — as you gain prominence, you'll get more exposure and more &quot;fans&quot; who think you're always right and, on the flip side, you'll also get more &quot;obviously&quot; bad comments.</p> <p>Back when basically no one read my blog, most of the comments I got were quite good. As I've gotten more and more readers, the percentage of good comments has dropped. From looking at how other people handle this, one common failure mode is that they'll see the massive number of obviously wrong comments that their posts draw and then incorrectly conclude that all of their critics are bozos and that they're basically never wrong. I don't really have an antidote to that other than &quot;take criticism very seriously&quot;. Since the failure mode here involves blind spots in judgement, I don't see a simple way to take a particular piece of criticism seriously that doesn't have the potential to result in incorrectly dismissing the criticism due to a blind spot.</p> <p>Fundamentally, my solution to this has been to avoid looking at most feedback while trying to take feedback from people I trust.</p> <p>When it comes to issues with the prose, the approach discussed above, hiring a professional editor whose writing and editing I respect and deferring to them on issues with my prose, worked well.</p> <p>When it comes to logical soundness or just general interestingness, those are more difficult to outsource to a single person, and I have <a href="https://twitter.com/Aella_Girl/status/1453936895088488449">a set of people whose judgement I trust</a> who look at most posts. If anyone whose judgement I trust thinks a post is interesting, I view that as a strong confirmation and I basically ignore comments that something is boring or uninteresting. For almost all of my posts that are among my top posts in terms of the number of people who told me the post was life changing for them, I got a number of comments from people whose judgement I otherwise think isn't terrible saying that the post seemed boring, pointless, too obvious to write, or just plain uninteresting. I used to take comments that something was uninteresting seriously but, in retrospect, that was a mistake that cost me a lot of time and didn't improve my writing.
I think this isn't so different from people who say &quot;write how I write&quot;; instead, it's people who have a similar mental model, but with respect to interestingness instead, who can't imagine that other people would find something interesting that they don't. Of course, not everyone's mind works like that, but people who are good at modeling what other people find interesting generally don't leave feedback like &quot;this is boring/pointless&quot;, so feedback of that form is almost guaranteed to be worthless.</p> <p>When it comes to the soundness of an argument, I take the opposite approach to the one I take for interestingness, in that I take negative comments very seriously and I don't do much about positive comments. I have, sometimes, wasted a lot of time on particular posts because of that. My solution to that has been to try to ignore feedback from people who regularly give bad feedback. That's something I think of as dangerous to do since selectively choosing to ignore feedback is a good way to create an echo chamber, but really seriously taking the time to think through feedback when I don't see a logical flaw is time consuming enough that I don't think there's really another alternative given how I re-evaluate my own work when I get feedback.</p> <p>One thing I've started doing recently that's made me feel a lot better about this is to look at what feedback people give to others. People who give me bad feedback generally also give other people feedback that's bad in pretty much exactly the same ways. Since I'm not really concerned that I have some cognitive bias that might mislead me into thinking I'm right and their feedback is wrong when it comes to their feedback on other people's writing, instead of spending hours trying to figure out if there's some hole in how I'm explaining something that I'm missing, I can spend minutes seeing that their feedback on someone else's writing is bogus feedback and then see that their feedback on my writing is bogus in exactly the same way.</p> <h4 id="appendix-where-i-get-ideas">Appendix: where I get ideas</h4> <p>I often get asked how I get ideas. I originally wasn't going to say anything about this because I don't have much to say, but Ben Kuhn strongly urged me to add this section &quot;so that other people realize what an alien you are&quot;.</p> <p>My feeling is that the world is so full of interesting stuff that ideas are everywhere. I have on the order of a hundred drafts lying around that I think are basically publishable that I haven't prioritized finishing up for one reason or another. If I think of ideas where I've sketched out a post in my head but haven't written it down, the number must be well into the thousands. If I were to quit my job and then sit down to write full-time until I died, I think I wouldn't run out of ideas even if I stuck to ones I've already had. The world is big and wondrous and fractally interesting.</p> <p>For example, I recently took up surf skiing (a kind of kayaking) and I'd say that, after a few weeks, I had maybe twenty or so blog post ideas that I think could be written up for a general audience in the sense that <a href="branch-prediction/">this post on branch prediction</a> is written for a general audience, in that it doesn't assume any hardware background. I could write two posts on different technical aspects of canoe paddle evolution and design as well as two posts on cultural factors and how they impacted the uptake of different canoe paddle designs.
Kayak paddle design has been, in recent history, a lot richer, and that could easily be another five or six posts. The technical aspects of hull design are richer still and could be an endless source of posts, although I only have four particular posts in mind at the moment, but the cultural and historical aspects also seem interesting to me and that's what rounds out the twenty things in my head with respect to that.</p> <p>I don't have twenty posts on kayaking and canoeing in my head because I'm particularly interested in kayaking and canoeing. Everything seems interesting enough to write twenty posts about. A lot of my posts that exist are part of what might become a much longer series of posts if I ever get around to spending the time to write them up. For example, <a href="bad-decisions/">this post on decision making in baseball</a> was, in my head, the first of a long-ish (10+) post series on decision making that I never got around to writing and that I suspect I'll never write because there's too much other interesting stuff to write about and not enough time.</p> <h4 id="appendix-other-writing-about-writing">Appendix: other writing about writing</h4> <ul> <li>Richard Lanham: <a href="https://amzn.to/3DTiVU2">Analyzing Prose</a> <ul> <li>I think it's not easy to take anything directly actionable away from this book, but I found the way that it dissects the rhythm of prose to be really interesting</li> </ul></li> <li>Robert Alter: <a href="https://amzn.to/3oOs3VT">The Five Books of Moses</a> <ul> <li>For the footnotes on why Robert Alter made certain subtle choices in his translation</li> </ul></li> <li>Francis-Noel Thomas &amp; Mark Turner: <a href="https://amzn.to/3ymHFDh">Clear and Simple as the Truth</a> <ul> <li>If you want to write in a clean, authoritative, style <ul> <li>People who use this as a manual typically write in an unnuanced fashion with a lot of incorrect statements, but I don't think that's necessary. Also, the writing is often compelling, which many people prefer over nuance anyway; many popular writers in tech use an analogous style</li> </ul></li> </ul></li> <li>Gary Hoffman &amp; Glynis Hoffman: <a href="https://amzn.to/33qX6if">Adios, Strunk &amp; White: A Handbook for the New Academic Essay</a></li> <li>Tracy Kidder &amp; Richard Todd: <a href="https://amzn.to/3DRTCSp">Good Prose: The Art of Nonfiction</a> <ul> <li>This book was recommended to me by Kelly Eskridge for its in-depth look at how an editor and a writer interact and I found it useful to keep in mind when working with an editor; reading this book is probably an inefficient way to get a better understanding of what working with a good editor looks like, but it's probably worth reading if you're curious how the author of <a href="https://amzn.to/3ETX7Jw">The Soul of a New Machine</a> writes; if you're not sure what an editor could do for you, this is a nice read</li> </ul></li> <li>Steve Yegge: <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">You Should Write Blogs</a> <ul> <li>In particular, for &quot;Reason #3&quot;, the Jacob Gabrielson / Zero Config story, although the whole thing is worth reading</li> </ul></li> <li>Lawrence Tratt: <a href="https://tratt.net/laurie/blog/entries/what_ive_learnt_so_far_about_writing_research_papers.html">What I’ve Learnt So Far About Writing Research Papers</a> <ul> <li>Well written, just like everything else from Lawrence.
Also, I think it's interesting in that Laurence has a completely different process than mine in most major dimensions, but the resultant style is relatively similar if you compare across all programming bloggers (certainly more similar than any of the authors mentioned in the body of this post)</li> </ul></li> <li>Julia Evans: <a href="https://jvns.ca/blog/2020/12/05/how-i-write-useful-programming-comics/">How I write useful programming comics</a> <ul> <li>Nice explanation of what makes Julia's zines tick; also completely different from my approach, but this time with a completely different result</li> </ul></li> <li>Yossi Kreinin: <a href="https://yosefk.com/blog/blogging-is-hard.html">Blogging is hard</a> (the title is a contrast to his next post, &quot;low level is easy&quot;) <ul> <li>A rare example of a first post that's basically &quot;I'm going to write a blog&quot; that's both interesting and has interesting future posts that follow; also Yossi's writing philosophy</li> </ul></li> <li>Phil Eaton: <a href="https://notes.eatonphil.com/2024-04-10-what-makes-a-great-tech-blog.html">What makes a great technical blog</a> <ul> <li>A brief summary of properties that Phil likes in technical blogs. It's sort of the opposite of what people usually take away from Thomas and Turner's Clear and Simple as the Truth</li> </ul></li> </ul> <h4 id="appendix-things-that-increase-popularity-that-i-generally-don-t-do">Appendix: things that increase popularity that I generally don't do</h4> <p>Here are some things that I don't do, but that appear to work based on observing what works for other people; if you want a broad audience, perhaps you can try some of them out:</p> <ul> <li>Use clickbait titles <ul> <li>Swearing or saying that something &quot;is cancer&quot; or &quot;is the Vietnam of X&quot; or some other highly emotionally loaded phrase seems to be particularly effective</li> </ul></li> <li>Talk up prestige/accomplishments/titles</li> <li>Use an authoritative tone and/or style</li> <li>Write things with an angry tone or that are designed to induce anger</li> <li>Write frequently</li> <li>Get endorsements from people</li> <li>Write about hot, current, topics <ul> <li>Provide takes on recent events</li> </ul></li> <li>Use deliberately outrageous / controversial framings on topics</li> </ul> <h4 id="appendix-some-snippets-of-writing">Appendix: some snippets of writing</h4> <p>In case you're not familiar with the writers mentioned, here are some snippets that I think are representative of their writing styles:</p> <p>Joel Spolsky:</p> <blockquote> <p>Why I really care is that Microsoft is vacuuming up way too many programmers. Between Microsoft, with their shady recruiters making unethical exploding offers to unsuspecting college students, and Google (you're on my radar) paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex and walk around trying to get someone...anyone...to come see the demo code they've just written with their &quot;20% time,&quot; doing some kind of, let me guess, cloud-based synchronization...
between Microsoft and Google the starting salary for a smart CS grad is inching dangerously close to six figures and these smart kids, the cream of our universities, are working on hopeless and useless architecture astronomy because these companies are like cancers, driven to grow at all cost, even though they can't think of a single useful thing to build for us, but they need another 3000-4000 comp sci grads next week. And dammit foosball doesn't play <i>itself</i>.</p> </blockquote> <p>and</p> <blockquote> <p>When I started interviewing programmers in 1991, I would generally let them use any language they wanted to solve the coding problems I gave them. 99% of the time, they chose C. Nowadays, they tend to choose Java ... Java is not, generally, a hard enough programming language that it can be used to discriminate between great programmers and mediocre programmers ... Nothing about an all-Java CS degree really weeds out the students who lack the mental agility to deal with these concepts. As an employer, I’ve seen that the 100% Java schools have started churning out quite a few CS graduates who are simply not smart enough to work as programmers on anything more sophisticated than Yet Another Java Accounting Application, although they did manage to squeak through the newly-dumbed-down coursework. These students would never survive 6.001 at MIT, or CS 323 at Yale, and frankly, that is one reason why, as an employer, a CS degree from MIT or Yale carries more weight than a CS degree from Duke, which recently went All-Java, or U. Penn, which replaced Scheme and ML with Java</p> </blockquote> <p>Paul Graham:</p> <blockquote> <p>A couple years ago a venture capitalist friend told me about a new startup he was involved with. It sounded promising. But the next time I talked to him, he said they'd decided to build their software on Windows NT, and had just hired a very experienced NT developer to be their chief technical officer. When I heard this, I thought, these guys are doomed. One, the CTO couldn't be a first rate hacker, because to become an eminent NT developer he would have had to use NT voluntarily, multiple times, and I couldn't imagine a great hacker doing that; and two, even if he was good, he'd have a hard time hiring anyone good to work for him if the project had to be built on NT.</p> </blockquote> <p>and</p> <blockquote> <p>What sort of people become haters? Can anyone become one? I'm not sure about this, but I've noticed some patterns. Haters are generally losers in a very specific sense: although they are occasionally talented, they have never achieved much. And indeed, anyone successful enough to have achieved significant fame would be unlikely to regard another famous person as a fraud on that account, because anyone famous knows how random fame is.</p> </blockquote> <p>Steve Yegge:</p> <blockquote> <p>When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: &quot;Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course.&quot; Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.</p> <p>This is a wonderful book about how to write good code, and there aren't many books like it. None, maybe. 
They don't typically teach you how to write good code in school, and you may never learn on the job. It may take years, but you may still be missing some key ideas. I certainly was. ... If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!</p> </blockquote> <p>and</p> <blockquote> <p>Jeff Bezos is an infamous micro-manager. He micro-manages every single pixel of Amazon's retail site. He hired Larry Tesler, Apple's Chief Scientist and probably the very most famous and respected human-computer interaction expert in the entire world, and then ignored every goddamn thing Larry said for three years until Larry finally -- wisely -- left the company. Larry would do these big usability studies and demonstrate beyond any shred of doubt that nobody can understand that frigging website, but Bezos just couldn't let go of those pixels, all those millions of semantics-packed pixels on the landing page. They were like millions of his own precious children. So they're all still there, and Larry is not.</p> <p>Micro-managing isn't that third thing that Amazon does better than us, by the way. I mean, yeah, they micro-manage really well, but I wouldn't list it as a strength or anything. I'm just trying to set the context here, to help you understand what happened. We're talking about a guy who in all seriousness has said on many public occasions that people should be paying him to work at Amazon. He hands out little yellow stickies with his name on them, reminding people &quot;who runs the company&quot; when they disagree with him. The guy is a regular... well, Steve Jobs, I guess. Except without the fashion or design sense. Bezos is super smart; don't get me wrong. He just makes ordinary control freaks look like stoned hippies.</p> </blockquote> <p>Julia Evans:</p> <blockquote> <p>Right now I’m on a million-hour train ride from New York to Montreal. So I’m looking at the output of strace because, uh, <code>strace</code> is cool, and it is teaching me some things about how the command line tools I use all the time work.</p> <p>What <code>strace</code> does is capture every single system call that gets called when executing a program. System calls are the interface between userspace programs and the kernel, so looking at the output from strace is a fun way to understand how Linux works, and what’s really involved in running a program.</p> <p>For example! <code>killall</code>! I ran</p> <p><code>strace killall ruby1.9.1 2&gt; killall-log.</code></p> </blockquote> <h4 id="appendix-how-gpt-4o-thinks-people-write">Appendix: how GPT-4o thinks people write</h4> <p>[2024 update]</p> <p>I've seen quite a few people fawn over GPT-3's, GPT-3.5's, and GPT-4's ability to &quot;imitate&quot; somebody and create work that's in their style, allowing you to read hypothetical essays or have hypothetical conversations with someone. I haven't found this to work very well in general.
Below are examples of what you get with the prompt &quot;Generate a 200 word quote about Microsoft that's something X might write.&quot; (I tried other prompts and got equally bad results, in that the content, tone, and style are not much like what any of these people actually produce):</p> <p>Joel Spolsky:</p> <blockquote> <p>Microsoft is a study in the complex interplay between engineering excellence and business acumen. They're not just a company that builds software; they're a sprawling, multi-tentacled beast that has, over decades, managed to insert itself into almost every corner of the tech industry. They’ve had their missteps—anyone remember Clippy or the Zune?—but they’ve also had spectacular successes. The magic of Microsoft lies in their ability to pivot and adapt, often turning what seems like a colossal blunder into an integral part of their strategy.</p> <p>Their initial dominance was built on Windows and Office, a formidable combo that defined personal and enterprise computing for decades. But what’s really fascinating is how they’ve evolved. Under Nadella, Microsoft transformed from a lumbering giant into a nimble innovator, diving headfirst into the cloud with Azure and embracing the open-source community they once shunned. They’ve created a developer ecosystem that’s robust and, dare I say, fun to work in. Microsoft is not just about software anymore; it's about platforms, services, and an ever-growing universe of connected tools. They’re proof that even the biggest companies can stay relevant if they’re willing to rethink and reinvent themselves continually</p> </blockquote> <p>Paul Graham:</p> <blockquote> <p>Microsoft's journey is a quintessential example of how adaptability trumps inertia in the tech world. In its early days, the company epitomized the startup ethos of seizing opportunities, transforming from a scrappy two-man team into a juggernaut by outmaneuvering larger, more established competitors. They mastered the art of commoditizing software and built an ecosystem that turned personal computing into a universal reality. Yet, success bred complacency. For years, Microsoft clung to its Windows and Office cash cows, missing the mobile and internet revolutions as younger, nimbler rivals reshaped the industry. But Microsoft’s story didn't end there. Under Satya Nadella’s visionary leadership, the company rediscovered its startup roots, pivoting towards the cloud and embracing open source with a zeal reminiscent of its early days. This ability to transform, to reimagine itself in the face of profound industry shifts, is what keeps Microsoft relevant. It's a lesson to every startup and tech giant alike: the key to longevity is not just innovation but the relentless pursuit of reinvention. In Microsoft's world, the only constant is change, and the company's true genius lies in its unwavering ability to adapt and thrive.</p> <p>Microsoft is a fascinating entity to explore, especially if you love to dive into the details of technology and see how they scale complexity. For a long time, they were the behemoth that you couldn’t avoid if you were using a computer. Windows and Office were like the air you breathed—ubiquitous and essential. But what's really interesting is how Microsoft has evolved. They’ve gone from being the gatekeepers of the desktop to pioneers in the cloud and champions of open-source.</p> <p>Think about Azure.
It’s this massive infrastructure that powers everything from small startups to large enterprises, enabling developers to deploy code globally with just a few clicks. And then there’s their embrace of open source. Remember when they acquired GitHub? That was a signal that Microsoft was serious about supporting the developer community. It wasn’t just a PR move; they’re contributing to and even leading projects that millions of us rely on.</p> <p>Their tools have become indispensable for so many of us. Visual Studio Code, for instance, is a joy to use. It’s light, fast, and extensible—a perfect example of how Microsoft has learned to build products that developers love. It’s a company that’s not afraid to reinvent itself and, in doing so, makes our tech lives better and more interesting.</p> </blockquote> <p><i>Thanks to Yossi Kreinin, Ben Kuhn, Laurence Tratt, Heath Borders, Jamie Brandon, Julia Evans, Vegard Nossum, Julien Kirch, Bram Delver, and Pam Wolf for comments/corrections/discussion.</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:P"><p>What's worked can mean very different things for different people, but for this section we're going to look at popular blogs because, when people I know have frustratedly stopped writing after writing a blog for a while, the most common reason has been that their blog had basically no readers.</p> <p>Of course, many people write without a goal of having readers and some people even try to avoid having more than a few readers (by &quot;locking&quot; posts in some way so that only &quot;friends&quot; have access), but I don't think the idea that &quot;what works&quot; is very broad and that many different styles can work changes if the goal is to have just a few friends read a blog.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:O">This is pretty arbitrary. In other social circles, Jeff Atwood, Raymond Chen, Scott Hanselman, etc., might be on the list, but this wouldn't change the point since all of these folks also have different styles from each other as well as from the people on my list. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:7">2017 is the endpoint since I reduced how much I pay attention to programming internet culture around then and don't have a good idea of what people I know were reading after 2017. <a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li> <li id="fn:C">In sports, elite coaches that have really figured out how to cue people to do the right thing can greatly accelerate learning but, outside of sports, although there's no shortage of people who are willing to supply coaching, it's rare to find one who's really figured out what cues students can be given that will help them get to the right thing much more quickly than they would've if they just naively measured what they were doing and applied a bit of introspection. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:G">It turns out that blogging has been pretty great for me (e.g., my blog got me my current job, facilitated meeting a decent fraction of my friends, results in people sending me all sorts of interesting stories about goings-on in the industry, etc.), but I don't think that was a predictable outcome before starting the blog. My guess, based on base rates, was that the most likely outcome was failure.
<a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:H">Such as <a href="https://news.ycombinator.com/item?id=6315483">this comment on how cushy programming jobs are compared to other lucrative jobs</a> (which turned into the back half of <a href="bimodal-compensation/">this post on programmer compensation</a>), <a href="https://news.ycombinator.com/item?id=4652367">this comment on writing pay</a>, and <a href="https://news.ycombinator.com/item?id=4920429">this comment on the evolution of board game design</a>. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> Some latency measurement pitfalls latency-pitfalls/ Mon, 06 Dec 2021 00:00:00 +0000 latency-pitfalls/ <p><small><i>This is a pseudo-transcript (actual words modified to be more readable than a 100% faithful transcription) of a short lightning talk I did at Twitter a year or two ago, on pitfalls of how we use latency metrics (with the actual service names anonymized per a comms request). Since this presentation, significant progress has been made on this on the infra side, so the situation is much improved over what was presented, but I think this is still relevant since, from talking to folks at peer companies, many folks are facing similar issues.</i></small></p> <p>We frequently use <a href="https://research.google/pubs/pub40801/">tail latency</a> metrics here at Twitter. Most frequently, service owners want to get cluster-wide or Twitter-wide latency numbers for their services. Unfortunately, the numbers that service owners tend to use differ from what we'd like to measure due to some historical quirks in our latency measurement setup:</p> <ul> <li>Opaque, uninstrumented, latency</li> <li>Lack of cluster-wide aggregation capability</li> <li>Minutely resolution</li> </ul> <h4 id="opaque-uninstrumented-latency">Opaque, uninstrumented, latency</h4> <p>When we look at the dashboards for most services, the latency metrics that are displayed and are used for alerting are usually from the server the service itself is running on. Some services that have dashboards set up by senior SREs who've been burned by invisible latency before will also have the service's client-observed latency from callers of the service.
I'd like to discuss three issues with this setup.</p> <p>For the purposes of this talk, we can view a client request as passing through the following pipeline after client &quot;user&quot; code passes the request to our RPC layer, Finagle (<a href="https://twitter.github.io/finagle/">https://twitter.github.io/finagle/</a>), and before client user code receives the response (the way Finagle currently handles requests, we can't get timestamps for a particular request once the request is handed over to the network library we use, <a href="https://netty.io/">netty</a>).</p> <p><code>client netty -&gt; client Linux -&gt; network -&gt; server Linux -&gt; server netty -&gt; server &quot;user code&quot; -&gt; server netty -&gt; server Linux -&gt; network -&gt; client Linux -&gt; client netty</code></p> <p>As we previously saw in [an internal document quantifying the impact of <a href="https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html">CFS bandwidth control throttling</a> and how our use of excessively large thread pools causes throttling]<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">1</a></sup>, we frequently get a lot of queuing in and below netty, which has the knock-on effect of causing services to get throttled by the kernel, which often results in a lot of opaque latency, especially under high load, when we most want dashboards to show correct latency numbers.</p> <p>When we sample latency at the server, we basically get latency from</p> <ul> <li>Server service &quot;user&quot; code</li> </ul> <p>When we sample latency at the client, we basically get</p> <ul> <li>Server service &quot;user&quot; code</li> <li>Server-side netty</li> <li>Server-side Linux latency</li> <li>Client-side Linux latency</li> <li>Client-side netty latency</li> </ul> <p>Two issues with this stand out. First, with metrics data, we don't have a nice way to tell whether latency in the opaque parts of the stack is coming from the client or the server. As a service owner, if you set alerts based on client latency, you'll get alerted when client latency rises because there's too much queuing in netty or Linux on the client even when your service is running smoothly.</p> <p>Also, the client latency metrics that are reasonable to look at given what we expose give you latency for all servers a client talks to, which is a really different view from what we see on server metrics, which give us per-server latency numbers, and there isn't a good way to aggregate the client-side numbers across all clients, so it's difficult to tell, for example, if a particular instance of a server has high latency in netty.</p> <p>Below are a handful of examples of cluster-wide measurements of latency measured at the client vs. the server. These were deliberately selected to show a cross-section of deltas between the client and the server.</p> <p><img src="images/latency-pitfalls/service-1.png" alt="Graph showing large difference between latency measured at the client vs.
at the server" width="1259" height="778"></p> <p>This is a <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a>, presented with the standard orientation for a CDF, with the percentile on the y-axis and the value on the x-axis, which makes down and to the right higher latency and up and to the left lower latency, and a flatter line meaning latency is increasing quickly and a steeper line meaning that latency is increasing more slowly.</p> <p>Because the chart is log scale on both axes, the difference between client and server latency is large even though the lines don't look all that far apart. For example, if we look at 99%-ile latency, we can see that it's ~16ms when measured at the server and ~240ms when measured at the client, a factor of 15 difference. Alternately, if we look at a fixed latency, like 240ms, and look up the percentile, we see that's 99%-ile latency on the client, but well above 99.9%-ile latency on the server.</p> <p>The graphs below have similar properties, although the delta between client and server will vary.</p> <p><img src="images/latency-pitfalls/service-2.png" alt="Graph showing moderate difference between latency measured at the client vs. at the server until p99.5, with large difference above p99.5" width="1259" height="778"> <img src="images/latency-pitfalls/service-3.png" alt="Graph showing small difference between latency measured at the client vs. at the server until p74, with increasing divergence after that" width="1259" height="778"> <img src="images/latency-pitfalls/service-4.png" alt="Graph showing moderate difference between latency measured at the client vs. at the server until close to client timeout value, with large divergence near timeout value" width="1259" height="778"> <img src="images/latency-pitfalls/service-5.png" alt="Graph showing small difference between latency measured at the client vs. at the server until p999, with rapid increase after that" width="1259" height="778"></p> <p>We can see that latencies often differ significantly when measured at the client vs. when measured at the server and that, even in cases where the delta is small for lower percentiles, it sometimes gets large at higher percentiles, where more load can result in more queueing and therefore more latency in netty and the kernel.</p> <p>One thing to note is that, for any particular measured server latency value, we see a very wide range of client latency values. For example, here's a zoomed-in scatterplot of client vs. server latency for <code>service-5</code>. If we were to zoom out, we'd see that for a request with a server-measured latency of 10ms, we can see client-measured latencies as high as 500ms. More generally, we see many requests where the server-measured latency is very similar to the client-measured latency, with a smattering of requests where the server-measured latency is a very inaccurate representation of the client-measured latency. In almost all of those cases, the client-measured latency is higher due to queuing in a part of the stack that's opaque to us and, in a (very) few cases, the client-measured latency is lower due to some issues in our instrumentation. In the plot below, due to how we track latencies, we only have 1ms granularity on latencies. The points on the plots below have been randomly jittered by +/- 0.4ms to give a better idea of the distribution at points on the plot that are very dense.</p> <p><img src="images/latency-pitfalls/service-scatter.png" alt="Per-request scatterplot of client vs. server latency, showing that any particular server latency value can be associated with a very wide range of client latency values" width="1259" height="778"></p>
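<p>To make the comparison concrete, here's a rough sketch of the kind of analysis behind plots like these: given per-request latencies sampled at the server and at the client, compare tail percentiles and check what percentile a fixed latency threshold lands at on each side. The distributions, the amount of opaque client-side queueing, and the percentile function below are all made up for illustration; they're stand-ins for real trace data, not measurements from any of the services above.</p>
<pre><code>import random

random.seed(0)

def percentile(sorted_xs, p):
    # Nearest-rank percentile; fine for a sketch, not for production metrics.
    idx = min(len(sorted_xs) - 1, int(p / 100 * len(sorted_xs)))
    return sorted_xs[idx]

def percentile_rank(sorted_xs, value):
    # What percentile does a fixed latency value land at?
    below = sum(1 for x in sorted_xs if x &lt;= value)
    return 100 * below / len(sorted_xs)

# Fake paired samples: server-side latency plus "opaque" client-side queueing
# that's usually negligible but occasionally large.
server_lat, client_lat = [], []
for _ in range(100_000):
    server = random.lognormvariate(1.0, 0.6)  # ms
    opaque = 0.2 if random.random() &lt; 0.95 else random.expovariate(1 / 80.0)
    server_lat.append(server)
    client_lat.append(server + opaque)

server_lat.sort()
client_lat.sort()

for p in (50, 90, 99, 99.9):
    print("p%-5s server %7.1fms   client %7.1fms"
          % (p, percentile(server_lat, p), percentile(client_lat, p)))

# A fixed threshold sits at very different percentiles on each side.
for threshold in (50, 240):
    print("%dms is p%.2f at the server, p%.2f at the client"
          % (threshold, percentile_rank(server_lat, threshold),
             percentile_rank(client_lat, threshold)))
</code></pre>
<p>With paired samples from a tracing system in place of the synthetic data, this is essentially what the percentile comparisons above boil down to.</p>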
<p>While it's possible to plumb instrumentation through netty and the kernel to track request latencies after Finagle has handed them off (the kernel even has hooks that would make this somewhat straightforward), that's probably more work than it's worth in the near future. If you want to get an idea of how your service is impacted by opaque latency, it's fairly easy to get a rough estimate with <a href="https://zipkin.io/">Zipkin</a> if you leverage <a href="tracing-analytics/">the work Rebecca Isaacs, Jonathan Simms, and Rahul Iyer have done</a>, which is how I generated the plots above. The code for these is checked into [a path in our monorepo] and you can plug in your own service names if you just want to check out a different service.</p> <h4 id="lack-of-cluster-wide-aggregation-capability">Lack of cluster-wide aggregation capability</h4> <p>In the examples above, we were able to get cluster-wide latency percentiles because we used data from Zipkin, which attempts to sample requests uniformly at random. For a variety of reasons, service owners mostly rely on metrics data which, while more complete because it's unsampled, doesn't let us compute cluster-wide aggregates because we pre-compute fixed aggregations on a per-shard basis and there's no way to reconstruct the cluster-wide aggregate from the per-shard aggregates.</p> <p>From looking at dashboards of our services, the most common latency target is an average across shards of shard-level 99%-ile latency (with some services that are deep in the request tree, like cache, using numbers further in the tail). Unfortunately, taking the average of per-shard tail latency defeats the purpose of monitoring tail latency. If <a href="https://brooker.co.za/blog/2021/04/19/latency.html">we think about why we want to use tail latency</a>, it's because, when we have high fanout and high depth request trees, a very small fraction of server responses slowing down can slow down many or most top-level requests. The average of shard-level tail latencies fails to capture exactly that property, that a small fraction of slow server responses can slow down many or most requests, while <a href="https://brooker.co.za/blog/2017/12/28/mean.html">also missing out on the advantages of looking at cluster-wide averages</a>, which can be reconstructed from per-shard averages.</p> <p>For example, when we have a few bad nodes returning responses slowly, that has a small impact on the average per-shard tail latency even though cluster-wide tail latency will be highly elevated. As we saw in [a document quantifying the extent of machine-level issues across the fleet as well as the impact on data integrity and performance]<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">2</a></sup>, we frequently have host-level issues that can drive tail latency on a node up by one or more orders of magnitude, which can sometimes drive median latency on the node up past the tail latency on other nodes.
Since a few or even one such node can determine the tail latency for a cluster, taking the average across all nodes can be misleading, e.g., if we have a 100 node cluster where tail latency is up by 10x on one node, this might cause our average of per-shard tail latencies to increase by a factor of 0.99 + 0.01 * 10 = 1.09 when the actual increase in cluster-wide tail latency is much larger.</p> <p>Some service owners try to get a better approximation of cluster-wide tail latency by taking a percentile of the 99%-ile, often the 90%-ile or the 99%-ile, but this doesn't work either and there is, in general, no per-shard percentile or other aggregation of per-shard tail latencies that can reconstruct the cluster-level tail latency.</p> <p>Below are plots of the various attempts that people have on dashboards to get cluster-wide latency with instance-level metrics data vs. actual (sampled) cluster-wide latency, on a large service (which makes the percentile of percentile attempts more accurate than they would be for smaller services). We can see the correlation is very weak and has the problem we expect, where the average of the tail isn't influenced by outlier shards as much as it &quot;should be&quot; and the various commonly used percentiles either aren't influenced enough or are influenced too much on average, and are also weakly correlated with the actual latencies. Because we track metrics with minutely granularity, each point in the graphs below represents one minute, with the sampled cluster-wide p999 latency on the x-axis and the dashboard aggregated metric value on the y-axis. Because we have 1ms granularity on individual latency measurements from our tracing pipeline, points are jittered horizontally +/- 0.3ms to give a better idea of the distribution (no such jitter is applied vertically since we don't have this limitation in our metrics pipeline, so that data is higher precision).</p> <p><img src="images/latency-pitfalls/cluster-1.png" alt="Per-minute scatterplot of average of per-shard p999 vs. actual p999, showing that average of per-shard p999 is a very poor approximation" width="1259" height="778"> <img src="images/latency-pitfalls/cluster-2.png" alt="Per-minute scatterplot of p99 of per-shard p999 vs. actual p999, showing that p99 of per-shard p999 is a poor approximation" width="1259" height="778"> <img src="images/latency-pitfalls/cluster-3.png" alt="Per-minute scatterplot of p999 of per-shard p999 vs. actual p999, showing that p999 of per-shard p999 is a very poor approximation" width="1259" height="778"></p> <p>The correlation between cluster-wide latency and aggregations of per-shard latency is weak enough that even if you pick the aggregation that results in the correct average behavior, the value will still be quite wrong for almost all samples (minutes).</p>
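<p>For a concrete (and purely synthetic) illustration of the failure mode, here's a small simulation of the 100 node example above: one node with 10x latency, comparing a couple of dashboard-style aggregates of per-shard p999 against the p999 computed over all requests. The latency distribution and percentile function are invented for the sketch and aren't meant to model any particular service.</p>
<pre><code>import random, statistics

random.seed(0)

NUM_SHARDS = 100
REQS_PER_SHARD = 10_000

def percentile(xs, p):
    # Crude nearest-rank percentile, good enough for a sketch.
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

# 100 shards, one of which is 10x slower than the rest.
shards = []
for i in range(NUM_SHARDS):
    slowdown = 10 if i == 0 else 1
    shards.append([slowdown * random.lognormvariate(1.0, 0.5)  # latency in ms
                   for _ in range(REQS_PER_SHARD)])

per_shard_p999 = [percentile(shard, 99.9) for shard in shards]
all_requests = [x for shard in shards for x in shard]

# The dashboard-style aggregates don't recover the true value; pooling the
# raw samples (what a proper histogram merge would give you) does.
print("avg of per-shard p999:  %6.1fms" % statistics.mean(per_shard_p999))
print("p99 of per-shard p999:  %6.1fms" % percentile(per_shard_p999, 99))
print("p999 over all requests: %6.1fms" % percentile(all_requests, 99.9))
</code></pre>
<p>The exact numbers depend on the distribution you pick, but the qualitative result matches the plots above: the average of per-shard tail latencies barely moves when one node goes bad, the percentile-of-percentiles either over- or under-shoots, and only an aggregate computed over the underlying requests (which is what merging per-shard histograms would give us) tracks the cluster-wide tail.</p>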
<p>Given our infra, the only solutions that can really work here are extending our tracing pipeline for use on dashboards and with alerts, or adding metric histograms to Finagle and plumbing that data up through everything and then into [dashboard software] so that we can get proper cluster-level aggregations<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>.</p> <p>While it's popular to take the average of tail latencies because it's easy and people are familiar with it (e.g., the TL of observability at [redacted peer company name] has said that they shouldn't bother with anything other than averages because everyone just wants averages), taking the average or another aggregation of shard-level tail latencies has neither the properties people want nor the properties people expect.</p> <h4 id="minutely-resolution">Minutely resolution</h4> <p>Another, independent, issue that's a gap in our ability to observe what's going on with our infrastructure is that we only collect metrics at a minutely granularity. Rezolus does metrics collection on a secondly (and in some cases, even sub-secondly) granularity, but for reasons that are beyond the scope of this talk, it's generally only used for system-level metrics (with a few exceptions).</p> <p>We've all seen incidents where some bursty, sub-minutely event is the cause of a problem. Let's look at an example of one such incident. In this incident, a service had elevated latency and error rate. Looking at the standard metrics we export wasn't informative, but looking at sub-minutely metrics immediately reveals a clue:</p> <p><img src="images/latency-pitfalls/minutely-1.png" alt="Plot of per-request latency for sampled requests, showing large spike followed by severely reduced request rate" width="1259" height="778"></p> <p>For this particular shard of a cache (and many others, not shown), there's a very large increase in latency at <code>time 0</code>, followed by 30 seconds of very low request rate. The 30 seconds is because shards of <code>service-6</code> were configured to mark servers they talk to as dead for 30 seconds if <code>service-6</code> clients encounter too many failed requests. This decision is distributed, which is why the request rate to the impacted shard of <code>cache-1</code> isn't zero; some shards of <code>service-6</code> didn't send requests to that particular shard of <code>cache-1</code> during the period of elevated latency, so they didn't mark that shard of <code>cache-1</code> as dead and continued to issue requests.</p> <p>A sub-minutely view of request latency made it very obvious what mechanism caused elevated error rates and latency in <code>service-6</code>.</p> <p>One thing to note is that the lack of sub-minutely visibility wasn't the only issue here. Much of the elevated latency was in places that are invisible to the latency metric, making monitoring of <code>cache-1</code> latencies insufficient to detect the issue. Below, the reported latency metrics for a single instance of <code>cache-1</code> are the blue points and the measured (sampled) latency the client observed is the black line<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">4</a></sup>. Reported p99 latency is 0.37ms, but actual p99 latency is ~580ms, a more than three order of magnitude difference.</p> <p><img src="images/latency-pitfalls/minutely-2.png" alt="Plot of reported metric latency vs.
latency from trace data, showing extremely large difference between metric latency and trace latency" width="1259" height="778"></p> <h4 id="summary">Summary</h4> <p>Although our existing setup for reporting and alerting on latency works pretty decently, in that the site generally works and our reliability is actually quite good compared to peer companies in our size class, we do pay some significant costs as a result of our setup.</p> <p>One is that we often have incidents where it's difficult to see what's going on without using tools that are considered specialized and that most people don't use, adding to the toil of being on call. Another is that, due to large margins of error in our estimates of cluster-wide latencies, we have to provision a very large amount of slack and keep latency SLOs that are much stricter than the actual latencies we want to achieve to avoid user-visible incidents. This increases operating costs, as we've seen in [a document comparing per-user operating costs to those of companies that serve similar kinds of and levels of traffic].</p> <p><i>If you enjoyed this post you might like to read about <b><a href="perf-tracing/">tracing on a single host vs. sampling profilers</a></b></i>.</p> <h4 id="appendix-open-vs-closed-loop-latency-measurements">Appendix: open vs closed loop latency measurements</h4> <p>Some of our synthetic benchmarking setups, such as setup-1, use &quot;closed-loop&quot; measurement, where they effectively send a single request, wait for it to come back, and then send another request. Some of these allow for a degree of parallelism, where N requests can be in flight at once, but that still has similar problems in terms of realism.</p> <p>For a toy example of the problem, let's say that we have a service that, in production, receives exactly 1 request every second and that the service has a normal response time of 1/2 second. Under normal behavior, if we issue requests at 1 per second, we'll observe that the mean, median, and all percentile request times are 1/2 second. As an exercise for the reader, compute the mean and 90%-ile latency if the service has no parallelism and one request takes 10 seconds in the middle of a 1 minute benchmark run for a closed vs. open loop benchmark setup, where the benchmarking setup issues requests at 1 per second in the open loop case, and at 1 per second but waiting for the previous request to finish in the closed loop case.</p> <p>For more info on this, see <a href="https://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html">Nitsan Wakart's write-up on fixing this issue in the YCSB benchmark</a> or <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">Gil Tene's presentation on this issue</a>.</p> <h4 id="appendix-use-of-unweighted-averages">Appendix: use of unweighted averages</h4> <p>A common issue with averages on dashboards that I've looked at, independent of the issues that come up when we take the average of tail latencies, is that an unweighted average frequently underestimates the actual latency.</p> <p>Two places I commonly see an unweighted average are when someone gets an overall latency by taking an unweighted average across datacenters and when someone gets a cluster-wide latency by taking an average across shards. Both of these have the same issue, that shards (or datacenters) with lower load tend to have lower latency. This is especially pronounced when we fail away from a datacenter. Services that incorrectly use an unweighted average across datacenters will often show decreased latency even though actually served requests have increased latency.</p>
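<p>Here's a toy example of the effect with invented numbers (they're not from any real service): after failing most traffic away from a datacenter, the mostly-drained datacenter is fast because it's idle, and an unweighted average across the two datacenters reports a latency well below what the typical request actually saw. The same arithmetic applies to an unweighted average across shards with uneven load.</p>
<pre><code># (requests per minute, average latency in ms) per datacenter, after failing
# most traffic away from one of them; numbers are invented for illustration.
datacenters = [
    (900_000, 40.0),   # datacenter taking most of the traffic, slow under load
    (100_000, 10.0),   # mostly-drained datacenter, fast because it's idle
]

unweighted = sum(lat for _, lat in datacenters) / len(datacenters)
weighted = (sum(reqs * lat for reqs, lat in datacenters)
            / sum(reqs for reqs, _ in datacenters))

print("unweighted average:       %.1fms" % unweighted)  # 25.0ms
print("request-weighted average: %.1fms" % weighted)    # 37.0ms
</code></pre>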
<p><i>Thanks to Ben Kuhn for comments/corrections/discussion.</i></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="perf-tracing/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:W">This is another item that's somewhat out of date, since this document motivated work from Flavio Brasil and Vladimir Kostyukov on Finagle that reduces the impact of this problem and then, later, work from my then-intern, Xi Yang, on a patch to the kernel scheduler that basically eliminates the problem by preventing cgroups from exceeding their CPU allocation (as opposed to the standard mechanism, which allows cgroups to exceed their allocation and then effectively puts the cgroup to sleep until its amortized cpu allocation is no longer excessive, which is very bad for tail latency). <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:H">This is yet another item that's out of date since the kernel, HWENG, and the newly created fleet health team have expended significant effort to drive down the fraction of unhealthy machines. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> <li id="fn:M">This is also significantly out of date today. Finagle does now support exporting shard-level histogram data and this can be queried via one-off queries by hitting the exported metrics endpoint. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:O">As we previously noted, opaque latency could come from either the server or the client, but in this case, we have strong evidence that the latency is coming from the <code>cache-1</code> server and not the <code>service-6</code> client because opaque latency from the <code>service-6</code> client should be visible on all requests from <code>service-6</code> but we only observe elevated opaque latency on requests from <code>service-6</code> to <code>cache-1</code> and not to the other servers it &quot;talks to&quot;. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> </ol> </div> Major errors on this blog (and their corrections) corrections/ Mon, 22 Nov 2021 00:00:00 +0000 corrections/ <p>Here's a list of errors on this blog that I think were fairly serious. While what I think of as serious is, of course, subjective, I don't think there's any reasonable way to avoid that because, e.g., I make a huge number of typos, so many that the majority of acknowledgements on many posts are for people who e-mailed or DM'ed me typo fixes.</p> <p>A list that included everything, including typos, would be both uninteresting for other people to read and high overhead for me, which is why I've drawn the line somewhere. An example of an error I don't think of as serious is that, <a href="learning-to-program/">in this post on how I learned to program</a>, I originally had the dates wrong on when the competition programmers from my high school made money (it was a couple years after I thought it was). In that case, and many others, I don't think that the date being wrong changes anything significant about the post.</p> <p>Although I'm publishing the original version of this in 2021, I expect this list to grow over time. I hope that I've become more careful and that the list will grow more slowly in the future than it has in the past, but that remains to be seen.
I view it as a good sign that a large fraction of the list is from my first three months of blogging, in 2013, but that's no reason to get complacent!</p> <p>I've added a classification below that reflects how I think of the errors, but that classification is also arbitrary and the categories aren't even mutually exclusive. If I ever collect enough of these that it's difficult to hold them all in my head at once, I might create a tag system and use that to classify them instead, but I hope to not accumulate so many major errors that I feel like I need a tag system for readers to easily peruse them.</p> <ul> <li>Insufficient thought <ul> <li><i>2013</i>: <a href="randomize-hn/">Using random algorithms to decrease the probability that good stories get &quot;unlucky&quot; on HN</a>: this idea was tried and didn't work as well as putting humans in the loop who decide which stories should be rescued from oblivion. <ul> <li>Since this was a proposal and not a claim, this technically wasn't an error since I didn't claim that this would definitely work, but my feeling is that I should've also considered solutions that put humans in the loop. I didn't because Digg famously got a lot of backlash for having humans influence their front page but, in retrospect, we can see that it's possible to do so in a way that doesn't generate backlash that effectively kills the site and I think this could've been predicted with enough thought</li> </ul></li> </ul></li> <li>Naivete <ul> <li><i>2013</i>: <a href="hardware-unforgiving/">The institutional knowledge and culture that create excellence can take a long time to build up</a>: At this time, I hadn't worked in software and thought that this wasn't as difficult for software because so many software companies are successful with new/young teams. But, in retrospect, the difference isn't that those companies don't produce bad (unreliable, buggy, slow, etc.) software, it's that product/market fit and network effects are important enough that it frequently doesn't matter that software is bad</li> <li><i>2015</i>: <a href="dunning-kruger/">In this post on how people don't read citations</a>, I found it mysterious that type system advocates would cite non-existent strong evidence, which seems unlike the other examples, where people pass on a clever, contrarian, result without ever having read it. The thing I thought was mysterious was that, unlike the other examples, there isn't an incorrect piece of evidence being passed around; the assertion that there is evidence is disconnected from any evidence, even misinterpreted evidence. In retrospect, I was being naive in thinking that there had to be a link to some piece of evidence; people will just fabricate the idea that there is evidence supporting their belief and then pass that around.</li> </ul></li> <li>Insufficient verification of information <ul> <li><i>2016</i>: <a href="sounds-easy/">Building a search engine isn't trivial</a>: although I think the overall point is true, one of the pieces of evidence I relied on came from numbers that someone who worked on a search engine told me about. But when I measured actual numbers, I found that the numbers I was told were off by multiple orders of magnitude</li> <li><i>2022</i>: <a href="futurist-predictions/">Futurist predictions</a>, pointed out to me by @ESRogs: I misread nostalgebraist's summary of a report and didn't understand what he was saying with respect to a sensitivity analysis he was referring to.
I distinctly remember not being sure what nostalgebraist was saying and originally agreed with the correct interpretation. After re-reading it, I came away with my mistaken reading, which I then wrote into my post. That I had uncertainty about the reading should've caused me to just reproduce his analysis, which would have immediately clarified what he meant, but I didn't do that. This error didn't fundamentally change my own analysis since the broader point I was making didn't hinge on the exact numbers, but I think it's a very bad habit to allow yourself to publish something with the level of uncertainty I had without noting the uncertainty (quite an ironic mistake considering the contents of the post itself). A factor that led both to the mistake in the first place and to not checking the math in a way that would've spotted the mistake is that the edits that introduced this mistake were a last-minute change, introduced when I had a short window of time to make the changes if I wanted to publish immediately and not some time significantly later. Of course, that should have led to me delaying publication, so this was one bad decision that led to another</li> </ul></li> <li>Blunder <ul> <li><i>2015</i>: <a href="butler-lampson-1999/">Checking out Butler Lampson's review of what worked in CS, 16 years later</a>: it was wrong to say that capabilities were a &quot;no&quot; in 2015 given their effectiveness on mobile, and that was so obviously wrong at the time that I would call this a blunder rather than something where I gave it a decent amount of thought but should've thought through it more deeply</li> <li><i>2024</i>: <a href="diseconomies-scale/">Diseconomies of scale</a>: I mixed up which number I was dividing by which when doing arithmetic, causing a multiple order of magnitude error in a percentage. Sophia Wisdom noticed this a few hours after the post was published and I fixed it immediately, but this was quite a silly error.</li> </ul></li> <li>Pointlessly difficult to understand explanation <ul> <li><i>2013</i>: <a href="3c-conflict/">How data alignment impacts memory latency</a>: the main plots in this post use a ratio of latencies, which adds a level of indirection that many people found confusing</li> <li><a id="p95"><i>2017</i></a>: <a href="p95-skill/">It is easy to achieve 95%-ile performance</a>: the most common objection people had to this post was something like &quot;False. You need to be very talented and/or it is hard to [play in the NBA / become a chess GM / achieve a 2200 chess rating]&quot;. <a href="https://twitter.com/JamesClear/status/1292574538912456707">James Clear made an even weaker claim on Twitter</a> and also got similar responses. There isn't really space to do this on Twitter, but in my blog post, I should've included more concrete examples of what various levels of performance look like for people who have a difficult time estimating what performance looks like at various percentiles.
To pick one of the less outlandish claims, <a href="https://lobste.rs/s/mwykjj/95_ile_isn_t_good#c_pudige">here's a claim that a 2200 rating is 95%-ile for someone who's ever played chess online</a>, which <a href="https://lobste.rs/s/mwykjj/95_ile_isn_t_good#c_iyoegc">appears to be off by perhaps four orders of magnitude, plus or minus one</a>.</li> </ul></li> <li>Errors in retrospect <ul> <li><i>2015</i>: <a href="blog-ads/">Blog monetization</a>: I grossly underestimated how much <a href="https://www.patreon.com/danluu">I could make on Patreon</a> by looking at how much Casey Muratori, Eric Raymond, and eevee were making on Patreon at the time. I thought that all three of them would out-earn me for a variety of reasons and that was incorrect. A major reason that was incorrect was that boring, long-form, writing monetizes much better than I expected, which means that I monetarily undervalued it compared to what other tech folks are doing. <ul> <li>A couple weeks ago, I added a link to Patreon at the top of posts (instead of just having one hidden at the bottom) and mentioned having a Patreon on Twitter. Since then, my earnings have increased by about as much as Eric Raymond makes in total and the amount seems to be increasing at a decent rate, which is a result I wouldn't have expected before the rise of substack. But anyone who realized how well individual writers can monetize their writing could've created substack and no one did until Chris Best, Hamish McKenzie, and Jairaj Sethi created substack, so I'd say that this one was somewhat non-obvious. Also, <a href="https://www.patreon.com/posts/60185075">it's unclear if the monetization is going to scale up or will plateau</a>; if it plateaus, then my guess would only be off by a small constant factor.</li> </ul></li> </ul></li> </ul> <p>Thanks to Anja Boskovic and Ville Sundberg for comments/corrections/discussion.</p> Individuals matter people-matter/ Mon, 15 Nov 2021 00:00:00 +0000 people-matter/ <p>One of the most common mistakes I see people make when looking at data is incorrectly using an overly simplified model. A specific variant of this that has derailed the majority of work roadmaps I've looked at is treating people as interchangeable, as if it doesn't matter who is doing what, as if individuals don't matter.</p> <p>Individuals matter.</p> <p>A pattern I've repeatedly seen during the roadmap creation and review process is that people will plan out the next few quarters of work and then assign some number of people to it, one person for one quarter to a project, two people for three quarters to another, etc. Nominally, this process enables teams to understand what other teams are doing and plan appropriately. I've never worked in an organization where this actually worked, where this actually enabled teams to effectively execute with dependencies on other teams.</p> <p>What I've seen happen instead is, when work starts on the projects, people will ask who's working on the project and then will make a guess at whether or not the project will be completed on time or in an effective way or even be completed at all based on who ends up working on the project. &quot;Oh, Joe is taking feature X? He never ships anything reasonable. Looks like we can't depend on it because that's never going to work. Let's do Y instead of Z since that won't require X to actually work&quot;.
The roadmap creation and review process maintains the polite fiction that people are interchangeable, but everyone knows this isn't true and teams that are effective and want to ship on time can't play along when the rubber hits the road, even if they play along with the managers, directors, and VPs who create roadmaps as if people can be generically abstracted over.</p> <p>Another place the non-fungibility of people causes predictable problems is with how managers operate teams. Managers who want to create effective teams<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup> end up fighting the system in order to do so. Non-engineering orgs mostly treat people as fungible, and the finance org at a number of companies I've worked for forces the engineering org to treat people as fungible by requiring the org to budget in terms of headcount. The company, of course, spends money and not &quot;heads&quot;, but internal bookkeeping is done in terms of &quot;heads&quot;, so $X of budget will be, for some team, translated into something like &quot;three staff-level heads&quot;. There's no way to convert that into &quot;two more effective and better-paid staff-level heads&quot;<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">2</a></sup>. If you hire two staff engineers and not a third, the &quot;head&quot; and the associated budget will eventually get moved somewhere else.</p> <p>One thing I've repeatedly seen is that a hiring manager will <a href="https://twitter.com/danluu/status/1452701799417143296">want to hire someone who they think will be highly effective or even just someone who has specialized skills and then not be able to hire because the company has translated budget into &quot;heads&quot; at a rate that doesn't allow for hiring that kind of head</a>. There will be a &quot;comp team&quot; or other group in HR that will object because the comp team has no concept of &quot;an effective engineer&quot; or &quot;a specialty that's hard to hire for&quot;; to the comp team, a person is defined by role, level, and location, and someone who's paid too much for their role and level is therefore a bad hire. If anyone reasonable had power over the process that they were willing to use, this wouldn't happen but, by design, the bureaucracy is set up so that few people have power<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">3</a></sup>.</p> <p>A similar thing happens with retention. A great engineer I know who was regularly creating $x0M/yr<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">4</a></sup> of additional profit for the company wanted to move home to Portugal, so the company cut the person's cash comp by a factor of four. The company also offered to only cut his cash comp by a factor of two if he moved to Spain instead of Portugal. This was escalated up to the director level, but that wasn't sufficient to override HR, so he left for a company that doesn't have location-based pay.
HR didn't care that the person made the company more money than HR saves by doing location adjustments for all international employees combined because HR at the company had no notion of the value of an employee, only the cost, title, level, and location<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">5</a></sup>.</p> <p>Relatedly, a &quot;move&quot; I've seen twice, once from a distance and once from up close, is when HR decides <a href="https://twitter.com/danluu/status/1277591470162104321">attrition is too low</a>. In one case, the head of HR thought that the company's ~5% attrition was &quot;unhealthy&quot; because it was too low and, in another, HR thought that the company's attrition, sitting at a bit under 10%, was too low. In both cases, the company made some moves that resulted in attrition moving up to what HR thought was a &quot;healthy&quot; level. In the case I saw from a distance, folks I know at the company agree that the majority of the company's best engineers left over the next year, many after only a few months. In the case I saw up close, I made a list of the most effective engineers I was aware of (like the person mentioned above who increased the company's revenue by 0.7% on his paternity leave) and, when the company successfully pushed attrition to over 10% overall, the most effective engineers left at over double that rate (which understates the impact of this because they tended to be long-tenured and senior engineers, where the normal expected attrition would be less than half the average company attrition).</p> <p>Some people seem to view companies like a game of SimCity, where if you want more money, you can turn a knob, increase taxes, and get more money, uniformly impacting the city. But companies are not a game of SimCity. If you want more attrition and turn a knob that cranks that up, you don't get additional attrition that's sampled uniformly at random. People, as a whole, cannot be treated as an abstraction where the actions company leadership takes impact everyone in the same way. The people who are most effective will be disproportionately likely to leave if you turn a knob that leads to increased attrition.</p> <p>So far, we've talked about how treating individual people as fungible doesn't work for corporations but, of course, it also doesn't work in general. For example, a complaint from a friend of mine who's done a fair amount of &quot;on the ground&quot; development work in Africa is that a lot of people who are looking to donate want clear, simple criteria to guide their donations (e.g., an RCT showed that the intervention was highly effective). But many effective interventions cannot have their impact demonstrated ex ante in any simple way because, among other reasons, the composition of the team implementing the intervention is important, resulting in a randomized trial or other experiment not being applicable to any team implementing the intervention other than the teams from the trial, in the context they were operating in during the trial.</p> <p>An example of this would be an intervention they worked on that, among other things, helped wipe out guinea worm in a country.
Ex post, we can say that was a highly effective intervention since it was a team of three people operating on a budget of $12/(person-day)<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">6</a></sup> for a relatively short time period, making it a high ROI intervention, but there was no way to make a quantitative case for the intervention ex ante, nor does it seem plausible that there could've been a set of randomized trials or experiments that would've justified the intervention.</p> <p>Their intervention wasn't wiping out guinea worm; that was just a side effect. The intervention was, basically, travelling around the country and embedding in regional government offices in order to understand their problems and then advise/facilitate better decision making. In the course of talking to people and suggesting improvements/changes, they realized that guinea worm could be eliminated with better distribution of clean water (guinea worm can come from drinking unfiltered water; giving people clean water can solve that problem) and that aid money flowing into the country specifically for water-related projects, like building wells, was already sufficient if it was distributed to places in the country that had high rates of guinea worm due to contaminated water instead of to the places aid money was flowing to (which were locations that had a lot of aid money flowing to them for a variety of reasons, such as being near a local &quot;office&quot; that was doing a lot of charity work). The specific thing this team did to help wipe out guinea worm was to give powerpoint presentations to government officials on how the government could advise organizations receiving aid money on how those organizations could more efficiently place wells. At the margin, wiping out guinea worm in a country would probably be sufficient for the intervention to be high ROI, but that's a very small fraction of the &quot;return&quot; from this three person team. I only mention it because it's a self-contained, easily-quantifiable change. Most of the value of &quot;leveling up&quot; decision making in regional government offices is very difficult to quantify (and, to the extent that it can be quantified, will still have very large error bars).</p> <p>Many interventions that seem the same ex ante, probably even most, produce little to no impact. My friend has a lot of comments on organizations that send a lot of people around to do similar sounding work but that produce little value, such as the Peace Corps.</p> <p>A major difference between my friend's team and most teams is that my friend's team was composed of people who had a track record of being highly effective across a variety of contexts. In an earlier job, my friend started at a large-ish ($5B/yr revenue) government-run utility company and was immediately assigned a problem that, unbeknownst to her, had been an open problem for years that was considered to be unsolvable. No one was willing to touch the problem, so they hired her because they wanted a scapegoat to blame and fire when the problem blew up. Instead, she solved the problem she was assigned to as well as a number of other problems that were considered unsolvable. 
A team of three such people will be able to get a lot of mileage out of potentially high ROI interventions that most teams would not succeed at, such as going to a foreign country and improving governmental decision making in regional offices across the country enough that the government is able to solve serious open problems that had been plaguing the country for decades.</p> <p>Many of the highest ROI interventions are similarly skill intensive and not amenable to simple back-of-the-envelope calculations, but most discussions I see on the topic, both in person and online, rely heavily on simplistic but irrelevant back-of-the-envelope calculations. This is not a problem limited to cocktail-party conversations. My friend's intervention was almost killed by the organization she worked for because the organization was infested with what she thinks of as &quot;overly simplistic <a href="https://en.wikipedia.org/wiki/Effective_altruism">EA</a> thinking&quot;, which caused leadership in the organization to try to redirect resources to projects where the computation of expected return was simpler because those projects were thought to be higher impact even though they were, ex post, lower impact. Of course, we shouldn't judge interventions on how they performed ex post since that will overly favor high variance interventions, but I think that someone thinking it through, who was willing to exercise their judgement instead of outsourcing their judgement to a simple metric, could and should say that the intervention in question was a good choice ex ante.</p> <p>This issue of projects which are more <a href="https://amzn.to/3qGiucN">legible</a> getting more funding is an issue across organizations as well as within them. For example, my friend says that, back when GiveWell was mainly or only recommending charities that had easily quantifiable returns, she basically couldn't get her friends who worked in other fields to put resources towards efforts that weren't endorsed by GiveWell. People who didn't know about her aid background would say things like &quot;haven't you heard of GiveWell?&quot; when she suggested putting resources towards any particular cause, project, or organization.</p> <p>I talked to a friend of mine who worked at GiveWell during that time period about this and, according to him, the reason GiveWell initially focused on charities that had easily quantifiable value wasn't that they thought those were the highest impact charities. Instead, it was because, as a young organization, they needed to be credible and it's easier to make a credible case for charities whose value is easily quantifiable. He would not, and he thinks GiveWell would not, endorse donors funnelling all resources into charities endorsed by GiveWell and neglecting other ways to improve the world. But many people want the world to be simple and apply the algorithm &quot;charity on GiveWell list = good; not on GiveWell list = bad&quot; because it makes the world simple for them.</p> <p>Unfortunately for those people, as well as for the world, the world is not simple.</p> <p>Coming back to the tech company examples, Laurence Tratt notes something that I've also observed:</p> <blockquote> <p>One thing I've found very interesting in large organisations is when they realise that they need to do something different (i.e. they're slowly failing and want to turn the ship around). The obvious thing is to let a small team take risks on the basis that they might win big. 
Instead they tend to form endless committees which just perpetuate the drift that caused the committees to be formed in the first place! I think this is because they really struggle to see people as anything other than fungible, even if they really want to: it's almost beyond their ability to break out of their organisational mould, even when it spells long-term doom.</p> </blockquote> <p>One lens we can use to look at what's going on is <a href="https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-called-legibility/">legibility</a>. When you have a complex system, whether that's a company with thousands of engineers or a world with many billions of dollars going to aid work, the system is too complex for any decision maker to really understand, whether that's an exec at a company or a potential donor trying to understand where their money should go. One way to address this problem is by reducing the perceived complexity of the problem via imagining that individuals are fungible, making the system more legible. That produces relatively inefficient outcomes but, unlike <a href="https://twitter.com/benskuhn/status/1458948982059593730">trying to understand the issues at hand</a>, it's highly scalable, and if there's one thing that tech companies like, it's doing things that scale, and treating a complex system like it's SimCity or Civilization is highly scalable. When returns are relatively evenly distributed, losing out on potential outlier returns in the name of legibility is a good trade-off. But when ROI is a heavy-tailed distribution, when the right person can, on their paternity leave, increase the revenue of a giant tech company by 0.7% and then much more when they work on that full-time, then severely tamping down on the right side of the curve to improve legibility is very costly and can cost you the majority of your potential returns.</p> <p>Thanks to Laurence Tratt, Pam Wolf, Ben Kuhn, Peter Bhat Harkins, John Hergenroeder, Andrey Mishchenko, Joseph Kaptur, and Sophia Wisdom for comments/corrections/discussion.</p> <h3 id="appendix-re-orgs">Appendix: re-orgs</h3> <p>A friend of mine recently told me a story about a trendy tech company where they tried to move six people to another project, one that the people didn't want to work on and that they thought didn't really make sense. The result was that two senior devs quit, the EM retired, one PM was fired (long story), and three people left the team. The team for both the old project and the new project had to be re-created from scratch.</p> <p>It could be much worse. In that case, at least there were some people who didn't leave the company. I once asked someone why feature X, which had been publicly promised, hadn't been implemented yet and why the entire sub-product was broken. The answer was that, after about a year of work, when shipping the feature was thought to be weeks away, leadership decided that the feature, which was previously considered a top priority, was no longer a priority and should be abandoned. The team argued that the feature was very close to being done and they just wanted enough runway to finish the feature. When that was denied, the entire team quit and the sub-product has slowly decayed since then. 
After many years, there was one attempted reboot of the team but, for reasons beyond the scope of this story, it was done with a new manager managing new grads and didn't really re-create what the old team was capable of.</p> <p>As we've previously seen, <a href="hardware-unforgiving/">an effective team is difficult to create, due to the institutional knowledge that exists on a team</a>, as well as <a href="culture/">the team's culture</a>, but destroying a team is very easy.</p> <p>I find it interesting that so many people in senior management roles persist in thinking that they can re-direct people as easily as opening up the city view in Civilization and assigning workers to switch from one task to another when the senior ICs I talk to have high accuracy in predicting when these kinds of moves won't work out.</p> <h3 id="appendix-related-posts">Appendix: related posts</h3> <ul> <li><a href="https://yosefk.com/blog/compensation-rationality-and-the-projectperson-fit.html">Yossi Kreinin on compensation and project/person fit</a></li> <li><a href="hardware-unforgiving/">Me on the difficulty of obtaining institutional knowledge</a></li> <li><a href="https://amzn.to/3qGiucN">James C. Scott on legibility</a></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:P"><p>On the flip side, there are managers who want to maximize the return to their career. At every company I've worked at that wasn't a startup, doing that involves moving up the ladder, which is easiest to do by collecting as many people as possible. At one company I've worked for, the explicitly stated promo criteria are basically &quot;how many people report up to this person&quot;.</p> <p>Tying promotions and compensation to the number of people managed could make sense if you think of people as mostly fungible, but is otherwise an obviously silly idea.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:E">This isn't quite this simple when you take into account retention budgets (money set aside from a pool that doesn't come out of the org's normal budget, often used to match offers from people who are leaving), etc., but adding this nuance doesn't really change the fundamental point. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:B"><p>There are advantages to a system where people don't have power, such as mitigating abuses of power, various biases, nepotism, etc. One can argue that reducing variance in outcomes by making people powerless is the preferred result, but in winner-take-most markets, which many tech markets are, forcing everyone to lowest-common-denominator effectiveness is a recipe for being an also-ran.</p> <p>A specific, small-scale example of this is the massive advantage <a href="corp-eng-blogs/">companies that don't have a bureaucratic comms/PR approval process for technical blog posts have</a>. The theory behind having the onerous process that most companies have is that the company is protected from the downside risk of a bad blog post, but examples of bad engineering blog posts that would've been mitigated by having an onerous process are few and far between, whereas the companies that have good processes for writing publicly get a lot of value that's easy to see.</p> <p>A larger scale example of this is that the large companies that are now worth &gt;= $500B all made aggressive moves that wouldn't have been possible at their bureaucracy-laden competitors, which allowed them to wipe the floor with their competitors. 
Of course, many other companies that made serious bets instead of playing it safe failed more quickly than companies trying to play it safe, but those companies at least had a chance, unlike the companies that played it safe.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:Q"><p>I'm generally skeptical of claims like this. At multiple companies that I've worked for, if you tally up the claimed revenue or user growth wins and compare them to actual revenue or user growth, you can see that there's some funny business going on since the total claimed wins are much larger than the observed total.</p> <p>Just because <a href="why-benchmark/">I'm generally curious about measurements</a>, I sometimes did my own analysis of people's claimed wins and I almost always came up with an estimate that was much lower than the original estimate. Of course, I generally didn't publish these results internally since that would, in general, be a good way to make a lot of enemies without causing any change. In one extreme case, I found that the experimental methodology one entire org used was broken, causing them to get spurious wins in their A/B tests. I quietly informed them and they did nothing about it, which was the only reasonable move for them since having experiments that systematically showed improvement when none existed was a cheap and effective way for the org to gain more power by having its people get promoted and having more headcount allocated to it. And if anyone with power over the bureaucracy cared about accuracy of results, such a large discrepancy between claimed wins and actual results couldn't exist in the first place.</p> <p>Anyway, despite my general skepticism of claimed wins, I found this person's claimed wins highly credible after checking them myself. A project of theirs, done on their paternity leave (done while on leave because their manager and, really, the organization as well as the company, didn't support the kind of work they were doing) increased the company's revenue by 0.7%, a result that was robust and actually increased in value through a long-term holdback, and they were able to produce wins of that magnitude after leadership was embarrassed into allowing them to do valuable work.</p> <p>P.S. If you'd like to play along at home, another fun game you can play is figuring out which teams and orgs hit their roadmap goals. For bonus points, plot the percentage of roadmap goals a team hits vs. their headcount growth as well as how predictive hitting last quarter's goals is for hitting next quarter's goals across teams.</p> <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:L"><p>I've seen quite a few people leave their employers due to location adjustments during the pandemic. In one case, HR insisted the person was actually very well compensated because, even though it might appear as if the person isn't highly paid because they were paid significantly less than many people who were one level below them, according to HR's formula, which included a location-based pay adjustment, the person was one of the highest paid people for their level at the entire company in terms of normalized pay. 
Putting aside abstract considerations about fairness, <a href="https://yosefk.com/blog/compensation-rationality-and-the-projectperson-fit.html">for an employee</a>, HR telling them that they're highly paid given their location is like HR having a formula that pays based on height telling an employee that they're well paid for their height. That may be true according to whatever formula HR has but, practically speaking, that means nothing to the employee, who can go work somewhere that has a smaller height-based pay adjustment.</p> <p>Companies were able to get away with severe location-based pay adjustments with no cost to themselves before the pandemic. But, since the pandemic, a lot of companies have ramped up remote hiring and some of those companies have relatively small location-based pay adjustments, which has allowed them to disproportionately hire away who they choose from companies that still maintain severe location-based pay adjustments.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:T">Technically, their budget ended up being higher than this because one team member contracted typhoid and paid for some medical expenses from their personal budget and not from the organization's budget, but $12/(person-day), the organizational funding, is a pretty good approximation. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> </ol> </div> Culture matters culture/ Mon, 08 Nov 2021 00:00:00 +0000 culture/ <p>Three major tools that companies have to influence behavior are incentives, process, and culture. People often mean different things when talking about these, so I'll provide an example of each so we're on the same page (if you think that I should be using a different word for the concept, feel free to mentally substitute that word).</p> <ul> <li><p>Getting people to show up to meetings on time</p> <ul> <li>Incentive: dock pay for people who are late</li> <li>Process: don't allow anyone who's late into the meeting</li> <li>Culture: people feel strongly about showing up on time</li> </ul></li> <li><p>Getting people to build complex systems</p> <ul> <li>Incentive: require complexity in promo criteria</li> <li>Process: make process for creating or executing on a work item so heavyweight that people stop doing simple work</li> <li>Culture: people enjoy building complex systems and/or building complex systems results in respect from peers and/or prestige</li> </ul></li> <li><p>Avoiding manufacturing defects</p> <ul> <li>Incentive: pay people per good item created and/or dock pay for bad items</li> <li>Process: have QA check items before shipment and discard bad items</li> <li>Culture: people value excellence and try very hard to avoid defects</li> </ul></li> </ul> <p>If you read &quot;old school&quot; thought leaders, many of them advocate for a culture-only approach, e.g., <a href="https://mobile.twitter.com/danluu/status/885214004649615360">Ken Thompson saying, to reduce bug rate, that tools (which, for the purposes of this post, we'll call process) aren't the answer, having people care to and therefore decide to avoid writing bugs is the answer</a> or Bob Martin saying &quot;<a href="https://www.hillelwayne.com/post/uncle-bob/">The solution to the software apocalypse is not more tools. 
The solution is better programming discipline</a>.&quot;</p> <p>The emotional reaction those kinds of over-the-top statements evoke, combined with the ease of rebutting them, has led to a backlash against cultural solutions, leading people to say things like &quot;you should never say that people need more discipline and you should instead look at the incentives of the underlying system&quot;, in the same way that the 10x programmer meme and the associated comments have caused a backlash that's led people to say things like <a href="productivity-velocity/">velocity doesn't matter at all</a> or there's absolutely no difference in velocity between programmers (<a href="https://scattered-thoughts.net/writing/moving-faster/">as Jamie Brandon has noted, a lot of velocity comes down to caring about and working on velocity</a>, so this is also part of the backlash against culture).</p> <p>But if we look at quantifiable output, we can see that, even if processes and incentives are the first-line tools a company should reach for, culture also has a large impact. For example, if we look at manufacturing defect rate, some countries persistently have lower defect rates than others on a timescale of decades<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup>, generally robust across companies, even when companies are operating factories in multiple countries and importing the same process and incentives to each factory to the extent that's possible, due to cultural differences that impact how people work.</p> <p>Coming back to programming, Jamie's post on &quot;moving faster&quot; notes:</p> <blockquote> <p>The main thing that helped is actually wanting to be faster.</p> <p>Early on I definitely cared more about writing 'elegant' code or using fashionable tools than I did about actually solving problems. Maybe not as an explicit belief, but those priorities were clear from my actions.</p> <p>I probably also wasn't aware how much faster it was possible to be. I spent my early career working with people who were as slow and inexperienced as I was.</p> <p>Over time I started to notice that some people are producing projects that are far beyond what I could do in a single lifetime. I wanted to figure out how to do that, which meant giving up my existing beliefs and trying to discover what actually works.</p> </blockquote> <p>I was lucky to have the opposite experience starting out since my first full-time job was at Centaur, a company that, at the time, had very high velocity/productivity. I'd say that I've only ever worked on one team with a similar level of productivity, and that's my current team, but my current team is fairly unusual for a team at a tech company (e.g., the median level on my team is &quot;senior staff&quot;)<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>. A side effect of having started my career at such a high velocity company is that I generally find the pace of development slow at big companies and I see no reason to move slowly just because that's considered normal. I often hear similar comments from people I talk to at big companies who've previously worked at non-dysfunctional but not even particularly fast startups. 
A regular survey at one of the trendiest companies around asks &quot;Do you feel like your dev speed is faster or slower than your previous job?&quot; and the responses are bimodal, depending on whether the respondent came from a small company or a big one (with dev speed at TrendCo being slower than at startups and faster than at larger companies).</p> <p>There's a story that, <a href="https://amzn.to/3EZXykS">IIRC, was told by Brian Enos</a>, where he was practicing timed drills with the goal of practicing until he could complete a specific task at or under his usual time. He was having a hard time hitting his normal time and was annoyed at himself because he was slower than usual and kept at it until he hit his target, at which point he realized he'd misremembered the target and was accidentally targeting a new personal best time that was better than he thought was possible. While it's too simple to say that we can achieve anything if we put our minds to it, <a href="https://twitter.com/nickcammarata/status/1362261305357393920">almost none of us are operating at anywhere near our capacity and what we think we can achieve is often a major limiting factor</a>. Of course, at the limit, there's a tradeoff between velocity and quality and you can't get velocity &quot;for free&quot;, but, when it comes to programming, <a href="p95-skill/">we're so far from the Pareto frontier that there are free wins</a> if you just <a href="https://twitter.com/danluu/status/1442945072144678914">realize that they're available</a>.</p> <p>One way in which culture influences this is that people often absorb their ideas of what's possible from the culture they're in. For a non-velocity example, one thing I noticed after attending <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> was that a lot of speakers at the well-respected non-academic non-enterprise tech conferences, like Deconstruct and Strange Loop, also attended RC. Most people hadn't given talks before attending RC and, when I asked people, a lot of people had wanted to give talks but didn't realize how straightforward the process for becoming a speaker at &quot;big&quot; conferences is (have an idea, write it down, and then submit what you wrote down as a proposal). It turns out that giving talks at conferences is easy to do and a major blocker for many folks is just knowing that it's possible. In an environment where lots of people give talks and where people who hesitantly ask how they can get started are told that it's straightforward, a lot of people will end up giving talks. The same thing is true of blogging, which is why a disproportionately large fraction of widely read programming bloggers started blogging seriously after attending RC. For many people, the barrier to starting a blog is some combination of realizing it's feasible to start a blog and that, from a technical standpoint, it's very easy to start a blog if you just pick any semi-reasonable toolchain and go through the setup process. And then, because people give talks and write blog posts, they get better at giving talks and writing blog posts so, on average, RC alums are probably better speakers and writers than random programmers even though there's little to no skill transfer or instruction at RC.</p> <p>Another place where culture can really drive skills is with skills that are highly attitude dependent. An example of this is debugging. 
As Julia Evans has noted, <a href="https://twitter.com/b0rk/status/1249715842708844544">having a good attitude is a major component of debugging effectiveness</a>. This is something Centaur was very good at instilling in people, to the point that nearly everyone in my org at Centaur would be considered a very strong debugger by tech company standards.</p> <p>At big tech companies, it's common to see people give up on bugs after trying a few random things that didn't work. In one extreme example, someone I know at a mid-10-figure tech company said that it never makes sense to spend more than a couple of hours debugging any bug because engineer time is too valuable to waste on bugs that take longer than that, an attitude this person picked up from the first team they worked on. Someone who picks up that kind of attitude about debugging is unlikely to become a good debugger until they change their attitude, and many people, including this person, carry the attitudes and habits they pick up at their first job around for quite a long time<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">3</a></sup>.</p> <p>By tech standards, Centaur is an extreme example in the other direction. If you're designing a CPU, it's not considered ok to walk away from a bug that you don't understand. Even if the symptom of the bug isn't serious, it's possible that the underlying cause is actually serious and you won't observe the more serious symptom until you've shipped a chip, so you have to go after even seemingly trivial bugs. Also, it's pretty common for there to be no good or even deterministic reproduction of a bug. The repro is often something like &quot;run these programs with these settings on the system and then the system will hang and/or corrupt data after some number of hours or days&quot;. When debugging a bug like that, there will be numerous wrong turns and dead ends, some of which can eat up weeks or months. As a new employee watching people work on those kinds of bugs, what I observed was that people would come in day after day and track down bugs like that, not getting frustrated and not giving up. When that's the culture and everyone around you has that attitude, it's natural to pick up the same attitude. Also, a lot of practical debugging skill is applying tactical skills picked up from having debugged a lot of problems, which naturally falls out of spending a decent amount of time debugging problems with a positive attitude, especially with exposure to hard debugging problems.</p> <p>Of course, most bugs at tech companies don't warrant months of work, but there's a big difference between intentionally leaving some bugs undebugged because some bugs aren't worth fixing and having poor debugging skills from never having debugged a serious bug and then not being able to debug any bug that isn't completely trivial.</p> <p>Cultural attitudes can drive a lot more than individual skills like debugging. Centaur had, per capita, by far the lowest serious production bug rate of any company I've worked for, at well under one per year with ~100 engineers. By comparison, I've never worked on a team 1/10th that size that didn't have at least 10x the rate of serious production issues. Like most startups, Centaur was very light on process and it was also much lighter on incentives than the big tech companies I've worked for.</p> <p>One component of this was that there was a culture of owning problems, regardless of what team you were on. 
If you saw a problem, you'd fix it, or, if there was a very obvious owner, you'd tell them about the problem and they'd fix it. There weren't roadmaps, standups, kanban, or anything else to get people to work on important problems. People did it without needing to be reminded or prompted.</p> <p>That's the opposite of what I've seen at two of the three big tech companies I've worked for, where the median person avoids touching problems outside of their team's mandate like the plague, and someone who isn't politically savvy who brings up a problem to another team will get a default answer of &quot;sorry, this isn't on our roadmap for the quarter, perhaps we can put this on the roadmap in [two quarters from now]&quot;, with the same response repeated to anyone naive enough to bring up the same issue two quarters later. At every tech company I've worked for, huge, extremely costly problems slip through the cracks all the time because no one wants to pick them up. I never observed that happening at Centaur.</p> <p>A side effect of big company tech culture is that someone who wants to actually do the right thing can easily do very high (positive) impact work by just going around and fixing problems that any intern could solve, if they're willing to ignore organizational processes and incentives. You can't shake a stick without <a href="https://twitter.com/danluu/status/802971209176477696">hitting a problem that's worth more to the company than my expected lifetime earnings</a> and it's easy to knock off multiple such problems per year. <a href="algorithms-interviews/#appendix-misaligned-incentive-hedgehog-defense-part-3">Of course, the same forces that cause so many trivial problems to not get solved mean that people who solve those problems don't get rewarded for their work</a><sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">4</a></sup>.</p> <p>Conversely, in eight years at Centaur, I only found one trivial problem whose fix was worth more than I'll earn in my life because, in general, problems would get solved before they got to that point. I've seen various big company attempts to fix this problem using incentives (e.g., monetary rewards for solving important problems) and process (e.g., <a href="https://twitter.com/altluu/status/1497980098107953152">making a giant list of all projects/problems, on the order of 1000 projects, and having a single person order them, along with a bureaucratic system where everyone has to constantly provide updates on their progress via JIRA so that PMs can keep sending progress updates to the person who's providing a total order over the work of thousands of engineers</a><sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">5</a></sup>), but none of those attempts have worked even half as well as having a culture of ownership (to be fair to incentives, I've heard that FB uses monetary rewards to good effect, but <a href="https://twitter.com/danluu/status/1447268693075841024">I've failed FB's interview three times</a>, so I haven't been able to observe how that works myself).</p> <p>Another component that resulted in a relatively low severe bug rate was that, across the company at Centaur, people cared about quality in a way that I've never seen at a team level let alone at an org level at a big tech company. When you have a collection of people who care about quality and feel that no issue is off limits, you'll get quality. 
And when you onboard people, as long as you don't do it so quickly that the culture is overwhelmed by the new hires, they'll also tend to pick up the same habits and values, especially when you hire new grads. While it's not exactly common, there are plenty of small firms out there with a culture of excellence that generally persists without heavyweight processes or big incentives, but this doesn't work at big tech companies since they've all gone through a hypergrowth period where it's impossible to maintain such extreme (by mainstream standards) cultural values.</p> <p>So far, we've mainly discussed companies transmitting culture to people, but something that I think is no less important is how people then carry that culture with them when they leave. I've been <a href="https://twitter.com/danluu/status/1444034823329177602">reasonably successful since changing careers from hardware to software</a> and I think that, among the factors that are under my control, one of the biggest ones is that I picked up effective cultural values from the first place I worked full-time and continue to operate in the same way, which is highly effective. I've also seen this in other people who, career-wise, &quot;grew up&quot; in a culture of excellence and then changed to a different field where there's even less direct skill transfer, e.g., from skiing to civil engineering. Relatedly, if you read books from people who discuss the reasons why they were very effective in their field, e.g., <a href="https://amzn.to/3EZXykS">Practical Shooting by Brian Enos</a>, <a href="https://www.sirlin.net/ptw">Playing to Win by David Sirlin</a>, etc., the books tend to contain the same core ideas (serious observation and improvement of skills, the importance of avoiding emotional self-sabotage, the importance of intuition, etc.).</p> <p>Anyway, I think that cultural transmission of values and skills is an underrated part of choosing a job (some things I would consider overrated are <a href="https://www.patreon.com/posts/25835707">prestige</a> and <a href="https://twitter.com/danluu/status/1275191896097189888">general reputation</a>) and that people should be thoughtful about what cultures they spend time in because not many people are able to avoid at least somewhat absorbing the cultural values around them<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">6</a></sup>.</p> <p>Although this post is oriented around tech, there's nothing specific to tech about this. A classic example is how idealistic students will go to law school with the intention of doing &quot;save the world&quot; type work and then absorb the <a href="https://www.patreon.com/posts/25835707">prestige-transmitted cultural values</a> of the students around them and go into the most prestigious job they can get which, when it's not a clerkship, will be a &quot;BIGLAW&quot; job that's the opposite of &quot;save the world&quot; work. 
To first approximation, everyone thinks &quot;that will never happen to me&quot;, but from having watched many people join organizations where they <a href="wat/">initially find the values and culture very wrong</a>, almost no one is able to stay without, to some extent, absorbing the values around them; <a href="look-stupid/">very few people are ok with everyone around them looking at them like they're an idiot for having the wrong values</a>.</p> <h3 id="appendix-bay-area-culture">Appendix: Bay area culture</h3> <p>One thing I admire about the bay area is how infectious people's attitudes are with respect to trying to change the world. Everywhere I've lived, people gripe about problems (the mortgage industry sucks, selling a house is high friction, etc.). Outside of the bay area, it's just griping, but in the bay, when I talk to someone who was griping about something a year ago, there's a decent chance they've started a startup to try to address one of the problems they're complaining about. I don't think that people in the bay area are fundamentally different from people elsewhere; it's more that when you're surrounded by people who are willing to walk away from their jobs to try to disrupt an entrenched industry, it seems pretty reasonable to do the same thing (which also leads to network effects that make it easier from a &quot;technical&quot; standpoint, e.g., easier fundraising). <a href="https://slatestarcodex.com/2017/05/11/silicon-valley-a-reality-check/">There's a kind of earnestness in these sorts of complaints and attempts to fix them that's easy to mock</a>, but <a href="look-stupid/">that earnestness is something I really admire</a>.</p> <p>Of course, <a href="https://twitter.com/pushcx/status/1442860058660913166">not all of bay area culture is positive</a>. The bay has, among other things, <a href="https://devonzuegel.com/post/why-is-flaking-so-widespread-in-san-francisco">a famously flaky culture</a> to an extent I found shocking when I moved there. Relatively early on in my time there, I met some old friends for dinner and texted them telling them I was going to be about 15 minutes late. They were shocked when I showed up because they thought that saying that I was going to be late actually meant that I wasn't going to show up (an even more extreme version of this norm, which also surprised me, was that, for many people, not confirming plans shortly before their commencement means that the person has cancelled, i.e., plans are cancelled by default).</p> <p>A related norm that I've heard people complain about is how management and leadership will say yes to everything in a &quot;people pleasing&quot; move to avoid conflict, which actually increases conflict as people who heard &quot;yes&quot; as a &quot;yes&quot; and not as &quot;I'm saying yes to avoid saying no but don't actually mean yes&quot; are later surprised that &quot;yes&quot; meant &quot;no&quot;.</p> <h3 id="appendix-centaur-s-hiring-process">Appendix: Centaur's hiring process</h3> <p>One comment people sometimes have when I talk about Centaur is that they must've had some kind of incredibly rigorous hiring process that resulted in hiring elite engineers, but the hiring process was much less selective than any &quot;brand name&quot; big tech company I've worked for (Google, MS, and Twitter) and not obviously more selective than boring, old-school companies I've worked for (IBM and Micron). 
The &quot;one weird trick&quot; was onboarding, not hiring.</p> <p>For new grad hiring (and, proportionally, we hired a lot of new grads), recruiting was more difficult than at any other company I'd worked for. Senior hiring wasn't difficult because Centaur had a good reputation locally, in Austin, but among new grads, no one had heard of us and no one wanted to work for us. When I recruited at career fairs, I had to stand out in front of our booth and flag down people who were walking by to get anyone to talk to us. This meant that we couldn't be picky about who we interviewed. We really ramped up hiring of new grads around the time that Jeff Atwood, in his very influential post <a href="https://blog.codinghorror.com/why-cant-programmers-program/" rel="nofollow">Why Can't Programmers.. Program?</a>, popularized the idea that there are a bunch of fake programmers out there applying for jobs and that you'd end up with programmers who can't program if you don't screen people out with basic coding questions (the bolding below is his):</p> <blockquote> <p><strong>I am disturbed and appalled that any so-called programmer would apply for a job without being able to write the simplest of programs</strong>. That's a slap in the face to anyone who writes software for a living. ... It's a shame you have to do so much pre-screening to <strong>have the luxury of interviewing programmers who can actually program</strong>. It'd be funny if it wasn't so damn depressing</p> </blockquote> <p>Since we were a relatively coding oriented hardware shop (verification engineers primarily wrote software and design engineers wrote a lot of tooling), we tried asking a simple coding question where people were required to code up a function to output Fibonacci numbers given a description of how to compute them (the naive solution was fine; a linear time or faster solution wasn't necessary; a rough sketch of both is below). We dropped that question because no one got it without being walked through the entire thing in detail, which meant that the question had zero discriminatory power for us.</p> <p>Despite not really asking a coding question, people did things like write hairy concurrent code (internal processor microcode, which often used barriers as the concurrency control mechanism) and create tools at a higher velocity and lower bug rate than I've seen anywhere else I've worked.</p> <p>We were much better off avoiding hiring the way everyone else was because that meant we tried to and did hire people that other companies weren't competing over. That wouldn't make sense if other companies were using techniques that were highly effective, but they were doing things like asking people to code FizzBuzz and then whiteboard some algorithms. <a href="algorithms-interviews/">One might expect that doing algorithms interviews would result in hiring people who can solve the exact problems people ask about in interviews, but this turns out not to be the case</a>. <a href="https://twitter.com/danluu/status/1425514112642080773">The other thing we did was have much less of a prestige filter than most companies</a>, which also let us hire great engineers that other companies wouldn't even consider.</p> <p>We did have some people who didn't work out, but it was never because they were &quot;so-called programmers&quot; who couldn't &quot;write the simplest of programs&quot;. 
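For concreteness, here's a minimal sketch of the Fibonacci question mentioned above; this is an illustration in Python with made-up function names, not the exact question or wording we used. The naive, exponential-time version would've been a perfectly acceptable answer, and the linear-time version wasn't required:
<pre><code># A hedged, illustrative sketch (Python, made-up names) of the kind of
# answer described above; not the exact question or wording Centaur used.

def fib_naive(n):
    # fib(0) = 0, fib(1) = 1, fib(n) = fib(n-1) + fib(n-2)
    # Exponential time, but this would have counted as a pass.
    if n in (0, 1):
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_linear(n):
    # Linear-time version, which was explicitly not required.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib_naive(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(fib_linear(50))                     # 12586269025
</code></pre>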
I do know of two cases of &quot;fake programmers&quot; being hired who literally couldn't program, but both were at prestigious companies that have among the most rigorous coding interviews done at tech companies. In one case, it was discovered pretty quickly that the person couldn't code and people went back to review security footage from the interview and realized that the person who interviewed wasn't the person who showed up to do the job. In the other, the person was able to sneak under the radar at Google for multiple years before someone realized that the person never actually wrote any code and tasks only got completed when they got someone else to do the task. The person who realized this eventually scheduled a pair programming session, where they discovered that the person wasn't able to write a loop, didn't know the difference between <code>=</code> and <code>==</code>, etc., despite being a &quot;senior SWE&quot; (L5/T5) at Google for years.</p> <p>I'm not going to say that having coding questions will never save you from hiring a fake programmer, but the rate of fake programmers appears to be low enough that a small company can go a decade without hiring a fake programmer even without asking a coding question, while larger companies that are targeted by scammers still can't really avoid them even after asking coding questions.</p> <h3 id="appendix-importing-culture">Appendix: importing culture</h3> <p>Although this post is about how company culture impacts employees, of course employees impact company culture as well. Something that seems underrated in hiring, especially of senior leadership and senior ICs, is how they'll impact culture. Something I've repeatedly seen, both up close and from a distance, is the hiring of a new senior person who manages to import their culture, which isn't compatible with the existing company's culture, causing serious problems and, frequently, high attrition, as things settle down.</p> <p>Now that I've been around for a while, I've been in the room for discussions on a number of very senior hires and I've never seen anyone else bring up whether or not someone will import incompatible cultural values, other than really blatant issues, like the person being a jerk or making racist or sexist comments in the interview.</p> <p>Thanks to Peter Bhat Harkins, Laurence Tratt, Julian Squires, Anja Boskovic, Tao L., Justin Blank, Ben Kuhn, V. Buckenham, Mark Papadakis, and Jamie Brandon for comments/corrections/discussion.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:C">Which countries actually have low defect rate manufacturing is often quite different from their general public reputation. To see this, you really need to look at the data, which is often NDA'd and generally only spread in &quot;bar room&quot; discussions. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:S">Centaur had what I sometimes called &quot;the world's stupidest business model&quot;, competing with Intel on x86 chips starting in 1995, so it needed an extremely high level of productivity to survive. Through the bad years, AMD survived by selling off pieces of itself to fund continued x86 development and every other competitor (Rise, Cyrix, TI, IBM, UMC, NEC, and Transmeta) got wiped out. 
If you compare Centaur to the longest surviving competitor that went under, Transmeta, Centaur just plain shipped more quickly, which is a major reason that Centaur was able to survive until 2021 (when it was pseudo-acqui-hired by Intel) and Transmeta went under in 2009 after burning through ~$1B of funding (including payouts from lawsuits). Transmeta was founded in 1995 and shipped its first chip in 2000, which was considered a normal tempo for the creation of a new CPU/microarchitecture at the time; Centaur shipped its first chip in 1997 and continued shipping at a high cadence until 2010 or so (how things got slower and slower until the company stalled out and got acqui-hired is a topic for another post). <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:N">This person initially thought the processes and values on their first team were absurd before <a href="wat/">the cognitive dissonance got to them and they became a staunch advocate of the company's culture, which is typical for folks joining a company that has obviously terrible practices</a>. <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> <li id="fn:B">This illustrates one way in which incentives and culture are non-independent. What I've seen in places where this kind of work isn't rewarded is that, due to the culture, making these sorts of high-impact changes frequently requires burnout-inducing slogs, at the end of which there is no reward, which causes higher attrition among people who have a tendency to own problems and do high-impact work. What I've observed in environments like this is that the environment differentially retains people who don't want to own problems, which then makes things more difficult and more burnout-inducing for new people who join and attempt to fix serious problems. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:R">I'm adding this note because, when I've described this to people, many people thought that this must be satire. It is not satire. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:L"><p>As with many other qualities, there can be high variance within a company as well as across companies. For example, there's a team I sometimes encountered at a company I've worked for that has a very different idea of customer service than most of the company, and people who join that team and don't quickly bounce usually absorb their values.</p> <p>Much of the company has a pleasant attitude towards internal customers, but this team has a &quot;the customer is always wrong&quot; attitude. A funny side effect of this is that, when I dealt with the team, I got the best support when a junior engineer who hadn't absorbed the team's culture was on call, and sometimes a senior engineer would say something was impossible or infeasible only to have a junior engineer follow up and trivially solve the problem.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Willingness to look stupid look-stupid/ Thu, 21 Oct 2021 00:00:00 +0000 look-stupid/ <p>People frequently<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">1</a></sup> think that I'm very stupid. I don't find this surprising, since I don't mind if other people think I'm stupid, which means that I don't adjust my behavior to avoid seeming stupid, which results in people thinking that I'm stupid. 
Although there are some downsides to people thinking that I'm stupid, e.g., failing interviews where the interviewer very clearly thought I was stupid, I think that, overall, the upsides of being willing to look stupid have greatly outweighed the downsides.</p> <p>I don't know why this one example sticks in my head but, for me, the most memorable example of other people thinking that I'm stupid was from college. I've had numerous instances where more people thought I was stupid and also where people thought the depth of my stupidity was greater, but this one was really memorable for me.</p> <p>Back in college, there was one group of folks that, for whatever reason, stood out to me as people who really didn't understand the class material. When they talked, they said things that didn't make any sense, they were struggling in the classes and barely passing, etc. I don't remember any direct interactions but, one day, a friend of mine who also knew them remarked to me, &quot;did you know [that group] thinks you're really dumb?&quot;. I found that interesting and asked why. It turned out the reason was that I asked really stupid sounding questions.</p> <p>In particular, it's often the case that there's a seemingly obvious but actually incorrect reason something is true, a slightly less obvious reason the thing seems untrue, and then a subtle and complex reason that the thing is actually true<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>. I would regularly figure out that the seemingly obvious reason was wrong and then ask a question to try to understand the subtler reason, which sounded stupid to someone who thought the seemingly obvious reason was correct or thought that the refutation of the obvious but incorrect reason meant that the thing was untrue.</p> <p>The benefit from asking a stupid sounding question is small in most particular instances, but the compounding benefit over time is quite large and I've observed that people who are willing to ask dumb questions and think &quot;stupid thoughts&quot; end up understanding things much more deeply over time. Conversely, when I look at people who have a very deep understanding of topics, many of them frequently ask naive sounding questions and continue to apply one of the techniques that got them a deep understanding in the first place.</p> <p>I think I first became sure of something that I think of as a symptom of the underlying phenomenon via playing competitive video games when I was in high school. There were few enough people playing video games online back then that you'd basically recognize everyone who played the same game and could see how much everyone improved over time. Just like <a href="p95-skill/">I saw when I tried out video games again a couple years ago</a>, most people would blame external factors (lag, luck, a glitch, teammates, unfairness, etc.) when they &quot;died&quot; in the game. The most striking thing about that was that people who did that almost never became good and never became great. I got pretty good at the game<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">3</a></sup> and my &quot;one weird trick&quot; was to think about what went wrong every time something went wrong and then try to improve. 
But most people seemed more interested in making an excuse to avoid looking stupid (or maybe feeling stupid) in the moment than actually improving, which, of course, resulted in them having many more moments where they looked stupid in the game.</p> <p>In general, I've found willingness to look stupid to be very effective. Here are some more examples:</p> <ul> <li>Going into an Apple store and asking for (and buying) the computer that comes in the smallest box, which I had a good reason to want at the time <ul> <li>The person who helped me, despite being very polite, also clearly thought I was a bozo and kept explaining things like &quot;the size of the box and the size of the computer aren't the same&quot;. Of course I knew that, but I didn't want to say something like &quot;I design CPUs. I understand the difference between the size of the box the computer comes in and the size of the computer and I know it's very unusual to care about the size of the box, but I really want the one that comes in the smallest box&quot;. Just saying the last bit without establishing any kind of authority didn't convince the person</li> <li>I eventually asked them to humor me and just bring out the boxes for the various laptop models so I could see the boxes, which they did, despite clearly thinking that my decision making process made no sense (<a href="https://twitter.com/altluu/status/1452704171447123969">I also tried explaining why I wanted the smallest box but that didn't work</a>)</li> </ul></li> <li>Covid: I took this seriously relatively early on and bought a half mask respirator on 2020-01-26 and was using N95s I'd already had on hand for the week before (IMO, the case that covid was airborne and that air filtration would help was very strong based on the existing literature on SARS contact tracing, filtration of viruses from air filters, and viral load) <ul> <li>It wasn't until many months later that people didn't generally look at me like I was an idiot, and even as late as 2020-08, I would sometimes run into people who would verbally make fun of me</li> <li>On the flip side, the person I was living with at the time didn't want to wear the mask I got her since she found it too embarrassing to wear a mask for the 1 hour round-trip BART ride to and from a maker space when no one else was wearing one on BART or at the maker space. 
She became one of the early bay area covid cases, which gave her a case of long covid that floored her for months <ul> <li>When she got covid, I tried to convince her that she should tell people at the maker space she'd been going to that she got covid so they would know that they were exposed and could take appropriate precautions in order to avoid accidentally spreading covid, but she also found admitting that she might've spread covid to people too embarrassing to do (in retrospect, I should've just called up the maker space and told them)</li> </ul></li> <li>A semi-related one is that, when Canada started doing vaccines, I wanted to get Moderna even though the general consensus online and in my social circles was that Pfizer was preferred <ul> <li>One reason for this was that it wasn't clear if the government was going to allow mixing vaccines and the delivery schedule implied that there would be a very large shortage of Pfizer for 2nd doses as well as a large supply of Moderna</li> <li>Another thought that had crossed my mind was that Moderna is basically &quot;more stuff&quot; than Pfizer and might convey better immunity in some cases, in the same way that some populations get high-dose flu shots to get better immunity</li> </ul></li> </ul></li> <li>Work: I generally don't worry about proposals or actions looking stupid <ul> <li>I can still remember the first time I explicitly ran into this. This was very early on in my career, when I was working on chip verification. Shortly before tape-out, the head of verification wanted to use our compute resources to re-run a set of tests that had virtually no chance of finding any bugs (they'd been run thousands of times before) instead of running the usual mix of tests, which would include a lot of newly generated tests that had a much better chance of finding a bug (this was both logically and empirically true). I argued that we should run the tests that reduced the odds of shipping with a show-stopping bug (which would cost us millions of dollars and delay shipping by three months), but the head of the group said that we would look stupid and incompetent if there was a bug that could've been caught by one of our old &quot;golden&quot; tests that snuck in since the last time we'd run those tests <ul> <li>At the time, I was shocked that somebody would deliberately do the wrong thing in order to reduce the odds of potentially looking stupid (and, really, only looking stupid to people who wouldn't understand the logic of running the best available mix of tests; since there weren't non-technical people anywhere in the management chain, anyone competent should understand the reasoning) but now that I've worked at various companies in multiple industries, I see that most people would choose to do the wrong thing to avoid potentially looking stupid to people who are incompetent. 
I see the logic, but I think that it's self-sabotaging to behave that way and that the gains to my career for standing up for what I believe are the right thing have been so large that, even if the next ten times I do so, I get unlucky and it doesn't work out, that still won't erase the gains I've made from having done the right thing many times in the past</li> </ul></li> </ul></li> <li>Air filtration: I did a bit of looking into the impact of air quality on health and bought air filters for my apartment in 2012 <ul> <li>Friends have been chiding about this for years and strangers, dates, and acquaintances, will sometimes tell me, with varying levels of bluntness, that I'm being paranoid and stupid</li> <li>I added more air filtration capacity when I moved to a wildfire risk area <a href="https://mobile.twitter.com/altluu/status/1409762306452459520">after looking into wildfire risk</a> which increased the rate and bluntness of people telling me that I'm weird for having air filters <ul> <li>I've been basically totally unimpacted by wildfire despite living through a fairly severe wildfire season twice</li> <li>Other folks I know experienced some degree of discomfort, with a couple people developing persistent issues after the smoke exposure (in one case, persistent asthma, which they didn't have before or at least hadn't noticed before)</li> </ul></li> </ul></li> <li>Learning things that are hard for me: this is a &quot;feeling stupid&quot; thing and not a &quot;looking stupid&quot; thing, but when I struggle with something, I feel really dumb, as in, I have a feeling/emotion that I would verbally describe as &quot;feeling dumb&quot; <ul> <li>When I was pretty young, I think before I was a teenager, I noticed that this happened when I learned things that were hard for me and tried to think of this feeling as &quot;the feeling of learning something&quot; instead of &quot;feeling dumb&quot;, which half worked (I now associate that feeling with the former as well as the latter)</li> </ul></li> <li>Asking questions: covered above, but I frequently ask questions when there's something I don't understand or know, from basic stuff, &quot;what does [some word] mean?&quot; to more subtle stuff. <ul> <li>On the flip side, one of the most common failure modes I see with junior engineers is when someone will be too afraid to look stupid to ask questions and then learn very slowly as a result; in some cases, this is so severe it results in them being put on a PIP and then getting fired <ul> <li>I'm sure there are other reasons this can happen, like not wanting to bother people, but in the cases where I've been close enough to the situation to ask, it was always embarrassment and fear of looking stupid</li> <li>I try to be careful to avoid this failure mode when onboarding interns and junior folks and have generally been sucessful, but it's taken me up to six weeks to convince people that it's ok for them to ask questions and, until that happens, I have to constantly ask them how things are going to make sure they're not stuck. That works fine if someone is my intern, but I can observe that many intern and new hire mentors do not do this and that often results in a bad outcome for all parties <ul> <li>In almost every case, the person had at least interned at other companies, but they hadn't learned that it was ok to ask questions. P.S. 
if you're a junior engineer at a place where it's not ok to ask questions, you should look for another job if circumstances permit</li> </ul></li> </ul></li> </ul></li> <li>Not making excuses for failures: covered above for video games, but applies a lot more generally</li> <li>When learning, deliberately playing around in the area between success and failure (this applies to things like video games and sports as well as abstract intellectual pursuits) <ul> <li>An example would be, when learning to climb, repeatedly trying the same easy move over and over again in various ways to understand what works better and what works worse. I've had strangers make fun of me and literally point at me and make snide comments to their friends while I'm doing things like this</li> <li>When learning to drive, I wanted to set up some cones and drive so that I barely hit them, to understand where the edge of the car is. My father thought this idea was very stupid and I should just not hit things like curbs or cones</li> </ul></li> <li>Car insurance: the last time I bought car insurance, I had to confirm three times that I only wanted coverage for damage I do to others with no coverage for damage to my own vehicle if I'm at fault. The insurance agent was unable to refrain from looking at me like I'm an idiot and was more incredulous each time they asked if I was really sure</li> <li>The styling and content on this website: I regularly get design folks and typographers telling me how stupid the design is, frequently in ways that become condescending very quickly if I engage with them <ul> <li>But, when I tested out switching to the current design from the generally highly lauded Octopress design, this one got much better engagement when a user landed on the site and also appeared to get passed around a lot more as well</li> <li>When I've compared my traffic numbers to major corporate blogs, my blog completely dominates most &lt; $100B companies (e.g., it gets an order of magnitude more traffic than my employer's blog and my employer is a $50B company)</li> <li>When I started my blog (and this is still true today), writing advice for programming blogs <a href="https://twitter.com/danluu/status/1437539076324790274">was to keep it short, maybe 500 to 1000 words</a>. 
Most of my blog posts are 5000 to 10000 words</li> </ul></li> <li>Taking my current job, which almost everyone thought was a stupid idea <ul> <li>Closely related: quitting my job at Centaur to attend <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> and then eventually changing fields into software (I don't think this would be considered as stupid now, but it was thought to be a very stupid thing to do in 2013)</li> </ul></li> <li>Learning a sport or video game: I try things out to understand what happens when you do them, which often results in other people thinking that I'm a complete idiot when the thing looks stupid, but being willing to look stupid helps me improve relatively quickly</li> <li>Medical care: I've found that a lot of doctors are very confident in their opinion and get condescending pretty fast if you disagree <ul> <li>And yet, in the most extreme case, I would have died if I listened to my doctor; in the next most extreme case, I would have gone blind</li> <li>When getting blood draws, I explain to people that I'm deceptively difficult to draw from and tell them what's worked in the past <ul> <li>About half the time, the nurse or phlebotomist takes my comments seriously, generally resulting in a straightforward and painless or nearly painless blood draw</li> <li>About half the time, the nurse or phlebotomist looks at me like I'm an idiot and makes angry and/or condescending comments towards me; so far, everyone who's done this has failed to draw blood and/or given me a hematoma</li> <li>I've had people tell me that I'm probably stating my preferences an offensive way and that I should be more polite; I've then invited them along with me to observe and no one has ever had a suggestion on how I could state things different to elicit a larger fraction of positive responses; in general, people are shocked and upset when they see how nurses and phlebotomists respond</li> <li>In retrospect, I should probably just get up and leave when someone has the &quot;bad&quot; response, which will probably increase the person's feeling that I'm stupid</li> <li>One issue I have (and not the main one that makes it hard to &quot;get a stick&quot;) is that, during a blood draw, the blood will slow down and then usually stop. Some nurses like to wiggle the needle around to see if that starts things up again, which sometimes works (maybe 50/50) and will generally leave me with a giant bruise or a hematoma or both. After this happened a few times, I asked if getting my blood flowing (e.g., by moving around a lot before a blood draw) could make a difference and every nurse or phlebotomist I talked to said that was silly and that it wouldn't make any difference. 
I tried it anyway and that solved this problem, although I still have the problem of being hard to stick properly</li> </ul></li> </ul></li> <li>Interviews: I'm generally not adversarial in interviews, but I try to say things that I think are true and try to avoid saying things that I think are false and this frequently <a href="https://twitter.com/danluu/status/1447268693075841024">causes interviewers to think that I'm stupid</a> (<a href="algorithms-interviews/">I generally fail interviews at a fairly high rate</a>, so who knows for sure if this is related, but having someone look at you like you're an idiot or start using a condescending tone of voice with condescending body language after you say something &quot;stupid&quot; seems like a bad sign).</li> <li>Generally trying to improve at things as well as being earnest <ul> <li>Even before &quot;tryhard&quot; was an insult, a lot of people in my extended social circles thought that being a tryhard was idiotic and that one shouldn't try and should instead play it cool (this was before I worked as an engineer; as an engineer, I think that effort is more highly respected than among my classmates from school as well as internet folks I knew back when I was in school)</li> </ul></li> <li>Generally admitting when I'm bad or untalented at stuff, e.g., <a href="learning-to-program/">mentioning that I struggled to learn to program in this post</a>; an interviewer at Jane Street really dug into what I'd written in that post and tore me a new one for that post (it was the most hostile interview I've ever experienced by a very large margin), which is the kind of thing that sometimes happens when you're earnest and put yourself out there, but I still view the upsides as being greater than the downsides</li> <li>Recruiting: I have an unorthodox recruiting pitch which candidly leads with the downsides, often causing people to say that I'm a terrible recruiter (or sarcastically say that I'm a great recruiter); I haven't publicly written up the pitch (yet?) because it's negative enough that I'm concerned that I'd be fired for putting it on the internet <ul> <li>I have never failed to close a full-time candidate (I once failed to close an intern candidate) and have brought in a lot of people who never would've considered working for us otherwise. My recruiting pitch sounds comically stupid, but it's much more effective than the standard recruiting spiel most people give</li> </ul></li> <li>Posting things on the internet: self explanatory</li> </ul> <p>Although most of the examples above are &quot;real life&quot; examples, being willing to look stupid is also highly effective at work. Besides the obvious reason that it allows you to learn faster and become more effective, it also makes it much easier to find high ROI ideas. If you go after trendy or reasonable sounding ideas, to do something really extraordinary, you have to have better ideas/execution than everyone else working on the same problem. But if you're thinking about ideas that most people consider too stupid to consider, you'll often run into ideas that are both very high ROI as well as simple and easy that anyone could've done had they not dismissed the idea out of hand. It may still technically be true that you need to have better execution than anyone else who's trying the same thing, but if no one else trying the same thing, that's easy to do!</p> <p>I don't actually have to be nearly as smart or work nearly as hard as most people to get good results. 
If I try to solve some a problem by doing what everyone else is doing and go looking for problems where everyone else is looking, if I want to do something valuable, I'll have to do better than a lot of people, maybe even better than everybody else if the problem is really hard. If the problem is considered trendy, a lot of very smart and hardworking people will be treading the same ground and doing better than that is very difficult. But I have a dumb thought, one that's too stupid sounding for anyone else to try, I don't necessarily have to be particularly smart or talented or hardworking to come up with valuable solutions. Often, the dumb solution is something any idiot could've come up with and the reason the problem hasn't been solved is because no one was willing to think the dumb thought until an idiot like me looked at the problem.</p> <p>Overall, I view the upsides of being willing to look stupid as much larger than the downsides. When it comes to things that aren't socially judged, like winning a game, understanding something, or being able to build things due to having a good understanding, it's all upside. There can be downside for things that are &quot;about&quot; social judgement, like interviews and dates but, even there, I think a lot of things that might seem like downsides are actually upsides.</p> <p>For example, if a date thinks I'm stupid because I ask them what a word means, so much so that they show it in their facial expression and/or tone of voice, I think it's pretty unlikely that we're compatible, so I view finding that out sooner rather than later as upside and not downside.</p> <p>Interviews are the case where I think there's the most downside since, at large companies, the interviewer likely has no connection to the job or your co-workers, so them having a pattern of interaction that I would view as a downside has no direct bearing on the work environment I'd have if I were offered the job and took it. There's probably some correlation but I can probably get much more signal on that elsewhere. But I think that being willing to say things that I know have a good chance of causing people to think I'm stupid is a deeply ingrained enough habit that it's not worth changing just for interviews and I can't think of another context where the cost is nearly as high as it is in interviews. In principle, I could probably change how I filter what I say only in interviews, but I think that would be a very large amount of work and not really worth the cost. An easier thing to do would be to change how I think so that I reflexively avoid thinking and saying &quot;stupid&quot; thoughts, which a lot of folks seem to do, but that seems even more costly.</p> <h3 id="appendix-do-you-try-to-avoid-looking-stupid">Appendix: do you try to avoid looking stupid?</h3> <p>On reading a draft of this, Ben Kuhn remarked,</p> <blockquote> <p>[this post] caused me to realize that I'm actually very bad at this, at least compared to you but perhaps also just bad in general.</p> <p>I asked myself &quot;why can't Dan just avoid saying things that make him look stupid specifically in interviews,&quot; then I started thinking about what the mental processes involved must look like in order for that to be impossible, and realized they must be extremely different from mine. 
Then tried to think about the last time I did something that made someone think I was stupid and realized I didn't have a readily available example)</p> </blockquote> <p>One problem I expect this post to have is that most people will read this and decide that they're very willing to look stupid. This reminds me of how most people, when asked, think that they're creative, innovative, and take big risks. I think that feels true since people often operate at the edge of their comfort zone, but there's a difference between feeling like you're taking big risks and taking big risks, e.g., when asked, someone I know who is among the most conservative people I know thinks that they take a lot of big risks and names things like sometimes jaywalking as risk that they take.</p> <p>This might sound ridiculous, <a href="everything-is-broken/">as ridiculous as saying that I run into hundreds to thousands of software bugs per week</a>, but I think I run into someone who thinks that I'm an idiot in a way that's obvious to me around once a week. The car insurance example is from a few days ago, and if I wanted to think of other recent examples, there's a long string of them.</p> <p>If you don't regularly have people thinking that you're stupid, I think it's likely that at least one of the following is true</p> <ul> <li>You have extremely filtered interactions with people and basically only interact with people of your choosing and you have filtered out any people who have the reactions describe in this post <ul> <li>If you count internet comments, then you do not post things to the internet or do not read internet comments</li> </ul></li> <li>You are avoiding looking stupid</li> <li>You are not noticing when people think you're stupid</li> </ul> <p>I think the last one of those is unlikely because, while I sometimes have interactions like the school one described, where the people were too nice to tell me that they think I'm stupid and I only found out via a third party, just as often, the person very clearly wants me to know that they think I'm stupid. The way it happens reminds me of being a pedestrian in NYC, where, when a car tries to cut you off when you have right of way and fails (e.g., when you're crossing a crosswalk and have the walk signal and the driver guns it to try to get in front of you to turn right), the driver will often scream at you and gesture angrily until you acknowledge them and, if you ignore them, will try very hard to get your attention. In the same way that it seems very important to some people who are angry that you know they're angry, many people seem to think it's very important that you know that they think that you're stupid and will keep increasing the intensity of their responses until you acknowledge that they think you're stupid.</p> <p>One thing that might be worth noting is that I don't go out of my way to sound stupid or otherwise be non-conformist. If anything, it's the opposite. I generally try to conform in areas that aren't important to me when it's easy to conform, e.g., I dressed more casually in the office on the west coast than on the east coast since it's not important to me to convey some particular image based on how I dress and I'd rather spend my &quot;weirdness points&quot; on pushing radical ideas than on dressing unusually. 
After I changed how I dressed, one of the few people in the office who dressed really sharply in a way that would've been normal in the east coast office jokingly said to me, &quot;so, the west coast got to you, huh?&quot; and a few other people remarked that I looked a lot less stuffy/formal.</p> <p>Another thing to note is that &quot;avoiding looking stupid&quot; seems to usually go beyond just filtering out comments or actions that might come off as stupid. Most people I talk to (and Ben is an exception here) have a real aversion evaluating stupid thoughts and (I'm guessing) also to having stupid thoughts. When I have an idea that sounds stupid, it's generally (and again, Ben is an exception here) extremely difficult to get someone to really consider the idea. Instead, most people reflexively reject the idea without really engaging with it at all and (I'm guessing) the same thing happens inside their heads when a potentially stupid sounding thought might occur to them. I think the danger here is not having a concious process that lets you decide to broadcast or not broadcast stupid sounding thoughts (that seems great if it's low overhead), and instead it's having some non-concious process automatically reject thinking about stupid sounding things.</p> <p>Of course, stupid-sounding thoughts are frequently wrong, so, if you're not going to rely on social proof to filter out bad ideas, you'll have to hone your intuition or find trusted friends/colleagues who are able to catch your stupid-sounding ideas that are actually stupid. That's beyond the scope of this post. but I'll note that <a href="p95-skill/">because almost no one attempts to hone their intuition for this kind of thing, it's very easy to get relatively good at it by just trying to do it at all</a>.</p> <h3 id="appendix-stories-from-other-people">Appendix: stories from other people</h3> <p>A disproportionate fraction of people whose work I really respect operate in a similar way to me with respect to looking stupid and also have a lot of stories about looking stupid.</p> <p>One example from Laurence Tratt is from when he was job searching:</p> <blockquote> <p>I remember being rejected from a job at my current employer because a senior person who knew me told other people that I was &quot;too stupid&quot;. For a long time, I found this bemusing (I thought I must be missing out on some deep insights), but eventually I found it highly amusing, to the point I enjoy playing with it.</p> </blockquote> <p>Another example: the other day, when I was talking to Gary Bernhardt, he told me a story about a time when he was chatting with someone who specialized in microservices on Kubernetes for startups and Gary said that he thought that most small (by transaction volume) startups could get away with being on a managed platform like Heroku or Google App Engine. The more Gary explained about his opinion, the more sure the person was that Gary was stupid.</p> <h3 id="appendix-context">Appendix: context</h3> <p>There are a lot of contexts that I'm not exposed to where it may be much more effective to train yourself to avoid looking stupid or incompetent, e.g., <a href="https://twitter.com/apartovi/status/1449856639331340289">see this story by Ali Partovi about how his honesty led to Paul Graham's company being acquired by Yahoo instead of his own, which eventually led to Paul Graham founding YC and becoming one of the most well-known and influential people in the valley</a>. 
If you're in a context where it's more important to look competent than to be competent then this post doesn't apply to you. Personally, I've tried to avoid such contexts, although they're probably more lucrative than the contexts I operate in.</p> <h3 id="appendix-how-to-not-care-about-looking-stupid">Appendix: how to not care about looking stupid</h3> <p>This post has discussed what to do but not how to do it. Unfortunately, &quot;how&quot; is idiosyncratic and will vary greatly by person, so general advice here won't be effective. For myself, for better or for worse, this one came easy to me as I genuinely felt that I was fairly stupid during my formative years, so the idea that some random person thinks I'm stupid is like water off a duck's back.</p> <p>It's hard to say why anyone feels a certain way about anything, but I'm going to guess that, for me, it was a combination of two things. First, my childhood friends were all a lot smarter than me. In the abstract, I knew that there were other kids out there who weren't obviously smarter than me but, weighted by interactions, most of my interactions were with my friends, which influenced how I felt more than reasoning about the distribution of people that were out there. Second, I grew up in a fairly abusive household and one of the minor things that went along with the abuse was regularly being yelled at, sometimes for hours on end, for being so shamefully, embarrassingly, stupid (I was in the same class as <a href="https://en.wikipedia.org/wiki/Po-Shen_Loh">this kid</a> and my father was deeply ashamed that I didn't measure up).</p> <p>I wouldn't exactly recommend this path, but it seems to have worked out ok.</p> <p>Thanks to Ben Kuhn, Laurence Tratt, Jeshua Smith, Niels Olson, Justin Blank, Tao L., Colby Russell, Anja Boskovic, David Coletta, @conservatif, and Ahmad Jarara for comments/corrections/discussion.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:F">This happens in a way that I notice something like once a week and it seems like it must happen much more frequently in ways that I don't notice. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:T"><p>A semi-recent example of this from my life is when I wanted to understand why wider tires have better grip. A naive reason one might think this is true is that wider tire = larger contact patch = more friction, and a lot of people seem to believe the naive reason. A reason the naive reason is wrong is because, as long as the tire is inflated semi-reasonably, given a fixed vehicle weight and tire pressure, the total size of the tire's contact patch won't change when tire width is changed. Another naive reason that the original naive reason is wrong is that, at a &quot;spherical cow&quot; level of detail, the level of grip is unrelated to the contact patch size.</p> <p>Most people I talked who don't race cars (e.g., autocross, drag racing, etc.) 
and <a href="https://twitter.com/danluu/status/1304093800474636288">the top search results online used the refutation to the naive reason plus an incorrect application of high school physics to incorrectly conclude that varying tire width has no effect on grip</a>.</p> <p>But there is an effect and the reason is subtler than more width = larger contact patch.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:B">I was arguably #1 in the world one season, when I put up a statistically dominant performance and my team won every game I played even though I disproportionately played in games against other top teams (and we weren't undefeated and other top players on the team played in games we lost). <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> What to learn learn-what/ Mon, 18 Oct 2021 00:00:00 +0000 learn-what/ <p>It's common to see people advocate for learning skills that they have or using processes that they use. For example, Steve Yegge has a set of blog posts where he recommends reading compiler books and learning about compilers. His reasoning is basically that, if you understand compilers, you'll see compiler problems everywhere and will recognize all of the cases where people are solving a compiler problem without using compiler knowledge. Instead of hacking together some half-baked solution that will never work, you can apply a bit of computer science knowledge to solve the problem in a better way with less effort. That's not untrue, but it's also not a reason to study compilers in particular because you can say that about many different areas of computer science and math. Queuing theory, computer architecture, mathematical optimization, operations research, etc.</p> <p>One response to that kind of objection is to say that <a href="https://twitter.com/danluu/status/899141882760110081">one should study everything</a>. While being an extremely broad generalist can work, it's gotten much harder to &quot;know a bit of everything&quot; and be effective because there's more of everything over time (in terms of both breadth and depth). And even if that weren't the case, I think saying “should” is too strong; whether or not someone enjoys having that kind of breadth is a matter of taste. Another approach that can also work, one that's more to my taste, is to, <a href="https://alumni.media.mit.edu/~cahn/life/gian-carlo-rota-10-lessons.html">as Gian Carlo Rota put it</a>, learn a few tricks:</p> <blockquote> <p>A long time ago an older and well known number theorist made some disparaging remarks about Paul Erdos' work. You admire contributions to mathematics as much as I do, and I felt annoyed when the older mathematician flatly and definitively stated that all of Erdos' work could be reduced to a few tricks which Erdos repeatedly relied on in his proofs. What the number theorist did not realize is that other mathematicians, even the very best, also rely on a few tricks which they use over and over. Take Hilbert. The second volume of Hilbert's collected papers contains Hilbert's papers in invariant theory. I have made a point of reading some of these papers with care. It is sad to note that some of Hilbert's beautiful results have been completely forgotten. But on reading the proofs of Hilbert's striking and deep theorems in invariant theory, it was surprising to verify that Hilbert's proofs relied on the same few tricks. 
Even Hilbert had only a few tricks!</p> </blockquote> <p>If you look at how people succeed in various fields, you'll see that this is a common approach. For example, <a href="https://judoinfo.com/weers1/">this analysis of world-class judo players found that most rely on a small handful of throws</a>, concluding<sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">1</a></sup></p> <blockquote> <p>Judo is a game of specialization. You have to use the skills that work best for you. You have to stick to what works and practice your skills until they become automatic responses.</p> </blockquote> <p>If you watch an anime or a TV series &quot;about&quot; fighting, people often improve by increasing the number of techniques they know because that's an easy thing to depict but, in real life, getting better at techniques you already know is often more effective than having a portfolio of hundreds of &quot;moves&quot;.</p> <p><a href="https://staffeng.com/stories/joy-ebertz" rel="nofollow">Relatedly, Joy Ebertz says</a>:</p> <blockquote> <p>One piece of advice I got at some point was to amplify my strengths. All of us have strengths and weaknesses and we spend a lot of time talking about ‘areas of improvement.’ It can be easy to feel like the best way to advance is to eliminate all of those. However, it can require a lot of work and energy to barely move the needle if it’s truly an area we’re weak in. Obviously, you still want to make sure you don’t have any truly bad areas, but assuming you’ve gotten that, instead focus on amplifying your strengths. How can you turn something you’re good at into your superpower?</p> </blockquote> <p>I've personally found this to be true in a variety of disciplines. While it's really difficult to measure programmer effectiveness in anything resembling an objective manner, this isn't true of some things I've done, like competitive video games (a very long time ago at this point, back before there was &quot;real&quot; money in competitive gaming), the thing that took me from being a pretty decent player to a <a href="look-stupid/#fn:B">very good player</a> was abandoning practicing things I wasn't particularly good at and focusing on increasing the edge I had over everybody else at the few things I was unusually good at.</p> <p>This can work for games and sports because you can get better maneuvering yourself into positions that take advantage of your strengths as well as avoiding situations that expose your weaknesses. I think this is actually more effective at work than it is in sports or gaming since, unlike in competitive endeavors, you don't have an opponent who will try to expose your weaknesses and force you into positions where your strengths are irrelevant. If I study queuing theory instead of compilers, a rival co-worker isn't going to stop me from working on projects where queuing theory knowledge is helpful and leave me facing a field full of projects that require compiler knowledge.</p> <p>One thing that's worth noting is that skills don't have to be things people would consider fields of study or discrete techniques. For the past three years, the main skill I've been applying and improving is something you might call &quot;looking at data&quot;; the term is in quotes because I don't know of a good term for it. I don't think it's what most people would think of as &quot;statistics&quot;, in that I don't often need to do anything as sophisticated as logistic regression, let alone actually sophisticated. 
Perhaps one could argue that this is something data scientists do, but if I look at what I do vs. what data scientists we hire do as well as what we screen for in data scientist interviews, we don't appear to want to hire data scientists with the skill I've been working on nor do they do what I'm doing (this is a long enough topic that I might turn it into its own post at some point).</p> <p>Unlike Matt Might or Steve Yegge, I'm not going to say that you should take a particular approach, but I'll say that working on a few things and not being particularly well rounded has worked for me in multiple disparate fields and it appears to work for a lot of other folks as well.</p> <p>If you want to take this approach, this still leaves the question of what skills to learn. This is one of the most common questions I get asked and I think my answer is probably not really what people are looking for and not very satisfying since it's both <a href="https://twitter.com/danluu/status/1428445465662603272">obvious and difficult to put into practice</a>.</p> <p>For me, two ingredients for figuring out what to spend time learning are having a relative aptitude for something (relative to other things I might do, not relative to other people) and also having a good environment in which to learn. To say that someone should look for those things is so vague that's it's nearly useless, but it's still better than the usual advice, which boils down to &quot;learn what I learned&quot;, which results in advice like &quot;Career pro tip: if you want to get good, REALLY good, at designing complex and stateful distributed systems at scale in real-world environments, learn functional programming. It is an almost perfectly identical skillset.&quot; or the even more extreme claims from some language communities, like Chuck Moore's claim that Forth is <a href="boring-languages/">at least 100x as productive as boring languages</a>.</p> <p>I took generic internet advice early in my career, including language advice (this was when much of this kind of advice was relatively young and it was not yet possible to easily observe that, despite many people taking advice like this, people who took this kind of advice were not particularly effective and people who are particularly effective were not likely to have taken this kind of advice). I learned <a href="https://twitter.com/sc13ts/status/1448003352655060997">Haskell, Lisp, Forth</a>, <a href="https://malisper.me/there-is-more-to-programming-than-programming-languages/">etc</a>. At one point in my career, I was on a two person team that implemented what might still be, a decade later, the highest performance Forth processor in existence (it was a 2GHz IPC-oriented processor) and I programmed it as well (there were good reasons for this to be a stack processor, so Forth seemed like as good a choice as any). <a href="https://yosefk.com/blog/my-history-with-forth-stack-machines.html">Like Yossi Kreinin, I think I can say that I spent more effort than most people have becoming proficient in Forth, and like him, not only did I not find it find it to be a 100x productivity tool, it wasn't clear that it would, in general, even be 1x on productivity</a>. 
To be fair, a number of other tools did better than 1x on productivity but, overall, I think following internet advice was very low ROI and the things that I learned that were high ROI weren't things people were recommending.</p> <p>In retrospect, when people said things like &quot;Forth is very productive&quot;, what I suspect they really meant was &quot;Forth makes me very productive and I have not considered how well this generalizes to people with different aptitudes or who are operating in different contexts&quot;. I find it totally plausible that Forth (or Lisp or Haskell or any other tool or technique) does work very well for some particular people, but I think that people tend to overestimate how much something working for them means that it works for other people, <a href="https://twitter.com/danluu/status/1355661542155378688">making advice generally useless because it doesn't distinguish between advice that's aptitude or circumstance specific and generalizable advice, which is in stark contrast to fields where people actually discuss the pros and cons of particular techniques</a><sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">2</a></sup>.</p> <p>While a coach can give you advice that's tailored to you 1 on 1 or in small groups, that's difficult to do on the internet, which is why the best I can do here is the uselessly vague &quot;pick up skills that are suitable for you&quot;. Just for example, two skills that clicked for me are &quot;having an adversarial mindset&quot; and &quot;looking at data&quot;. A perhaps less useless piece of advice is that, if you're having a hard time identifying what those might be, you can ask people who know you very well, e.g., my manager and Ben Kuhn independently named coming up with solutions that span many levels of abstraction as a skill of mine that I frequently apply (and I didn't realize I was doing that until they pointed it out).</p> <p>Another way to find these is to look for things you can't help but do that most other people don't seem to do, which is true for me of both &quot;looking at data&quot; and &quot;having an adversarial mindset&quot;. Just for example, on having an adversarial mindset, when a company I was working for was beta testing a new custom bug tracker, I filed some of the first bugs on it and put unusual things into the fields to see if it would break. Some people really didn't understand why anyone would do such a thing and were baffled, disgusted, or horrified, but a few people (including the authors, who I knew wouldn't mind), really got it and were happy to see the system pushed past its limits. Poking at the limits of a system to see where it falls apart doesn't feel like work to me; it's something that I'd have to stop myself from doing if I wanted to not do it, which made spending a decade getting better at testing and verification techniques felt like something hard not to do and not work. Looking deeply into data is one I've spent more than a decade on at this point and it's another one that, to me, emotionally feels almost wrong to not improve at.</p> <p>That these things are suited to me is basically due to my personality, and not something inherent about human beings. 
Other people are going to have different things that really feel easy/right for them, which is great, since if everyone was into looking at data and no one was into building things, that would be very problematic (although, IMO, looking at data is, on average, underrated).</p> <p>The other major ingredient in what I've tried to learn is finding environments that are conducive to learning things that line up with my skills that make sense for me. Although suggesting that other people do the same sounds like advice that's so obvious that it's useless, based on how I've seen people select what team and company to work on, I think that almost nobody does this and, as a result, discussing this may not be completely useless.</p> <p>An example of not doing this which typifies what I usually see is a case I just happened to find out about because I chatted with a manager about why their team had lost their new full-time intern conversion employee. I asked them about it since it was unusual for that manager to lose anyone since they're very good at retaining people and have low turnover on their teams. It turned out that their intern had wanted to work on infra, but had joined this manager's product team because they didn't know that they could ask to be on a team that matched their preferences. After the manager found out, the manager wanted the intern to be happy and facilitated a transfer to an infra team. In this case, this was a double whammy since the new hire doubly didn't consider working in an environment conducive for learning the skills they wanted. They made no attempt to work in the area they were interested in and then they joined a company that has a dysfunctional infra org that generally has poor design and operational practices, making the company a relatively difficult place to learn about infra on top of not even trying to land on an infra team. While that's an unusually bad example, in the median case that I've seen, people don't make decisions that result in particularly good outcomes with respect to learning even though good opportunities to learn are one of the top things people say that they want.</p> <p>For example, Steve Yegge has noted:</p> <blockquote> <p>The most frequently-asked question from college candidates is: &quot;what kind of training and/or mentoring do you offer?&quot; ... One UW interviewee just told me about Ford Motor Company's mentoring program, which Ford had apparently used as part of the sales pitch they do for interviewees. [I've elided the details, as they weren't really relevant. -stevey 3/1/2006] The student had absorbed it all in amazing detail. That doesn't really surprise me, because it's one of the things candidates care about most.</p> </blockquote> <p>For myself, I was lucky that my first job, Centaur, was a great place to develop having an adversarial mindset with respect to testing and verification. When I compare what the verification team there accomplished, it's comparable to peer projects at other companies that employed much larger teams to do very similar things with similar or worse effectiveness, implying that the team was highly productive, which made that a really good place to learn.</p> <p>Moreover, I don't think I could've learned as quickly on my own or by trying to follow advice from books or the internet. 
I think that <a href="hardware-unforgiving/">people who are really good at something have too many bits of information in their head about how to do it for that information to really be compressible into a book, let alone a blog post</a>. In sports, good coaches are able to convey that kind of information over time, but I don't know of anything similar for programming, so I think the best thing available for learning rate is to find an environment that's full of experts<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>.</p> <p>For &quot;looking at data&quot;, while I got a lot better at it from working on that skill in environments where people weren't really taking data seriously, the rate of improvement during the past few years, where I'm in an environment where I can toss ideas back and forth with people who are very good at understanding the limitations of what data can tell you as well as good at informing data analysis with deep domain knowledge, has been much higher. I'd say that I improved more at this in each individual year at my current job than I did in the decade prior to my current job.</p> <p>One thing to perhaps note is that the environment, how you spend your day-to-day, is inherently local. My current employer is probably the least data driven of the three large tech companies I've worked for, but my vicinity is a great place to get better at looking at data because I spend a relatively large fraction of my time working with people who are great with data, like Rebecca Isaacs, and a relatively small fraction of the time working with people who don't take data seriously.</p> <p>This post has discussed some strategies with an eye towards why they can be valuable, but I have to admit that my motivation for learning from experts wasn’t to create value. It's more that I find learning to be fun and there are some areas where I'm motivated enough to apply the skills regardless of the environment, and learning from experts is such a great opportunity to have fun that it's hard to resist. Doing this for a couple of decades has turned out to be useful, but that's not something I knew would happen for quite a while (and I had no idea that this would effectively transfer to a new industry until I changed from hardware to software).</p> <p>A lot of career advice I see is oriented towards career or success or growth. That kind of advice often tells people to have a long-term goal or strategy in mind. It will often have some argument that's along the lines of &quot;a random walk will only move you sqrt(n) in some direction whereas a directed walk will move you n in some direction&quot;. I don't think that's wrong, but I think that, for many people, that advice implicitly underestimates the difficulty of finding an area that's suited to you<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup>, which I've basically <a href="https://twitter.com/jeanqasaur/status/1074528356324892672">done by trial and error</a>.</p> <h3 id="appendix-parts-of-the-problem-this-post-doesn-t-discuss-in-detail">Appendix: parts of the problem this post doesn't discuss in detail</h3> <p>One major topic not discussed is how to balance what &quot;level&quot; of skill to work on, which could be something high level, like &quot;looking at data&quot;, to something lower level, like &quot;Bayesian multilevel models&quot;, to something even lower level, like &quot;typing speed&quot;. 
That's a large enough topic that it deserves its own post that I'd expect to be longer than this one but, for now, <a href="productivity-velocity/#appendix-one-way-to-think-about-what-to-improve">here's a comment from Gary Bernhardt about something related that I believe also applies to this topic</a>.</p> <p>Another major topic that's not discussed here is picking skills that are relatively likely to be applicable. It's a little too naive to just say that someone should think about learning skills they have an aptitude for without thinking about applicability.</p> <p>But while it's pretty easy to pick out skills where it's very difficult to either have an impact on the world or make a decent amount of money or achieve whatever goal you might want to achieve, like &quot;basketball&quot; or &quot;boxing&quot;, it's harder to pick between plausible skills, like computer architecture vs. PL.</p> <p>But I think semi-reasonable sounding skills are likely enough to be high return if they're a good fit for someone that trial and error among semi-reasonable sounding skills is fine, although it probably helps <a href="productivity-velocity/">to be able to try things out quickly</a></p> <h3 id="appendix-related-posts">Appendix: related posts</h3> <ul> <li>Ben Kuhn on, in some sense, <a href="https://www.benkuhn.net/conviction/">what it's like to really learn something</a></li> <li>Holden Karnofsky on <a href="https://80000hours.org/podcast/episodes/holden-karnofsky-building-aptitudes-kicking-ass/">having an aptitude-first approach to careers instead of a career-path-first approach</a>, which is sort of analogous to thinking about cross cutting skills like &quot;looking at data&quot; or &quot;having an adversarial mindset&quot; and not just thinking about skills like &quot;compilers&quot; or &quot;queuing theory&quot;</li> <li>Peter Drucker on <a href="https://www.csub.edu/~ecarter2/CSUB.MKTG%20490%20F10/DRUCKER%20HBR%20Managing%20Oneself.pdf">how to understand one's strengths and weaknesses and do work that compatible with ones own inclinations</a></li> <li>Alexy Guzey on <a href="https://guzey.com/advice/">the effectiveness of advice</a></li> <li>Edward Kmett with <a href="https://www.youtube.com/watch?v=Z8KcCU-p8QA">another perspective on how to think about learning</a></li> <li>Patrick Collison <a href="https://patrickcollison.com/advice">on how to maximize useful learning and find what you'll enjoy</a></li> </ul> <p><small>Thanks to Ben Kuhn, Alexey Guzey, Marek Majkowski, Nick Bergson-Shilcock, @bekindtopeople2, Aaron Levin, Milosz Danczak, Anja Boskovic, John Doty, Justin Blank, Mark Hansen, &quot;wl&quot;, and Jamie Brandon for comments/corrections/discussion.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:J">This is an old analysis. If you were to do one today, you'd see a different mix of throws, but it's still the case that you see specialists having a lot of success, e.g., Riner with osoto gari <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> <li id="fn:F">To be fair to blanket, context free, advice, to learn a particular topic, functional programming really clicked for me and I could imagine that, if that style of thinking wasn't already natural for me (as a result of coming from a hardware background), the advice that one should learn functional programming because it will change how you think about problems might've been useful for me, but on the other hand, that means that the advice could've just as easily been to learn hardware engineering. 
<a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:M"><p>I don't have a large enough sample nor have I polled enough people to have high confidence that this works as a general algorithm but, for finding groups of world-class experts, what's worked for me is finding excellent managers. The two teams I worked on with the highest density of world-class experts have been teams under really great management. I have a higher bar for excellent management than most people and, from having talked to many people about this, almost no one I've talked to has worked for or even knows a manager as good as one I would consider to be excellent (and, general, both the person I'm talking to agrees with me on this, indicating that it's not the case that they have a manager who's excellent in dimensions I don't care about and vice versa); from discussions about this, I would guess that a manager I think of as excellent is at least 99.9%-ile. How to find such a manager is a long discussion that I might turn into another post.</p> <p>Anyway, despite having a pretty small sample on this, I think the mechanism for this is plausible, in that the excellent managers I know have very high retention as well as a huge queue of people who want to work for them, making it relatively easy for them to hire and retain people with world-class expertise since <a href="hiring-lemons/">the rest of the landscape is so bleak</a>.</p> <p>A more typical strategy, one that I don't think generally works and also didn't work great for me when I tried it is to work on the most interesting sounding and/or hardest problems around. While I did work with some really great people while trying to <a href="https://www.benkuhn.net/hard/">work on interesting / hard problems</a>, including one of the best engineers I've ever worked with, I don't think that worked nearly as well as looking for good management w.r.t. working with people I really want to learn from. I believe the general problem with this algorithm is the same problem with going to work in video games because video games are cool and/or interesting. The fact that so many people want to work on exciting sounding problems leads to dysfunctional environments that can persist indefinitely.</p> <p>In one case, I was on a team that had 100% turnover in nine months and it would've been six if it hadn't taken so long for one person to find a team to transfer to. In the median case, my cohort (people who joined around when I joined, ish) had about 50% YoY turnover and I think that people had pretty good reasons for leaving. Not only is this kind of turnover a sign that the environment is often a pretty unhappy one, these kinds of environments often differentially cause people who I'd want to work with and/or learn from to leave. For example, on the team I was on where the TL didn't believe in using version control, automated testing, or pipelined designs, I worked with Ikhwan Lee, who was great. Of course, Ikhwan left pretty quickly while the TL stayed and is still there six years later.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:S">Something I've seen many times among my acquaintances is that people will pick a direction before they have any idea whether or not it's suitable for them. 
Often, after quite some time (more than a decade in some cases), they'll realize that they're actually deeply unhappy with the direction they've gone, sometimes because it doesn't match their temperament, and sometimes because it's something they're actually bad at. In any case, wandering around randomly and finding yourself sqrt(n) down a path you're happy with doesn't seem so bad compared to having made it n down a path you're unhappy with. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> Some reasons to work on productivity and velocity productivity-velocity/ Fri, 15 Oct 2021 00:00:00 +0000 productivity-velocity/ <p>A common topic of discussion among my close friends is where the bottlenecks are in our productivity and how we can execute more quickly. This is very different from what I see in my extended social circles, where people commonly say that <a href="https://twitter.com/danluu/status/1440106603093495810">velocity doesn't matter</a>. In online discussions about this, I frequently see people go a step further and assign moral valence to this, saying that it is actually bad to try to increase velocity or be more productive or work hard (see appendix for more examples).</p> <p>The top reasons I see people say that productivity doesn't matter (or is actually bad) fall into one of three buckets:</p> <ul> <li>Working on the right thing is more important than working quickly</li> <li>Speed at X doesn't matter because you don't spend much time doing X</li> <li>Thinking about productivity is bad and you should &quot;live life&quot;</li> </ul> <p>I certainly agree that working on the right thing is important, but increasing velocity doesn't stop you from working on the right thing. If anything, each of these is a force multiplier for the other. Having strong execution skills becomes more impactful if you're good at picking the right problem and vice versa.</p> <p>It's true that the gains from picking the right problem can be greater than the gains from having better tactical execution because the gains from picking the right problem can be unbounded, but it's also much easier to improve tactical execution and doing so also helps with picking the right problem because having faster execution lets you experiment more quickly, which helps you find the right problem.</p> <p>A concrete example of this is a project I worked on to quantify the machine health of the fleet. The project discovered a number of serious issues (a decent fraction of hosts were actively corrupting data or had a performance problem that would increase tail latency by &gt; 2 orders of magnitude, or both). This was considered serious enough that a new team was created to deal with the problem.</p> <p>In retrospect, my first attempts at quantifying the problem were doomed and couldn't have really worked (or not in a reasonable amount of time, anyway). I spent a few weeks cranking through ideas that couldn't work and a critical part of getting to the idea that did work after &quot;only&quot; a few weeks was being able to quickly try out and discard ideas that didn't work. In part of a previous post, I described how long a tiny part of that process took and multiple people objected to that being impossibly fast in internet comments.</p> <p>I find this a bit funny since I'm not a naturally quick programmer. <a href="learning-to-program/">Learning to program was a real struggle for me</a> and I was pretty slow at it for a long time (and I still am in aspects that I haven't practiced). 
My &quot;one weird trick&quot; is that I've explicitly worked on speeding up things that I do frequently and most people have not. I view the situation as somewhat analogous to sports before people really trained. For a long time, many athletes didn't seriously train, and then once people started trying to train, the training was often misguided by modern standards. For example, if you read commentary on baseball from the 70s, you'll see people saying that baseball players shouldn't weight train because it will make them &quot;muscle bound&quot; (many people thought that weight lifting would lead to &quot;too much&quot; bulk, causing people to be slower, have less explosive power, and be less agile). But today, players get a huge advantage from using performance-enhancing drugs that increase their muscle-bound-ness, which implies that players could not get too &quot;muscle bound&quot; from weight training alone. An analogous comment to one discussed above would be saying that athletes shouldn't worry about power/strength and should increase their skill, but power increases returns to skill and vice versa.</p> <p>Coming back to programming, if you explicitly practice and train and almost no one else does, you'll be able to do things relatively quickly compared to most people even if, like me, you don't have much talent for programming and getting started at all was a real struggle. Of course, there's always going to be someone more talented out there who's executing faster after having spent less time improving. But, luckily for me, <a href="p95-skill/">relatively few people seriously attempt to improve</a>, so I'm able to do ok.</p> <p>Anyway, despite operating at a rate that some internet commenters thought was impossible, it took me weeks of dead ends to find something that worked. If I was doing things at a speed that people thought was normal, I suspect it would've taken long enough to find a feasible solution that I would've dropped the problem after spending maybe one or two quarters on it. The number of plausible-ish seeming dead ends was probably not unrelated to why the problem was still an open problem despite being a critical issue for years. Of course, someone who's better at having ideas than me could've solved the problem without the dead ends, but as we discussed earlier, it's fairly easy to find low hanging fruit on &quot;execution speed&quot; and not so easy to find low hanging fruit on &quot;having better ideas&quot;. However, it's possible to, to a limited extent, simulate someone who has better ideas than me by being able to quickly try out and discard ideas (I also work on having better ideas, but I think it makes sense to go after the easier high ROI wins that are available as well). Being able to try out ideas quickly also improves the rate at which I can improve at having better ideas since a key part of that is building intuition by getting feedback on what works.</p> <p>The next major objection is that speed at a particular task doesn't matter because time spent on that task is limited. At a high level, I don't agree with this objection because, while this may hold true for any particular kind of task, the solution to that is to try to improve each kind of task and not to reject the idea of improvement outright. A sub-objection people have is something like &quot;but I spend 20 hours in unproductive meetings every week, so it doesn't matter what I do with my other time&quot;. 
I think this is doubly wrong, in that if you then only have 20 hours of potentially productive time, whatever productivity multiplier you have on that time still holds for your general productivity. Also, it's generally possible to drop out of meetings that are a lost cause and increase the productivity of meetings that aren't a lost cause<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>.</p> <p>More generally, when people say that optimizing X doesn't help because they don't spend time on X and are not bottlenecked on X, that doesn't match my experience as I find I spend plenty of time bottlenecked on X for commonly dismissed Xs. I think that part of this is because getting faster at X can actually increase time spent on X due to a sort of virtuous feedback loop in where it makes sense to spend time. Another part of this is illustrated in this comment by Fabian Giesen:</p> <blockquote> <p>It is commonly accepted, verging on a cliche, that you have no idea where your program spends time until you actually profile it, but the corollary that you also don't know where <em>you</em> spend your time until you've measured it is not nearly as accepted.</p> </blockquote> <p>When I've looked at how people actually spend time vs. how they think they spend it, their estimates are wildly inaccurate, and I think there's a fundamental reason that, unless they measure, people's estimates of how they spend their time tend to be way off, which is nicely summed up by another Fabian Giesen quote, which happens to be about solving Rubik's cubes but applies to other cognitive tasks:</p> <blockquote> <p>Paraphrasing a well-known cuber, &quot;your own pauses never seem bad while you're solving, because your brain is busy and you know what you're thinking about, but once you have a video it tends to become blindingly obvious what you need to improve&quot;. Which is pretty much the usual &quot;don't assume, profile&quot; advice for programs, but applied to a situation where you're concentrated and busy for the entire time, whereas the default assumption in programming circles seems to be that as long as you're actually doing work and not distracted or slacking off, you can't possibly be losing a lot of time</p> </blockquote> <p>Unlike most people who discuss this topic online, I've actually looked at where my time goes and a lot of it goes to things that are canonical examples of things that you shouldn't waste time improving because people don't spend much time doing them.</p> <p>An example of one of these, the most commonly cited bad-thing-to-optimize example that I've seen, is typing speed (when discussing this, people usually say that typing speed doesn't matter because more time is spent thinking than typing). But, when I look at where my time goes, a lot of it is spent typing.</p> <p>A specific example is that I've written a number of influential docs at my current job and when people ask how long some doc took to write, they're generally surprised that the doc only took a day to write. As with the machine health example, a thing that velocity helps with is figuring out which docs will be influential. If I look at the docs I've written, I'd say that maybe 15% were really high impact (caused a new team to be created, changed the direction of existing teams, resulted in significant changes to the company's bottom line, etc.). 
Part of it is that I don't always know which ideas will resonate with other people, but part of it is also that I often propose ideas that are long shots because the ideas sound too stupid to be taken seriously (e.g., one of my proposed solutions to a capacity crunch was to, for each rack, turn off 10% of it, thereby increasing effective provisioned capacity, which is about as stupid sounding an idea as one could come up with). If I was much slower at writing docs, it wouldn't make sense to propose real long shot ideas. As things are today, if I think an idea has a 5% chance of success, in expectation, I need to spend ~20 days writing docs to have one of those land.</p> <p>I spend roughly half my writing time typing. If I typed at what some people say is the median typing speed (40 WPM) instead of the rate some random typing test clocked me at (110 WPM), this would be a 0.5 + 0.5 * 110/40 = 1.875x slowdown, putting me at nearly 40 days of writing before a longshot doc lands, which would make that a sketchier proposition. If I hadn't optimized the non-typing part of my writing workflow as well, I think I would be, on net, maybe 10x slower<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>, which would put me at more like ~200 days per high impact longshot doc, which is enough that I think I probably wouldn't write longshot docs<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">3</a></sup>.</p>
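<p>As an aside, here's the arithmetic above written out as a small, self-contained sketch. The numbers in it (the 110 WPM and 40 WPM typing speeds, the roughly 50/50 split between typing and thinking, the one-day doc, and the 5% chance that a long shot lands) are just the figures quoted in this post, not measurements of anyone else's workflow, and the script doesn't model anything beyond that:</p> <pre><code class="language-scala">object DocWritingArithmetic extends App {
  // All of these numbers come from the post itself; they're inputs, not results.
  val typingShare = 0.5   // roughly half of writing time goes to typing
  val fastWpm     = 110.0 // typing-test speed quoted in the post
  val slowWpm     = 40.0  // the "median" typing speed some people quote
  val daysPerDoc  = 1.0   // a doc takes about a day at current speed
  val pDocLands   = 0.05  // a long-shot doc lands ~5% of the time

  // Only the typing half of the time scales with typing speed; the thinking half doesn't.
  val slowdown = (1 - typingShare) + typingShare * (fastWpm / slowWpm) // 0.5 + 0.5 * 2.75 = 1.875

  val daysPerLandedDocNow    = daysPerDoc / pDocLands            // ~20 days
  val daysPerLandedDocSlower = daysPerDoc * slowdown / pDocLands // ~37.5 days

  println(f"slowdown if typing at 40 WPM: $slowdown%.3fx")
  println(f"expected writing days per landed long-shot doc: $daysPerLandedDocNow%.0f now, $daysPerLandedDocSlower%.1f at 40 WPM")
}
</code></pre> <p>The point of writing it down isn't the output (you can do this in your head); it's that the &quot;is a long-shot doc still worth writing?&quot; question is driven almost entirely by the slowdown factor, which is why the 10x figure in the footnote changes the answer so much.</p>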
<p>More generally, Fabian Giesen has noted that this kind of non-linear impact of velocity is common:</p> <blockquote> <p>There are &quot;phase changes&quot; as you cross certain thresholds (details depend on the problem to some extent) where your entire way of working changes. ... There's a lot of things I could in theory do at any speed but in practice cannot, because as iteration time increases it first becomes so frustrating that I can't do it for long and eventually it takes so long that it literally drops out of my short-term memory, so I need to keep notes or otherwise organize it or I can't do it at all.</p> <p>Certainly if I can do an experiment in an interactive UI by dragging on a slider and see the result in a fraction of a second, at that point it's very &quot;no filter&quot;, if you want to try something you just do it.</p> <p>Once you're at iteration times in the low seconds (say a compile-link cycle with a statically compiled lang) you don't just try stuff anymore, you also spend time thinking about whether it's gonna tell you anything because it takes long enough that you'd rather not waste a run.</p> <p>Once you get into several-minute or multi-hour iteration times there's a lot of planning to not waste runs, and context switching because you do other stuff while you wait, and note-taking/bookkeeping; also at this level mistakes are both more expensive (because a wasted run wastes more time) and more common (because your attention is so divided).</p> <p>As you scale that up even more you might now take significant resources for a noticeable amount of time and need to get that approved and budgeted, which takes its own meetings etc.</p> </blockquote> <p>A specific example of something moving from one class of item to another in my work was <a href="metrics-analytics/">this project on metrics analytics</a>. There were a number of proposals on how to solve this problem. There was broad agreement that the problem was important with no dissenters, but the proposals were all the kinds of things you'd allocate a team to work on through multiple roadmap cycles. Getting a project that expensive off the ground requires a large amount of organizational buy-in, enough that many important problems don't get solved, including this one. But it turned out that, if scoped properly and executed reasonably, the project was actually something a programmer could create an MVP of in a day, which takes no organizational buy-in to get off the ground. Instead of needing to get multiple directors and a VP to agree that the problem is among the org's most important problems, you just need a person who thinks the problem is worth solving.</p> <p>Going back to Xs where people say velocity doesn't matter because they don't spend a lot of time on X, another one I see frequently is coding, and it is also not my personal experience that coding speed doesn't matter. For the machine health example discussed above, after I figured out something that would work, I spent one month working on basically nothing but that, coding, testing, and debugging. I think I had about 6 hours of meetings during that month, but other than that plus time spent eating, etc., I would go in to work, code all day, and then go home. I think it's much more difficult to compare coding speed across people because it's rare to see people do the same or very similar non-trivial tasks, so I won't try to compare to anyone else, but if I look at my productivity before I worked on improving it as compared to where I'm at now, the project probably would have been infeasible without the speedups I've found by looking at my velocity.</p> <p><a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> based arguments can make sense when looking for speedups in a fixed benchmark, like a sub-task of SPECint, but when you have a system where getting better at a task increases returns to doing that task and can increase time spent on the task, it doesn't make sense to say that you shouldn't work on getting faster at something just because you don't currently spend a lot of time doing it. I spend time on things that are high ROI, but those things are generally only high ROI because I've spent time improving my velocity, which reduces the &quot;I&quot; in ROI.</p> <p>The last major argument I see against working on velocity assigns negative moral weight to the idea of thinking about productivity and working on velocity at all. This kind of comment often assigns positive moral weight to various kinds of leisure, such as spending time with friends and family. I find this argument to be backwards. If someone thinks it's important to spend time with friends and family, an easy way to do that is to be more productive at work and spend less time working.</p> <p>Personally, I deliberately avoid working long hours and I suspect I don't work more than the median person at my company, which is a company where I think work-life balance is pretty good overall. A lot of my productivity gains have gone to leisure and not work. 
Furthermore, deliberately working on velocity has <a href="https://twitter.com/danluu/status/1444034823329177602">allowed me to get promoted relatively quickly</a><sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">4</a></sup>, which means that I make more money than I would've made if I didn't get promoted, which gives me more freedom to spend time on things that I value.</p> <p>For people who aren't arguing that you shouldn't think about productivity because it's better to focus on leisure and instead argue that you simply shouldn't think about productivity at all because it's unnatural and one should live a natural life, that ultimately comes down to personal preference, but for me, I value the things I do outside of work too much to not explicitly work on productivity at work.</p> <p>As with <a href="why-benchmark/">this post on reasons to measure</a>, while this post is about practical reasons to improve productivity, the main reason I'm personally motivated to work on my own productivity isn't practical. The main reason is that I enjoy the process of getting better at things, whether that's some nerdy board game, a sport I have zero talent at that will never have any practical value to me, or work. For me, a secondary reason is that, given that my lifespan is finite, I want to allocate my time to things that I value, and increasing productivity allows me to do more of that, but that's not a thought I had until I was about 20, at which point I'd already been trying to improve at most things I spent significant time on for many years.</p> <p>Another common reason for working on productivity is that mastery and/or generally being good at something seems satisfying for a lot of people. That's not one that resonates with me personally, but when I've asked other people about why they work on improving their skills, that seems to be a common motivation.</p> <p>A related idea, one that Holden Karnofsky has been talking about for a while, is that if you ever want to make a difference in the world in some way, it's useful to work on your skills even in jobs where it's not obvious that being better at the job is useful, because the developed skills will give you more leverage on the world when you switch to something that's more aligned with what you want to achieve.</p> <h3 id="appendix-one-way-to-think-about-what-to-improve">Appendix: one way to think about what to improve</h3> <p>Here's a framing I like from Gary Bernhardt (not set off in a quote block since this entire section, other than this sentence, is his).</p> <p>People tend to fixate on a single granularity of analysis when talking about efficiency. E.g., &quot;thinking is the most important part so don't worry about typing speed&quot;. If we step back, the response to that is &quot;efficiency exists at every point on the continuum from year-by-year strategy all the way down to millisecond-by-millisecond keystrokes&quot;. I think it's safe to assume that gains at the larger scale will have the biggest impact. But as we go to finer granularity, it's not obvious where the ROI drops off. Some examples, moving from coarse to fine:</p> <ol> <li>The macro point that you started with is: programming isn't just thinking; it's thinking plus tactical activities like editing code. Editing faster means more time for thinking.</li> <li>But editing code costs more than just the time spent typing! Programming is highly dependent on short-term memory. 
Every pause to edit is a distraction where you can forget the details that you're juggling. Slower editing effectively weakens your short-term memory, which reduces effectiveness.</li> <li>But editing code isn't just hitting keys! It's hitting keys plus the editor commands that those keys invoke. A more efficient editor can dramatically increase effective code editing speed, even if you type at the same WPM as before.</li> <li>But each editor command doesn't exist in a vacuum! There are often many ways to make the same edit. A Vim beginner might type &quot;hhhhxxxxxxxx&quot; when &quot;bdw&quot; is more efficient. An advanced Vim user might use &quot;bdw&quot;, not realizing that it's slower than &quot;diw&quot; despite having the same number of keystrokes. (In QWERTY keyboard layout, the former is all on the left hand, whereas the latter alternates left-right-left hands. At 140 WPM, you're typing around 14 keystrokes per second, so each finger only has 70 ms to get into position and press the key. Alternating hands leaves more time for the next finger to get into position while the previous finger is mid-keypress.)</li> </ol> <p>We have to choose how deep to go when thinking about this. I think that there's clear ROI in thinking about 1-3, and in letting those inform both tool choice and practice. I don't think that (4) is worth a lot of thought. It seems like we naturally find &quot;good enough&quot; points there. But that also makes it a nice fence post to frame the others.</p> <h3 id="appendix-more-examples">Appendix: more examples</h3> <ul> <li><a href="https://twitter.com/b0rk/status/1367172498954059791">Velocity doesn't matter</a>, from Julia Evans, who I believe has been the most widely read programming blogger since about 2015</li> <li><a href="https://news.ycombinator.com/item?id=10529064">In the comments on a post where Ben Kuhn notes that he got 50% more productive by allocating his time better, people are nearly uniformly negative about the post and say that he works too much</a>. Although Ben clarified in multiple comments as well as in the post that not all time tracked was worked, the commenters are too busy taking the moral high ground to actually respond to the contents of the post</li> <li><a href="https://news.ycombinator.com/item?id=28879240">Comments on Jamie Brandon's &quot;Speed Matters&quot;</a> <ul> <li><a href="https://news.ycombinator.com/item?id=28880190">Working quickly is pointless because you will be forced to do more work</a></li> <li><a href="https://news.ycombinator.com/item?id=28881360">Speed doesn't matter if you're doing the right thing, and also, if such a thing as speed did exist, it would be unmeasurable and therefore pointless to discuss</a></li> <li><a href="https://news.ycombinator.com/item?id=28879823">Thinking about productivity is unhealthy. One should relax instead</a></li> <li><a href="https://news.ycombinator.com/item?id=28881320">You can only choose 2 of &quot;good, fast, cheap&quot;, therefore it is counterproductive to work on speed</a></li> <li><a href="https://news.ycombinator.com/item?id=28880653">A large speedup is impossible</a></li> <li><a href="https://news.ycombinator.com/item?id=28880173">&quot;The author mistakes coding for typing&quot;</a></li> <li>etc.</li> <li>As with Ben's post, virtually all of these comments are addressed in the post itself. 
I'm going to stop noting when this is true because it is generally true of the posts referred to here.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=22255996">The #3 comment on a post by Michael Malis on &quot;How to Improve Your Productivity as a Working Programmer&quot;</a>: &quot;Fuck it, the entire work environment seems designed to decrease productivity . . . Why should I bother . . .&quot; <ul> <li>#4 comment: &quot;What if I don't want to improve my productivity ? Just take time.&quot; <ul> <li>After the initial indignation, this comment goes on and proves that the commenter missed the point entirely, as the rest of the comment explains how the commenter works productively, which the commenter apparently is ok with as long as it's not phrased as a way to work productively, because one is supposed to be morally outraged by someone wanting to be productive and sharing techniques about how to be productive with other people who might be interested in being productive</li> <li>In the responses, someone points out that someone who's more productive would be able to spend more time on leisure; that comment is uniformly panned because &quot;work expands so as to fill the time available for its completion&quot;, as if how one spends time is some sort of immutable law of nature and not something under anyone's control</li> </ul></li> <li>Another comment: &quot;Alright. What are we optimizing for? Productivity? Or the end-goals of any of: achieving more, climbing the corporate ladder, making more money, etc..?&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=13752887">Comments on a post by antirez about productivity</a> <ul> <li><a href="https://news.ycombinator.com/item?id=13753443">The article is talking about the 10x programmer universe, not the normal universe most people live in</a></li> <li><a href="https://news.ycombinator.com/item?id=13753611">It's pointless to work on productivity since your environment determines productivity</a></li> <li><a href="https://news.ycombinator.com/item?id=13753465">Productive programmers are selfish, don't mentor, etc., and are bad for their teams because their increased productivity always comes from neglecting more important things, so anyone who's productive as a programmer is actually counterproductive for the team</a> <ul> <li>If you read all of the comments on the post, you'll see that this is a common theme</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=20737304">Comments on Alexey Guzey's thoughts on productivity</a> <ul> <li><a href="https://news.ycombinator.com/item?id=20737854">&quot;Serious question: Is anything less productive than reading other people's productivity thoughts? It's a combination of procrastination and finding out what works for someone who is presumably more productive than you (ie: guilt).&quot;</a></li> <li><a href="https://news.ycombinator.com/item?id=20737854">An anti-productivity article titled &quot;Against Productivity&quot;</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=9715810">Typing speed doesn't matter because you only spend 0.5% to 1% of your time typing</a> <ul> <li>Despite the talk about 8-hour work days, I think people who get 4 hours of real work in a day are generally considered extremely productive. 
0.5% to 1% of 4 hours is 1.2 minutes to 2.4 minutes a day or, for someone who types 100 WPM, 120 to 240 total words <a href="https://thorstenball.com/blog/2020/09/01/typing-can-be-the-bottleneck/">across slack, JIRA, email, actual code, commit messages, design docs, comments on design docs, documentation, etc.</a>; I don't believe I know any professional programmers who type that little</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38797640">&quot;I feel like there is a correlation between fast-twitch programming muscles and technical debt . . . but we were all young once, I remember thinking the only thing holding me back was 4.77MHz&quot;</a>, a comment on a blog post benchmarking build times on different machines (where the post has nothing resembling the idea that the only thing holding back developers is build times)</li> </ul> <p>etc.</p> <p>Some positive examples of people who have used their productivity to &quot;fund&quot; things that they value include Andy Kelley (Zig), Jamie Brandon (various), Andy Matuschak (mnemonic medium, various), Saul Pwanson (VisiData), Andy Chu (Oil Shell). I'm drawing from programming examples, but you can find plenty of others, e.g., Nick Adnitt (<a href="https://darksidecanoes.wordpress.com/">Darkside Canoes</a>) and, of course, numerous people who've retired to pursue interests that aren't work-like at all.</p> <h3 id="appendix-another-reason-to-avoid-being-productive">Appendix: another reason to avoid being productive</h3> <p>An idea that's become increasingly popular in my extended social circles at major tech companies is that one should avoid doing work and <a href="https://www.reddit.com/r/antiwork/comments/pvjc6f/they_dont_give_a_fuck_about_you/">waste as much time as possible</a>, often called &quot;antiwork&quot;, which seems like a natural extension of &quot;tryhard&quot; becoming an insult. The reason given is often something like: work mainly enriches upper management at your employer and/or shareholders, who are generally richer than you.</p> <p>I'm sympathetic to the argument and <a href="https://twitter.com/danluu/status/802971209176477696">agree that upper management and shareholders capture most of the value from work</a>. But as much as I sympathize with the idea of deliberately being unproductive to &quot;stick it to the man&quot;, I value spending my time on things that I want enough that I'd rather get my work done quickly so I can do things I enjoy more than work. Additionally, having been productive in the past has given me good options for jobs, so I have work that I enjoy a lot more than my acquaintances in tech who have embraced the &quot;antiwork&quot; movement.</p> <p>The less control you have over your environment, the more it makes sense to embrace &quot;antiwork&quot;. Programmers at major tech companies have, relatively speaking, a lot of control over their environment, which is why I'm not &quot;antiwork&quot; even though I'm sympathetic to the cause.</p> <p>Although it's about a different topic, there's a related comment <a href="https://twitter.com/PracheeAC/status/1448789430488092672">from Prachee Avasthi about how avoiding controversial work and avoiding pushing for necessary changes when pre-tenure ingrains habits that are hard to break post-tenure</a>. 
If one wants to be &quot;antiwork&quot; forever, that's not a problem, but if one wants to move the needle on something at some point, building &quot;antiwork&quot; habits while working for a major tech company will instill counterproductive habits.</p> <p><small> Thanks to Fabian Giesen, Gary Bernhardt, Ben Kuhn, David Turner, Marek Majkowski, Anja Boskovic, Aaron Levin, Lifan Zeng, Justin Blank, Heath Borders, Tao L., Nehal Patel, @chozu@fedi.absturztau.be, Alex Allain, and Jamie Brandon for comments/corrections/discussion</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M">When I look at the productiveness of meetings, there are some people who are very good at keeping meetings on track and useful. For example, one person who I've been in meetings with who is extraordinarily good at ensuring meetings are productive is Bonnie Eisenman. Early on in my current job, I asked her how she was so effective at keeping meetings productive and have been using that advice since then (I'm not nearly as good at it as she is, but even so, improving at this was a significant win for me). <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:T"><p>10x might sound like an implausibly large speedup on writing, but in a discussion on writing speed on a private slack, a well-known newsletter author mentioned that their net writing speed for a 5k word newsletter was a little under 2 words per minute (WPM). My net rate (including time spent editing, etc.) is over 20 WPM per doc.</p> <p>With a measured typing speed of 110 WPM, that might sound like I spend a small fraction of my time typing, but it turns out it's roughly half the time. If I look at my writing speed, it's much slower than my typing test speed and it seems that it's perhaps half the rate. If I look at where the actual time goes, roughly half of it goes to typing and half goes to thinking, semi-serially, which creates long pauses in my typing.</p> <p>If I look at where the biggest win here could come, it would be from thinking and typing in parallel, which is something I'd try to achieve by practicing typing more, not less. But even without being able to do that, and with above average typing speed, I still spend half of my time typing!</p> <p>The reason my net speed is well under the speed that I write is that I do multiple passes and re-write. Some time is spent reading as I re-write, but I read much more quickly than I write, so that's a pretty small fraction of time. In principle, I could adopt an approach that involves less re-writing, but I've tried a number of things that one might expect would lead to that goal and haven't found one that works for me (yet?).</p> <p>Although the example here is about work, this also holds for my personal blog, where my velocity is similar. If I wrote ten times slower than I do, I don't think I'd have much of a blog. My guess is that I would've written a few posts or maybe even a few drafts and not gotten to the point where I'd post and then stop.</p> <p>I enjoy writing and get a lot of value out of it in a variety of ways, but I value the other things in my life enough that I don't think writing would have a place in my life if my net writing speed were 2 WPM.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:S"><p>Another strategy would be to write shorter docs. 
There's a style of doc where that works well, but I frequently write docs where I leverage my writing speed to discuss a problem that would be difficult to convincingly discuss without a long document.</p> <p>One example of a reason my docs end up long is that I frequently work on problems that span multiple levels of the stack, which means that I end up presenting data from multiple levels of the stack as well as providing enough context about why the problem at some level drives a problem up or down the stack for people who aren't deeply familiar with that level of the stack, which is necessary since few readers will have strong familiarity with every level needed to understand the problem.</p> <p>In most cases, there have been previous attempts to motivate/fund work on the problem that didn't get traction because there wasn't a case linking an issue at one level of the stack to important issues at other levels of the stack. I could avoid problems that span many levels of the stack, but there's a lot of low hanging fruit among those sorts of problems for technical and organizational reasons, so I don't think it makes sense to ignore them just because it takes a day to write a document explaining the problem (although it might make sense if it took ten days, at least in cases where people might be skeptical of the solution).</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:P">Of course, promotions are highly unfair and being more productive doesn't guarantee promotion. If I just look at what things are correlated with level, <a href="https://twitter.com/altluu/status/1448012821854257155">it's not even clear to me that productivity is more strongly correlated with level than height</a>, but among factors that are under my control, productivity is one of the easiest to change. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> The value of in-house expertise in-house/ Wed, 29 Sep 2021 00:00:00 +0000 in-house/ <p>An alternate title for this post might be, &quot;Twitter has a kernel team!?&quot;. At this point, I've heard that surprised exclamation enough that I've lost count of the number of times that's been said to me (I'd guess that it's more than ten but less than a hundred). If we look at trendy companies that are within a couple factors of two of Twitter's size (in terms of either market cap or number of engineers), they mostly don't have similar expertise, often as a result of path dependence — because they &quot;grew up&quot; in the cloud, they didn't need kernel expertise to keep the lights on the way an on prem company does. While that makes it socially understandable that people who've spent their careers at younger, trendier companies are surprised by Twitter having a kernel team, I don't think there's a technical reason for the surprise.</p> <p>Whether or not it has kernel expertise, a company Twitter's size is going to regularly run into kernel issues, from major production incidents to papercuts. Without a kernel team or the equivalent expertise, the company will muddle through the issues, running into unnecessary problems as well as taking an unnecessarily long time to mitigate incidents. 
As an example of a critical production incident, just because it's already been written up publicly, I'll cite <a href="https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug">this post</a>, which dryly notes:</p> <blockquote> <p>Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug</p> </blockquote> <p>What this implies but doesn't explicitly say is that this firewall misconfiguration was the most severe incident that's occurred during my time at Twitter and I believe it's actually the most severe outage that Twitter has had since 2013 or so. As a company, we would've still been able to mitigate the issue without a kernel team or another team with deep Linux expertise, but it would've taken longer to understand why the initial fix didn't work, which is the last thing you want when you're debugging a serious outage. Folks on the kernel team were already familiar with the various diagnostic tools and debugging techniques necessary to quickly understand why the initial fix didn't work, which is not common knowledge at some peer companies (I polled folks at a number of similar-scale peer companies to see if they thought they had at least one person with the knowledge necessary to quickly debug the bug and the answer was no at many companies).</p> <p>Another reason to have in-house expertise in various areas is that it easily pays for itself, which is a special case of <a href="sounds-easy/">the generic argument that large companies should be larger than most people expect because tiny percentage gains are worth a large amount in absolute dollars</a>. If, in the lifetime of a specialist team like the kernel team, a single person found something that persistently reduced <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> by 0.5%, that would pay for the team in perpetuity, and Twitter’s kernel team has found many such changes. In addition to <a href="https://patchwork.ozlabs.org/project/netdev/list/?submitter=211&amp;state=*&amp;archive=both&amp;param=4&amp;page=1">kernel patches</a> that sometimes have that kind of impact, people will also find configuration issues, etc., that have that kind of impact.</p> <p>So far, I've only talked about the kernel team because that's the one that most frequently elicits surprise from folks for merely existing, but I get similar reactions when people find out that Twitter has a bunch of ex-Sun JVM folks who worked on HotSpot, like Ramki Ramakrishna, Tony Printezis, and John Coomes. People wonder why a social media company would need such deep JVM expertise. As with the kernel team, companies our size that use the JVM run into weird issues and JVM bugs and it's helpful to have people with deep expertise to debug those kinds of issues. And, as with the kernel team, individual optimizations to the JVM can pay for the team in perpetuity. A concrete example is <a href="https://github.com/oracle/graal/pull/636">this patch by Flavio Brasil, which virtualizes compare and swap calls</a>.</p> <p>The context for this is that Twitter uses a lot of Scala. 
Despite a lot of claims otherwise, Scala uses more memory and is significantly slower than Java, which has a real cost if you use Scala at scale, enough that it makes sense to do optimization work to reduce the performance gap between idiomatic Scala and idiomatic Java.</p> <p>Before the patch, if you profiled our Scala code, you would've seen an unreasonably large amount of time spent in Future/Promise, including in cases where you might naively expect that the compiler would optimize the work away. One reason for this is that Futures use a <a href="https://en.wikipedia.org/wiki/Compare-and-swap">compare-and-swap</a> (CAS) operation that's opaque to JVM optimization. The patch linked above avoids CAS operations when the Future doesn't escape the scope of the method. <a href="https://github.com/twitter/util/commit/3245a8e1a98bd5eb308f366678528879d7140f5e">This companion patch</a> removes CAS operations in some places that are less amenable to compiler optimization. The two patches combined reduced the cost of typical major Twitter services using idiomatic Scala by 5% to 15%, paying for the JVM team in perpetuity many times over, and that wasn't even the biggest win Flavio found that year.</p>
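<p>To make the Future/CAS point a bit more concrete, here's a toy sketch of where the cost comes from. To be clear, this is not Twitter's <code>Future</code>/<code>Promise</code> implementation and it isn't what either of the linked patches does; it's just an illustration of the general idea that completing a promise that might be shared across threads requires an atomic compare-and-swap, while a promise that provably never escapes a method doesn't need one at all, which is the kind of transformation the patches get the runtime and library to do automatically:</p> <pre><code class="language-scala">import java.util.concurrent.atomic.AtomicReference

// Toy stand-in for a promise, used only to show where the CAS shows up.
// This is NOT Twitter's util.Future or either of the patches discussed above.
final class ToyPromise[A] {
  private val state = new AtomicReference[Option[A]](None)

  // Shared path: another thread might be racing to complete the same promise,
  // so completion has to go through an atomic compare-and-swap.
  def tryComplete(value: A): Boolean = state.compareAndSet(None, Some(value))

  def value: Option[A] = state.get()
}

object CasSketch extends App {
  // Escaping case: the promise is visible to other code, so the CAS is necessary.
  val shared = new ToyPromise[Int]
  shared.tryComplete(42)

  // Non-escaping case: if the promise never leaves this method, the same result
  // can be computed with a plain local value and no atomic operation at all.
  def localOnly(): Int = {
    val result = 42 // stands in for a promise that never escapes this scope
    result
  }

  println(s"shared: ${shared.value}, local: ${localOnly()}")
}
</code></pre> <p>The hard part, and the reason this is an argument for in-house JVM expertise rather than a weekend project, is doing the &quot;provably never escapes&quot; analysis safely and automatically instead of hand-rewriting call sites the way the toy example does.</p>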
<p>I'm not going to do a team-by-team breakdown of teams that pay for themselves many times over because there are so many of them, even if I limit the scope to &quot;teams that people are surprised that Twitter has&quot;.</p> <p>A related topic is how people talk about &quot;buy vs. build&quot; discussions. I've seen a number of discussions where someone has argued for &quot;buy&quot; because that would obviate the need for expertise in the area. This can be true, but I've seen this argued for much more often than it is true. An example where I think this tends to be untrue is with distributed tracing. <a href="tracing-analytics/">We've previously looked at some ways Twitter gets value out of tracing</a>, which came out of the vision Rebecca Isaacs put into place. On the flip side, when I talk to people at peer companies with similar scale, most of them have not (yet?) succeeded at getting significant value from distributed tracing. This is so common that I see a viral Twitter thread about how useless distributed tracing is more than once a year. Even though we went with the more expensive &quot;build&quot; option, just off the top of my head, I can think of multiple uses of tracing that have returned between 10x and 100x the cost of building out tracing, whereas people at a number of companies that have chosen the cheaper &quot;buy&quot; option commonly complain that tracing isn't worth it.</p> <p>Coincidentally, I was just talking about this exact topic to Pam Wolf, a civil engineering professor with experience in (civil engineering) industry on multiple continents, who had a related opinion. For large scale systems (projects), you need an in-house expert (owner's-side engineer) for each area that you don't handle in your own firm. While it's technically possible to hire yet another firm to be the expert, that's more expensive than developing or hiring in-house expertise and, in the long run, also more risky. That's pretty analogous to my experience working as an electrical engineer as well, where orgs that outsource functions to other companies without retaining an in-house expert pay a very high cost, and not just monetarily. They often ship sub-par designs with long delays on top of having high costs. &quot;Buying&quot; can and often does reduce the amount of expertise necessary, but it often doesn't remove the need for expertise.</p> <p>This is related to another abstract argument that's commonly made, that companies should concentrate on &quot;their area of comparative advantage&quot; or &quot;most important problems&quot; or &quot;core business need&quot; and outsource everything else. We've already seen a couple of examples where this isn't true because, at a large enough scale, it's more profitable to have in-house expertise than not regardless of whether or not something is core to the business (one could argue that all of the things that are moved in-house are core to the business, but that would make the concept of coreness useless). Another reason this abstract advice is too simplistic is that businesses can somewhat arbitrarily choose what their comparative advantage is. A large<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup> example of this would be Apple bringing CPU design in-house. Since acquiring PA Semi (formerly the team from SiByte and, before that, a team from DEC) for $278M, Apple has been producing <a href="https://twitter.com/danluu/status/1433297089866383364">the best chips in the phone and laptop power envelope by a pretty large margin</a>. But, before the purchase, there was nothing about Apple that made the purchase inevitable, that made CPU design an inherent comparative advantage of Apple. But if a firm can pick an area and make it an area of comparative advantage, saying that the firm should choose to concentrate on its comparative advantage(s) isn't very helpful advice.</p> <p>$278M is a lot of money in absolute terms, but as a fraction of Apple's resources, that was tiny, and much smaller companies also have the capability to do cutting edge work by devoting a small fraction of their resources to it, e.g., Twitter, for a cost that any $100M company could afford, <a href="https://twitter.com/danluu/status/1381687511362138113">created novel cache algorithms and data structures</a> and is doing other cutting edge cache work. Having great <a href="cache-incidents/">cache infra</a> isn't any more core to Twitter's business than creating a great CPU is to Apple's, but it is a lever that Twitter can use to make more money than it could otherwise.</p> <p>For small companies, it doesn't make sense to have in-house experts for everything the company touches, but companies don't have to get all that large before it starts making sense to have in-house expertise in their operating system, language runtime, and other components that people often think of as being fairly specialized. Looking back at Twitter's history, Yao Yue has noted that when she was working on cache in Twitter's early days (when we had ~100 engineers), she would regularly go to the kernel team for help debugging production incidents and that, in some cases, debugging could've easily taken 10x longer without help from the kernel team. 
Social media companies tend to have relatively high scale on a per-user and per-dollar basis, so not every company is going to need the same kind of expertise when they have 100 engineers, but there are going to be other areas that aren't obviously core business needs where expertise will pay off even for a startup of that size.</p> <p>Thanks to Ben Kuhn, Yao Yue, Pam Wolf, John Hergenroeder, Julien Kirch, Tom Brearley, and Kevin Burke for comments/corrections/discussion.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:L"><p>Some other large examples of this are Korean chaebols, like Hyundai. Looking at how Hyundai Group's companies are connected to Hyundai Motor Company isn't really the right lens with which to examine Hyundai, but I'm going to use that lens anyway since most readers of this blog are probably already familiar with Hyundai Motor and will not be familiar with how Korean chaebols operate.</p> <p>Speaking very roughly, with many exceptions, American companies have tended to take the advice to specialize and concentrate on their competencies, at least since the 80s. This is the opposite of the direction that Korean chaebols have gone. Hyundai not only makes cars, they make the steel their cars use, the robots they use to automate production, the cement used for their factories, the construction equipment used to build their factories, the containers and ships used to ship cars (which they also operate), the transmissions for their cars, etc.</p> <p>If we look at a particular component, say, their 8-speed transmission vs. the widely used and lauded ZF 8HP transmission, reviewers typically slightly prefer the ZF transmission. But even so, having good-enough in-house transmissions, as well as many other in-house components that companies would typically buy, doesn't exactly seem to be a disadvantage for Hyundai.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Measurement, benchmarking, and data analysis are underrated why-benchmark/ Fri, 27 Aug 2021 00:00:00 +0000 why-benchmark/ <p>A question I get asked with some frequency is: why bother measuring X, why not build something instead? More bluntly, in a recent conversation with a newsletter author, his comment on some future measurement projects I wanted to do (in the same vein as other projects like <a href="keyboard-v-mouse/">keyboard vs. mouse</a>, <a href="keyboard-latency/">keyboard</a>, <a href="term-latency/">terminal</a> and <a href="input-lag/">end-to-end</a> latency measurements), delivered with a smug look and a bit of contempt in the tone, was &quot;so you just want to get to the top of Hacker News?&quot;</p> <p>The implication for the former is that measuring is less valuable than building and for the latter that measuring isn't valuable at all (perhaps other than for fame), but I don't see measuring as lesser, let alone worthless. If anything, because measurement is, <a href="https://twitter.com/danluu/status/1082321431109795840">like writing</a>, not generally valued, it's much easier to find high ROI measurement projects than high ROI building projects.</p> <p>Let's start by looking at a few examples of high impact measurement projects. My go-to example for this is Kyle Kingsbury's work with <a href="https://jepsen.io">Jepsen</a>. Before Jepsen, a handful of huge companies (the now $1T+ companies that people are calling &quot;hyperscalers&quot;) had decently tested distributed systems. 
They mostly didn't talk about testing methods in a way that really caused the knowledge to spread to the broader industry. Outside of those companies, most distributed systems were, <a href="testing/">by my standards</a>, not particularly well tested.</p> <p>At the time, a common pattern in online discussions of distributed correctness was:</p> <p><strong>Person A</strong>: Database X corrupted my data.<br> <strong>Person B</strong>: It works for me. It's never corrupted my data.<br> <strong>A</strong>: How do you know? Do you ever check for data corruption?<br> <strong>B</strong>: What do you mean? I'd know if we had data corruption (alternate answer: <a href="https://mobile.twitter.com/danluu/status/918845240240410624">sure, we sometimes have data corruption, but it's probably a hardware problem and therefore not our fault</a>)</p> <p>Kyle's early work found critical flaws in nearly everything he tested, despite Jepsen being much less sophisticated then than it is now:</p> <ul> <li><a href="https://aphyr.com/posts/283-call-me-maybe-redis">Redis Cluster / Redis Sentinel</a>: &quot;we demonstrate Redis losing 56% of writes during a partition&quot;</li> <li><a href="https://aphyr.com/posts/284-call-me-maybe-mongodb">MongoDB</a>: &quot;In this post, we’ll see MongoDB drop a phenomenal amount of data&quot;</li> <li><a href="https://aphyr.com/posts/285-call-me-maybe-riak">Riak</a>: &quot;we’ll see how last-write-wins in Riak can lead to unbounded data loss&quot;</li> <li><a href="https://aphyr.com/posts/292-call-me-maybe-nuodb">NuoDB</a>: &quot;If you are considering using NuoDB, be advised that the project’s marketing and documentation may exceed its present capabilities&quot;</li> <li><a href="https://aphyr.com/posts/291-call-me-maybe-zookeeper">Zookeeper</a>: the one early Jepsen test of a distributed system that didn't find a catastrophic bug</li> <li><a href="https://aphyr.com/posts/315-call-me-maybe-rabbitmq">RabbitMQ clustering</a>: &quot;RabbitMQ lost ~35% of acknowledged writes ... This is not a theoretical problem. I know of at least two RabbitMQ deployments which have hit this in production.&quot;</li> <li><a href="https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul">etcd &amp; Consul</a>: &quot;etcd’s registers are not linearizable . . . 'consistent' reads in Consul return the local state of any node that considers itself a leader, allowing stale reads.&quot;</li> <li><a href="https://aphyr.com/posts/317-call-me-maybe-elasticsearch">ElasticSearch</a>: &quot;the health endpoint will lie. It’s happy to report a green cluster during split-brain scenarios . . . 645 out of 1961 writes acknowledged then lost.&quot;</li> </ul> <p>Many of these problems had existed for quite a while:</p> <blockquote> <p>What’s really surprising about this problem is that it’s gone unaddressed for so long. The original issue was reported in July 2012; almost two full years ago. There’s no discussion on the website, nothing in the documentation, and users going through Elasticsearch training have told me these problems weren’t mentioned in their classes.</p> </blockquote> <p>Kyle then quotes a number of users who ran into issues in production and then dryly notes:</p> <blockquote> <p>Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present</p> </blockquote> <p>Although we don't have an A/B test of universes where Kyle exists vs. 
not and can't say how long it would've taken for distributed systems to get serious about correctness in a universe where Kyle didn't exist, from having spent many years looking at how developers treat correctness bugs, I would bet on distributed systems having rampant correctness problems until someone like Kyle came along. The typical response that I've seen when a catastrophic bug is reported is that the project maintainers will assume that the bug report is incorrect (and you can see many examples of this if you look at responses from the first few years of Kyle's work). When the reporter doesn't have a repro for the bug, which is quite common when it comes to distributed systems, the bug will be written off as non-existent.</p> <p>When the reporter does have a repro, the next line of defense is to argue that the behavior is fine (you can also see many examples of this by looking at responses to Kyle's work). Once the bug is acknowledged as real, the next defense is to argue that the bug doesn't need to be fixed because it's so uncommon (e.g., &quot;<a href="https://news.ycombinator.com/item?id=5913610">It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit? Are you building a NASA Shuttle Crawler-transporter to get groceries?</a>&quot;). And then, after it's acknowledged that the bug should be fixed, the final line of defense is to argue that the project takes correctness very seriously and there's really nothing more that could have been done; development and test methodology doesn't need to change because it was just a fluke that the bug occurred, and analogous bugs won't occur in the future even without changes in methodology.</p> <p>Kyle's work blew through these defenses and, without something like it, my opinion is that we'd still see these as the main defenses against reports of distributed systems bugs (as opposed to test methodologies that can actually produce pretty reliable systems).</p> <p>That's one particular example, but I find that it's generally true that, in areas where no one is publishing measurements/benchmarks of products, the products are generally sub-optimal, often in ways that are relatively straightforward to fix once measured. Here are a few examples:</p> <ul> <li>Keyboards: after I published <a href="keyboard-latency/">this post on keyboard latency</a>, at least one major manufacturer that advertises high-speed gaming devices actually started optimizing input device latency. At the time, so few people measured keyboard latency that I could only find one other person who'd done a measurement (I wanted to look for other measurements because my measured results seemed so high as to be implausible, and the one measurement I could find online was in the same range as my measurements). 
Now, every major manufacturer of gaming keyboards and mice has fairly low latency devices available whereas, before, companies making gaming devices were focused on buzzword optimizations that had little to no impact (like higher speed USB polling)</li> <li>Computers: after I published some other posts on <a href="input-lag/">computer</a> <a href="term-latency/">latency</a>, an engineer at a major software company that wasn't previously doing serious UI latency work told me that some engineers had started measuring and optimizing UI latency; also, the author of alacritty <a href="https://github.com/alacritty/alacritty/issues/673">filed this ticket</a> on how to reduce alacritty latency</li> <li>Vehicle headlights: Jennifer Stockburger has noted that, when Consumer Reports started testing headlights, engineers at auto manufacturers thanked CR for giving them the ammunition they needed to make headlights more effective; previously, they would lose the argument to designers who wanted nicer looking but less effective headlights since making cars safer by designing better headlights is a hard sell because there's no business case, but making cars score higher on Consumer Reports reviews allowed them to sometimes win the argument. Without third-party measurements, a business oriented car exec has no reason to listen to engineers because almost no new car buyers will do anything resembling decent testing of how well their headlights illuminate the road and even fewer buyers will test how much the headlights blind oncoming drivers, so <a href="https://mastodon.social/@danluu/111802159638869338">designers are left unchecked to create the product they think looks best regardless of effectiveness</a></li> <li>Vehicle <a href="https://en.wikipedia.org/wiki/Anti-lock_braking_system">ABS</a>: after Consumer Reports and Car and Driver found that the Tesla Model 3 had extremely long braking distances (152 ft. from 60mph and 196 ft. from 70mph), Tesla updated the algorithms used to modulate the brakes, which improved braking distances enough that Tesla went from worst in class to better than average</li> <li>Vehicle impact safety: Other than Volvo, car manufacturers generally design their cars to get the highest possible score on published crash tests; <a href="car-safety/">they'll add safety as necessary to score well on new tests when they're published, but not before</a></li> </ul> <p>Anyone could've done the projects above (while Consumer Reports buys the cars they test, some nascent car reviewers rent cars on Turo)!</p> <p>This post has explained why measuring things is valuable but, to be honest, the impetus for my measurements is curiosity. I just want to know the answer to a question. I did this long before I had a blog and I often don't write up my results even now that I have a blog. 
But even if you have no curiosity about what's actually happening when you interact with the world and you're &quot;just&quot; looking for something useful to do, the lack of measurements of almost everything means that it's easy to find high ROI measurement projects, at least in terms of impact on the world — if you want to make money, building something is probably easier to monetize.</p> <h3 id="appendix-so-you-just-want-to-get-to-the-top-of-hacker-news">Appendix: &quot;so you just want to get to the top of Hacker News?&quot;</h3> <p>When I look at posts that I enjoy reading that make it to the top of HN, like <a href="https://www.chrisfenton.com/">Chris Fenton's projects</a> or <a href="https://www.windytan.com/">Oona Raisanen's projects</a>, I think it's pretty clear that they're not motivated by HN or other fame since they were doing these interesting projects long before their blogs were a hit on HN or other social media. I don't know them, but if I had to guess why they do their projects, it's primarily because they find it fun to work on the kinds of projects they work on.</p> <p>I obviously can't say that no one works on personal projects with the primary goal of hitting the top of HN but, as a motivation, it's so inconsistent with the most obvious explanations for the personal project content I read on HN (that someone is having fun, is curious, etc.) that I find it a bit mind boggling that someone would think this is a plausible imputed motivation.</p> <h3 id="appendix-the-motivation-for-my-measurement-posts">Appendix: the motivation for my measurement posts</h3> <p><a href="https://en.wikipedia.org/wiki/The_Death_of_the_Author">There's a sense in which it doesn't really matter why I decided to write these posts</a>, but if I were reading someone else's post on this topic, I'd still be curious what got them writing, so here's what prompted me to write my measurement posts (which, for the purposes of this list, include posts where I collate data and don't do any direct measurement).</p> <ul> <li><a href="car-safety/">danluu.com/car-safety</a>: I was thinking about buying a car and wanted to know if I should expect significant differences in safety between manufacturers given that cars mostly get top marks on tests done in the U.S. 
<ul> <li>This wasn't included in the post because I thought it was too trivial to include (because the order of magnitude is obvious even without carrying out the computation), but I also computed the probability of dying in a car accident as well as the expected change in life expectancy between an old used car and a new-ish used car</li> </ul></li> <li><a href="cli-complexity/">danluu.com/cli-complexity</a>: I had this idea when I saw something by Gary Bernhardt where he showed off how to count the number of single-letter command line options that <code>ls</code> has, which made me wonder if that was a recent change or not</li> <li><a href="overwatch-gender/">danluu.com/overwatch-gender</a>: I had just seen two gigantic reddit threads debating whether or not there's a gender bias in how women are treated in online games and figured that I could get data on the matter in less time than was spent by people writing comments in those threads</li> <li><a href="input-lag/">danluu.com/input-lag</a>: I wanted to know if I could trust my feeling that modern computers that I use are much higher latency than older devices that I'd used</li> <li><a href="keyboard-latency/">danluu.com/keyboard-latency</a>: I wanted to know how much latency came from keyboards (display latency is already well tested by <a href="https://blurbusters.com">https://blurbusters.com</a>)</li> <li><a href="bad-decisions/">danluu.com/bad-decisions</a>: I saw a comment by someone in the rationality community defending bad baseball coaching decisions, saying that they're not a big deal because they only cost you maybe four games a year, and I wanted to know how big a deal bad coaching decisions really were</li> <li><a href="android-updates/">danluu.com/android-updates</a>: I was curious how many insecure Android devices are out there due to most Android phones not being updatable</li> <li><a href="filesystem-errors/">danluu.com/filesystem-errors</a>: I was curious how much filesystems had improved with respect to data corruption errors found by a 2005 paper</li> <li><a href="term-latency/">danluu.com/term-latency</a>: I felt like terminal benchmarks were all benchmarking something that's basically irrelevant to user experience (throughput) and wanted to know what it would look like if someone benchmarked something that might matter more; I also wanted to know if my feeling that iTerm2 was slow was real or my imagination</li> <li><a href="keyboard-v-mouse/">danluu.com/keyboard-v-mouse</a>: the most widely cited sources for keyboard vs. 
mousing productivity were pretty obviously bogus as well as being stated with extremely high confidence; I wanted to see if non-bogus tests would turn up the same results or different results</li> <li><a href="web-bloat/">danluu.com/web-bloat</a>: I took a road trip across the U.S., where the web was basically unusable, and wanted to quantify the unusability of the web without access to very fast internet</li> <li><a href="bimodal-compensation/">danluu.com/bimodal-compensation</a>: I was curious if we were seeing a hollowing out of mid-tier jobs in programming like we saw with law jobs</li> <li><a href="yegge-predictions/">danluu.com/yegge-predictions</a>: I had the impression that Steve Yegge made unusually good predictions about the future of tech and wanted to see if my impression was correct</li> <li><a href="postmortem-lessons/">danluu.com/postmortem-lessons</a>: I wanted to see what data was out there on postmortem causes to see if I could change how I operate and become more effective</li> <li><a href="boring-languages/">danluu.com/boring-languages</a>: I was curious how much of the software I use was written in boring, old, languages</li> <li><a href="blog-ads/">danluu.com/blog-ads</a>: I was curious how much money I could make if I wanted to monetize the blog</li> <li><a href="everything-is-broken/">danluu.com/everything-is-broken</a>: I wanted to see if my impression of how many bugs I run into was correct. Many people told me that the idea that people run into a lot of software bugs on a regular basis was an illusion caused by selective memory and I wanted to know if that was the case for me or not</li> <li><a href="integer-overflow/">danluu.com/integer-overflow</a>: I had a discussion with a language designer who was convinced that integer overflow checking was too expensive to do for an obviously bogus reason (because it's expensive if you do a benchmark that's 100% integer operations) and I wanted to see if my quick mental-math estimate of overhead was the right order of magnitude</li> <li><a href="octopress-speedup/">danluu.com/octopress-speedup</a>: after watching a talk by Dan Espeset, I wanted to know if there were easy optimizations I could do to my then-Octopress site</li> <li><a href="broken-builds/">danluu.com/broken-builds</a>: I had a series of discussions with someone who claimed that their project had very good build uptime despite it being broken regularly; I wanted to know if their claim was correct with respect to other, similar, projects</li> <li><a href="empirical-pl/">danluu.com/empirical-pl</a>: I wanted to know what studies backed up claims from people who said that there was solid empirical proof of the superiority of &quot;fancy&quot; type systems</li> <li><a href="2choices-eviction/">danluu.com/2choices-eviction</a>: I was curious what would happen if &quot;two random choices&quot; was applied to cache eviction</li> <li><a href="gender-gap/">danluu.com/gender-gap</a>: I wanted to verify the claims in an article that claimed that there is no gender gap in tech salaries</li> <li><a href="3c-conflict/">danluu.com/3c-conflict</a>: I wanted to create a simple example illustrating the impact of alignment on memory latency</li> </ul> <p>BTW, writing up this list made me realize that a narrative I had in my head about how and when I started really looking at data seriously must be wrong. 
I thought that this was something that came out of my current job, but that clearly cannot be the case since a decent fraction of my posts from before my current job are about looking at data and/or measuring things (and I didn't even list some of the data-driven posts where I just read some papers and look at what data they present). After seeing the list above, I realized that I did projects like the above not only long before I had the job, but long before I had this blog.</p> <h3 id="appendix-why-you-can-t-trust-some-reviews">Appendix: why you can't trust some reviews</h3> <p>One thing that both increases and decreases the impact of doing good measurements is that most measurements that are published aren't very good. This increases the personal value of understanding how to do good measurements and of doing good measurements, but it blunts the impact on other people, since people generally don't understand what makes measurements invalid and don't have a good algorithm for deciding which measurements to trust.</p> <p>There are a variety of reasons that published measurements/reviews are often problematic. A major issue with reviews is that, in some industries, reviewers are highly dependent on manufacturers for review copies.</p> <p>Car reviews are one of the most extreme examples of this. Consumer Reports is the only major reviewer that independently sources their cars, which often causes them to disagree with other reviewers since they'll try to buy the trim level of the car that most people buy, which is often quite different from the trim level reviewers are given by manufacturers. Consumer Reports also generally manages to avoid reviewing cars that are unrepresentatively picked or tuned. There have been a couple of cases where Consumer Reports reviewers (who also buy the cars) suspected that the dealer had realized they worked for Consumer Reports and were then told that the dealer needed to keep the car overnight before handing over the car they'd just bought; when that's happened, the reviewer has walked away from the purchase.</p> <p>There's pretty significant copy-to-copy variation between cars and the cars reviewers get tend to be ones that were picked to avoid cosmetic issues (paint problems, panel gaps, etc.) as well as checked for more serious issues. Additionally, cars can have their software and firmware tweaked (e.g., it's common knowledge that review copies of BMWs have an engine &quot;tune&quot; that would void your warranty if you modified your car similarly).</p> <p>Also, because Consumer Reports isn't getting review copies from manufacturers, they don't have to pull their punches and can write reviews that are highly negative, something you rarely see from car magazines and don't often see from car youtubers, where you generally have to read between the lines to get an honest review since a review that explicitly mentions negative things about a car can mean losing access (the youtuber who goes by &quot;savagegeese&quot; has mentioned having trouble getting access to cars from some companies after giving honest reviews).</p> <p>Camera lenses are another area where it's been documented that reviewers get unusually good copies of the item. There's tremendous copy-to-copy variation between lenses so vendors pick out good copies and let reviewers borrow those. 
In many cases (e.g., any of the FE mount ZA Zeiss lenses or the Zeiss lens on the RX-1), based on how many copies of a lens people need to try and return to get a good copy, it appears that the median copy of the lens has noticeable manufacturing defects and that, in expectation, perhaps one in ten lenses has no obvious defect (this could also occur if only a few copies were bad and those were serially returned, but very few photographers really check to see if their lens has issues due to manufacturing variation). Because it's so expensive to obtain a large number of lenses, the amount of copy-to-copy variation was unquantified until <a href="https://www.lensrentals.com/">lensrentals</a> started measuring it; they've found that different manufacturers can have very different levels of copy-to-copy variation, which I hope will apply pressure to lens makers that are currently selling a lot of bad lenses while selecting good ones to hand to reviewers.</p> <p>Hard drives are yet another area where it's been documented that reviewers get copies of the item that aren't representative. Extreme Tech has reported, multiple times, that Adata, Crucial, and Western Digital have handed out review copies of SSDs that are not what you get as a consumer. One thing I find interesting about that case is that Extreme Tech says</p> <blockquote> <p>Agreeing to review a manufacturer’s product is an extension of trust on all sides. The manufacturer providing the sample is trusting that the review will be of good quality, thorough, and objective. The reviewer is trusting the manufacturer to provide a sample that accurately reflects the performance, power consumption, and overall design of the final product. When readers arrive to read a review, they are trusting that the reviewer in question has actually tested the hardware and that any benchmarks published were fairly run.</p> </blockquote> <p>This makes it sound like the reviewer's job is to take a sample handed to them by the vendor on trust and then run good benchmarks, absolving the reviewer of the responsibility of obtaining representative devices and ensuring that they're representative. I'm reminded of the SRE motto, &quot;hope is not a strategy&quot;. Trusting vendors is not a strategy. We know that vendors will lie and cheat to look better at benchmarks. Saying that it's a vendor's fault for lying or cheating can shift the blame, but it won't result in reviews being accurate or useful to consumers.</p> <p>We've only discussed a few specific areas where there's published evidence that reviews cannot be trusted because they're compromised by companies, but this isn't anything specific to those industries. As consumers, we should expect that any review that isn't performed by a trusted, independent agency that purchases its own review copies has been compromised and is not representative of the median consumer experience.</p> <p>Another issue with reviews is that most online reviews that are highly ranked in search are really just SEO affiliate farms.</p> <p>A more general issue is that reviews are also affected by the exact same problem as items that are not reviewed: people generally can't tell which reviews are actually good and which are not, so review sites are selected on things other than the quality of the review. A prime example of this is Wirecutter, which is so popular among tech folks that noting that so many tech apartments in SF have identical Wirecutter recommended items is a tired joke. 
For people who haven't lived in SF, you can get a peek into the mindset by reading the comments <a href="https://www.reddit.com/r/fatFIRE/comments/iioq01/impossible_to_avoid_lifestyle_inflation_with/">on this post about how it's &quot;impossible&quot; to not buy the wirecutter recommendation for anything</a>, which is full of comments from people who reassure the poster that, due to the high value of the poster's time, it would be irresponsible to do anything else.</p> <p>The thing I find funny about this is that if you take benchmarking seriously (in any field) and just read the methodology for the median Wirecutter review, without even trying out the items reviewed you can see that the methodology is poor and that they'll generally select items that are mediocre and sometimes even worst in class. A thorough exploration of this really deserves its own post, but I'll cite one example of poorly reviewed items here: in <a href="https://benkuhn.net/vc">https://benkuhn.net/vc</a>, Ben Kuhn looked into how to create a nice video call experience, which included trying out a variety of microphones and webcams. Naturally, Ben tried Wirecutter's recommended microphone and webcam. The webcam was quite poor, no better than using the camera from an ancient 2014 iMac or his 2020 Macbook (and, to my eye, actually much worse; more on this later). And the microphone was roughly comparable to using the built-in microphone on his laptop.</p> <p>I have a lot of experience with Wirecutter's recommended webcam because so many people have it and it is shockingly bad in a distinctive way. Ben noted that, if you look at a still image, the white balance is terrible when used in the house he was in, and if you talk to other people who've used the camera, that is a common problem. But the issue I find to be worse is that, if you look at the video, under many conditions (and I think most, given how often I see this), the webcam will refocus regularly, making the entire video flash out of and then back into focus (another issue is that it often focuses on the wrong thing, but that's less common and I don't see that one with everybody who I talk to who uses Wirecutter's recommended webcam). I actually just had a call yesterday with a friend of mine who was using a different setup than the one I'd normally seen him with (the mediocre but perfectly acceptable Macbook webcam). His video was going in and out of focus every 10-30 seconds, so I asked him if he was using Wirecutter's recommended webcam and of course he was; what other webcam that someone in tech would buy has that problem?</p> <p>This level of review quality is pretty typical for Wirecutter reviews and Wirecutter appears to generally be the most respected and widely used review site among people in tech.</p> <h3 id="appendix-capitalism">Appendix: capitalism</h3> <p>When I was in high school, there was a clique of proto-edgelords who did things like read The Bell Curve and argue its talking points to anyone who would listen.</p> <p>One of their favorite topics was how the free market would naturally cause companies that make good products to rise to the top and companies that make poor products to disappear, resulting in things generally being safe, a good value, and so on and so forth. I still commonly see this opinion espoused by people working in tech, including people who fill their condos with Wirecutter recommended items. 
I find the juxtaposition of people arguing that the market will generally result in products being good while they themselves buy overpriced garbage to be deliciously ironic. To be fair, it's not all overpriced garbage. Some of it is overpriced mediocrity and some of it is actually good; it's just that it's not too different from what you'd get if you just naively bought random stuff off of Amazon without reading third-party reviews.</p> <p>For a related discussion, <a href="tech-discrimination/">see this post on people who argue that markets eliminate discrimination even as they discriminate</a>.</p> <h3 id="appendix-other-examples-of-the-impact-of-measurement-or-lack-thereof">Appendix: other examples of the impact of measurement (or lack thereof)</h3> <ul> <li>Electronic stability control <ul> <li>Toyota RAV4: <a href="https://www.youtube.com/watch?v=j3qrCNR4U9A">before</a> <a href="https://www.youtube.com/watch?v=VtQ24W_lamY">reviews</a> <a href="https://www.youtube.com/watch?v=xSRCJFCmvTk">and after reviews</a></li> <li>Toyota Hilux <a href="https://www.youtube.com/watch?v=xoHbn8-ROiQ">before reviews</a> <a href="https://www.youtube.com/watch?v=y2QSogJj3ec"> and after reviews</a></li> <li>Nissan Rogue: major improvements after Consumer Reports found issues with stability control.</li> <li>Jeep Grand Cherokee: <a href="https://www.youtube.com/watch?v=zaYFLb8WMGM">before reviews</a> <a href="https://www.youtube.com/watch?v=_xFPdfcNmVc">and after reviews</a></li> </ul></li> <li>Some boring stuff at work: a year ago, I wrote <a href="metrics-analytics/">this</a> <a href="tracing-analytics/">pair</a> of posts on observability infrastructure at work. At the time, that work had driven 8 figures of cost savings and that's now well into the 9 figure range. This probably deserves its own post at some point, but the majority of the work was straightforward once someone could actually observe what's going on. <ul> <li>Relatedly: after seeing a few issues impact production services, I wrote a little (5k LOC) parser to parse every line seen in various host-level logs as a check to see what issues were logged that we weren't catching in our metrics. This found major issues in clusters that weren't using an automated solution to catch and remediate host-level issues; for some clusters, over 90% of hosts were actively corrupting data or had a severe performance problem. 
This led to the creation of a new team to deal with issues like this</li> </ul></li> <li>Tires <ul> <li>Almost all manufacturers other than Michelin see severely reduced wet, snow, and ice performance as the tire wears <ul> <li>Jason Fenske says that a technical reason for this (among others) is that the <a href="https://en.wikipedia.org/wiki/Siping_(rubber)">sipes</a> that improve grip are generally not cut to the full depth because doing so significantly increases manufacturing cost because the device that cuts the sipes will need to be stronger as well as wear out faster</li> <li>A non-technical reason for this is that a lot of published tire tests are done on new tires, so tire manufacturers can get nearly the same marketing benchmark value by creating only partial-depth sipes</li> </ul></li> <li>As Tire Rack has increased in prominence, some tire manufacturers have made their siping more multi-directional to improve handling while cornering instead of having siping mostly or only perpendicular to the direction of travel, which mostly only helps with acceleration and braking (Consumer Reports snow and ice scores are based on accelerating in a straight line on snow and braking in a straight line on ice, respectively, whereas Tire Rack's winter test scores emphasize all-around snow handling)</li> <li>An example of how measurement impact is bounded: Farrell Scott, the Project Category Manager for Michelin winter tires, said that, when designing the successor to the Michelin X-ICE Xi3, one of the primary design criteria was to change how the tire looked because Michelin found that, despite the X-ICE Xi3 being up there with the Bridgestone Blizzak WS80 as the best all-around winter tire (slightly better at some things, slightly worse at others), potential customers often chose other tires because those tires looked more like the popular conception of a winter tire, with &quot;aggressive&quot; looking tread blocks (this is one thing the famous Nokian Hakkapeliitta tire line was much better at). They also changed the name; instead of incrementing the number, the new tire was called Michelin X-ICE SNOW, to emphasize that the tire is suitable for snow as well as ice.</li> <li>Although some consumers do read reviews, many (and probably most) don't!</li> </ul></li> <li>HDMI to USB converters for live video <ul> <li>If you read the docs for the <a href="https://amzn.to/3gBdiRE">Camlink 4k</a>, they note that the device should use bulk transfers on Windows and isochronous transfers on Mac (if you use their software, it will automatically make this adjustment) <ul> <li>Fabian Giesen informed me that this may be for the same reason that, when some colleagues of his tested a particular USB3 device on Windows, only 1 out of 5 chipsets tested supported isochronous properly (the rest would do things like bluescreen or hang the machine)</li> </ul></li> <li>I've tried miscellaneous cheap HDMI to USB converters as alternatives to the Camlink 4k, and I have yet to find a cheap one that generally works across a wide variety of computers. They will generally work with at least one computer I have access to with at least one piece of software I want to use, but will simply not work or provide very distorted video in some cases. Perhaps someone should publish benchmarks on HDMI to USB converter quality!</li> </ul></li> <li>HDMI to VGA converters <ul> <li>Many of these get very hot and then overheat and stop working in 15 minutes to 2 hours. Some aren't even warm to the touch. 
Good luck figuring out which ones work!</li> </ul></li> <li>Water filtration <ul> <li>Brita claims that their &quot;longlast&quot; filters remove lead. However, two different Amazon reviewers indicated that they measured lead levels in contaminated water before and after and found that lead levels weren't reduced</li> <li>It used to be the case that water flowed very slowly through &quot;longlast&quot; filters and this was a common complaint of users who bought the filters. Now some (or perhaps all) &quot;longlast&quot; filters filter water much more quickly but don't filter to Brita's claimed levels of filtration</li> </ul></li> <li>Sports refereeing <ul> <li><a href="https://twitter.com/umpireauditor/status/1515648000076312579">Baseball umpires are famously bad at making correct calls and we've had the technology to make nearly flawless calls for decades</a>, but many people argue that having humans make incorrect calls is &quot;part of the game&quot; and the game wouldn't be as real if computers were in the loop on calls</li> <li>Some sports have partially bowed to pressure to make correct rulings when possible, e.g., in football, NFL coaches started being allowed to challenge two calls per game based on video footage in 1999 (3 starting in 2004, if the first two challenges were successful), copying the system that the niche USFL created in 1985</li> </ul></li> <li>Storage containers <ul> <li>Rubbermaid storage containers (Roughneck &amp; Toughneck) used to be famous for their quality and durability. Of course, it was worth more in the short term to cut back on the materials used and strength of the containers, so another firm bought the brand and continues to use it, producing similar looking containers that are famous for buckling if you stack containers on top of each other, which is the entire point of the nestable / stackable containers. I haven't seen anyone really benchmark storage containers seriously for how well they handle load so, in general, you can't really tell if this is going to happen to you or not.</li> </ul></li> <li>Speaker vibration isolation solutions <ul> <li><a href="https://ethanwiner.com/speaker_isolation.htm">Ethan Winer concludes that these are audiophile placebo</a> <br /></li> </ul></li> </ul> <p><small>Thanks to Fabian Giesen, Ben Kuhn, Yuri Vishnevsky, @chordowl, Seth Newman, Justin Blank, Per Vognsen, John Hergenroeder, Pam Wolf, Ivan Echevarria, and Jamie Brandon for comments/corrections/discussion.</small></p> Against essential and accidental complexity essential-complexity/ Tue, 29 Dec 2020 00:00:00 +0000 essential-complexity/ <p>In the classic 1986 essay, <a href="http://worrydream.com/refs/Brooks-NoSilverBullet.pdf">No Silver Bullet</a>, Fred Brooks argued that there is, in some sense, not that much that can be done to improve programmer productivity. His line of reasoning is that programming tasks contain a core of essential/conceptual<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup> complexity that's fundamentally not amenable to attack by any potential advances in technology (such as languages or tooling). 
He then uses an <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> argument, saying that because 1/X of complexity is essential, it's impossible to ever get more than a factor of X improvement via technological improvements (e.g., if half of all complexity were essential, then even eliminating every bit of accidental complexity would yield at most a 2x speedup).</p> <p>Towards the end of the essay, Brooks claims that at least 1/2 (most) of complexity in programming is essential, bounding the potential improvement remaining for all technological programming innovations combined to, at most, a factor of 2<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>:</p> <blockquote> <p>All of the technological attacks on the accidents of the software process are fundamentally limited by the productivity equation:</p> <p>Time of task = Sum over i { Frequency_i Time_i }</p> <p>If, as I believe, the conceptual components of the task are now taking most of the time, then no amount of activity on the task components that are merely the expression of the concepts can give large productivity gains.</p> </blockquote> <p>Brooks states a bound on how much programmer productivity can improve. But, in practice, to state this bound correctly, one would have to be able to conceive of problems that no one would reasonably attempt to solve due to the amount of friction involved in solving the problem with current technologies.</p> <p>Without being able to predict the future, this is impossible to estimate. If we knew the future, it might turn out that there's some practical limit on how much computational power or storage programmers can productively use, bounding the resources available to a programmer, but getting a bound on the amount of accidental complexity would still require one to correctly reason about how programmers are going to be able to use zillions times more resources than are available today, which is so difficult we might as well call it impossible.</p> <p>Moreover, for each class of tool that could exist, one would have to effectively anticipate all possible innovations. Brooks' strategy for this was to look at existing categories of tools and state, for each, that they would be ineffective or that they were effective but played out. This was wrong not only because it underestimated gains from classes of tools that didn't exist yet, <a href="butler-lampson-1999/">weren't yet effective</a>, or that he wasn't familiar with (e.g., he writes off formal methods, but it doesn't even occur to him to mention fuzzers, static analysis tools that don't fully formally verify code, tools like valgrind, etc.) but also because Brooks thought that every class of tool where there was major improvement was played out and it turns out that none of them were. For example, Brooks wrote off programming languages as basically done, just before the rise of &quot;scripting languages&quot; as well as just before GC languages took over the vast majority of programming<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">3</a></sup>. Although you will occasionally hear statements like this, not many people will volunteer to write a webapp in C on the theory that a modern language can't offer more than a 2x gain over C.</p> <p>Another one Brooks writes off is AI, saying &quot;The techniques used for speech recognition seem to have little in common with those used for image recognition, and both are different from those used in expert systems&quot;. But, of course, this is no longer true — neural nets are highly effective for both image recognition and speech recognition. 
Whether or not they'll be highly effective as a programming tool is to be determined, but a lynchpin of Brooks's argument against AI has been invalidated and it's not a stretch to think that a greatly improved GPT-2 could give significant productivity gains to programmers. Of course, it's not reasonable to expect that Brooks could've foreseen neural nets becoming effective for both speech and image recognition, but that's exactly what makes it unreasonable for Brooks to write off all future advances in AI as well as every other field of computer science.</p> <p>Brooks also underestimates gains from practices and tooling that enables practices. Just for example, looking at what old school programming gurus advocated, we have <a href="https://twitter.com/danluu/status/885214004649615360">Ken Thompson arguing that language safety is useless</a> and that bugs happen because people write fragile code, which they should not do if they don't want to have bugs, and <a href="https://mastodon.social/@danluu/110213164472670607">Jamie Zawinski arguing that, when on a tight deadline, automated testing is a waste of time and &quot;there’s a lot to be said for just getting it right the first time&quot; without testing</a>. Brooks acknowledges the importance of testing, but the only possible improvement to testing that he mentions is expert systems that could make testing easier for beginners. If you look at the complexity of moderately large scale modern software projects, they're well beyond any software project that had been seen in the 80s. If you really think about what it would mean to approach these projects using old school correctness practices, I think the speedup from those sorts of practices to modern practices is infinite for a typical team since most teams using those practices would fail to produce a working product at all if presented with a problem that many big companies have independently solved, e.g., produce a distributed database with some stated SLO. Someone could dispute the infinite speedup claim, but anyone who's worked on a complex project that's serious about correctness will have used <a href="testing/">tools and techniques that result in massive development speedups</a>, easily more than 2x compared to 80s practices, a possibility that didn't seem to occur to Brooks as it appears that Brooks thought that serious testing improvements were not possible due to the essential complexity involved in testing.</p> <p>Another basic tooling/practice example would be version control. A version control system that supports multi-file commits, branches, automatic merging that generally works as long as devs don't touch the same lines, etc., is a fairly modern invention. <a href="microsoft-culture/">During the 90s, Microsoft was at the cutting edge of software development and they didn't manage to get a version control system that supported the repo size they needed (30M LOC for Win2k development) and supported branches until after Win2k</a>. Branches were simulated by simply copying the entire source tree and then manually attempting to merge copies of the source tree. Special approval was required to change the source tree and, due to the pain of manual merging, the entire Win2k team (5000 people, including 1400 devs and 1700 testers) could only merge 100 changes per day on a good day (0 on a bad day when the build team got stalled due to time spent fixing build breaks). 
This was a decade after Brooks was writing and there was still easily an order of magnitude speedup available from better version control tooling, test tooling and practices, machine speedups allowing faster testing, etc. Note that, in addition to not realizing that version control and test tooling would later result in massive productivity gains, Brooks claimed that hardware speedups wouldn't make developers significantly more productive even though hardware speed was noted to be a major limiting factor in Win2k development velocity. Brooks couldn't conceive of anyone building a project as complex as Win2k, which could really utilize faster hardware. Of course, using the tools and practices of Brooks's time, it was practically impossible to build a project as complex as Win2k, but tools and practices advanced so quickly that it was possible only a decade later even if development velocity moved in slow motion compared to what we're used to today due to &quot;stone age&quot; tools and practices.</p> <p>To pick another sub-part of the above, Brooks didn't list CI/CD as a potential productivity improvement because Brooks couldn't even imagine ever having tools that could possibly enable modern build practices. Writing in 1995, Brooks mentions that someone from Microsoft told him that they build nightly. To that, Brooks says that it may be too much work to enable building (at least) once a day, noting that Bell Northern Research, quite reasonably, builds weekly. Shortly after Brooks wrote that, Google was founded and engineers at Google couldn't even imagine settling for a setup like Microsoft had, let alone building once a week. They had to build a lot of custom software to get a monorepo of Google's scale onto what would be considered modern practices today, but they were able to do it. A startup that I worked for that was founded in 1995 also built out its own CI infra that allowed for constant merging and building from HEAD because that's what anyone who was looking at what could be done, instead of assuming that everything that could be done had already been done, would do. For large projects, just having CI/CD and maintaining a clean build, compared to building weekly, should easily be a 2x productivity improvement, which is already as large as Brooks's claim that half of complexity is essential would allow for all improvements combined. It's good that engineers at Google, the startup I worked for, as well as many other places didn't believe that a 2x improvement was impossible and actually built tools that enabled massive productivity improvements.</p> <p>In some sense, looking at No Silver Bullet is quite similar to when <a href="cli-complexity/#maven">we looked at Unix and found the Unix mavens saying that we should write software like they did in the 70s</a> and that <a href="https://twitter.com/danluu/status/885214004649615360">the languages they invented are as safe as any language can be</a>. Long before computers were invented, elders have been telling the next generation that they've done everything that there is to be done and that the next generation won't be able to achieve more. In the computer age, we've seen countless similar predictions outside of programming as well, such as Cliff Stoll's now-infamous prediction that the internet wouldn't change anything:</p> <blockquote> <p>Visionaries see a future of telecommuting workers, interactive libraries and multimedia classrooms. They speak of electronic town meetings and virtual communities. 
Commerce and business will shift from offices and malls to networks and modems. And the freedom of digital networks will make government more democratic.</p> <p>Baloney. Do our computer pundits lack all common sense? The truth is no online database will replace your daily newspaper ... How about electronic publishing? Try reading a book on disc. At best, it's an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can't tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we'll soon buy books and newspapers straight over the Internet. Uh, sure. ... Then there's cyberbusiness. We're promised instant catalog shopping—just point and click for great deals. We'll order airline tickets over the network, make restaurant reservations and negotiate sales contracts. Stores will become obsolete. So how come my local mall does more business in an afternoon than the entire Internet handles in a month?</p> </blockquote> <p>If you do a little search and replace, Stoll is saying the same thing Brooks did. Sure, technologies changed things in the past, but I can't imagine how new technologies would change things, so they simply won't.</p> <p>Even without knowing any specifics about programming, we would be able to see that these kinds of arguments have not historically held up and have decent confidence that the elders are not, in fact, correct this time.</p> <p>Brooks kept writing about software for quite a while after he was a practitioner, but didn't bother to keep up with what was happening in industry after moving into academia in 1964, which is already obvious from the 1986 essay we looked at, but even more obvious if you look at his 2010 book, Design of Design, where he relies on <a href="https://www.patreon.com/posts/46629220">the same examples he relied on in earlier essays and books</a>, where the bulk of his new material comes from a house that he built. We've seen that <a href="cocktail-ideas/">programmers who try to generalize their knowledge to civil engineering generally make silly statements that any 2nd year civil engineering student can observe are false</a>, and it turns out that trying to glean deep insights about software engineering design techniques from house building techniques doesn't work any better, but since Brooks didn't keep up with the industry, that's what he had to offer. While there are timeless insights that transcend era and industry, Brooks has very specific suggestions, e.g., running software teams like <a href="cocktail-ideas/">cocktail party</a> surgical teams, which come from thinking about how one could improve on the development practices Brooks saw at IBM in the 50s. But it turns out the industry has moved well beyond IBM's 1950s software practices, and ideas that are improvements over what IBM did in the 1950s aren't particularly useful 70 years later.</p> <p>Going back to the main topic of this post and looking at the specifics of what he talks about with respect to accidental complexity with the benefit of hindsight, we can see that Brooks' 1986 claim that we've basically captured all the productivity gains high-level languages can provide isn't too different from an assembly language programmer saying the same thing in 1955, thinking that assembly is as good as any language can be<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">4</a></sup> and that his claims about other categories are similar. 
The main thing these claims demonstrate is a lack of imagination. When Brooks referred to conceptual complexity, he was referring to the complexity of using the conceptual building blocks that Brooks was familiar with in 1986 (on problems that Brooks would've thought of as programming problems). There's no reason anyone should think that Brooks' 1986 conception of programming is fundamental any more than they should think that how an assembly programmer from 1955 thought was fundamental. People often make fun of the apocryphal &quot;640k should be enough for anybody&quot; quote, but Brooks saying that, across all categories of potential productivity improvement, we've done most of what's possible to do, is analogous and not apocryphal!</p> <p>If we look at the future, the fraction of complexity that might be accidental is effectively unbounded. One might argue that, if we look at the present, these terms wouldn't be meaningless. But, while this will vary by domain, I've personally never worked on a non-trivial problem that isn't completely dominated by accidental complexity, making the concept of essential complexity meaningless on any problem I've worked on that's worth discussing.</p> <h3 id="appendix-concrete-problems">Appendix: concrete problems</h3> <p>Let's see how this essential complexity claim holds for a couple of things I did recently at work:</p> <ul> <li>scp from a bunch of hosts to read and download logs, and then parse the logs to understand the scope of a problem</li> <li>Query two years of metrics data from every instance of every piece of software my employer has, for some classes of software, and then generate a variety of plots that let me understand some questions I have about what our software is doing and how it's using computer resources</li> </ul> <h4 id="logs">Logs</h4> <p>If we break this task down, we have</p> <ul> <li>scp logs from a few hundred thousand machines to a local box <ul> <li>used a Python script for this to get parallelism with more robust error handling than you'd get out of pssh/parallel-scp (a rough sketch of this kind of script is below)</li> <li>~1 minute to write the script</li> </ul></li> <li>do other work while logs download</li> <li>parse downloaded logs (a few TB) <ul> <li>used a Rust script for this, a few minutes to write (used Rust instead of Python for performance reasons here — just opening the logs and scanning each line with idiomatic Python was already slower than I'd want if I didn't want to farm the task out to multiple machines)</li> </ul></li> </ul> <p>In 1986, perhaps I would have used telnet or ftp instead of scp. Modern scripting languages didn't exist yet (perl was created in 1987 and perl5, the first version that some argue is modern, was released in 1994), so writing code that would do this with parallelism and &quot;good enough&quot; error handling would have taken more than an order of magnitude more time than it takes today. In fact, I think just getting semi-decent error handling while managing a connection pool could have easily taken an order of magnitude longer than this entire task took me (not including time spent downloading logs in the background).</p> <p>Next up would be parsing the logs. It's not fair to compare an absolute number like &quot;1 TB&quot;, so let's just call this &quot;enough that we care about performance&quot; (we'll talk about scale in more detail in the metrics example). 
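</p> <p>As an aside, to make the first bullet in the breakdown above concrete, here's a minimal sketch of the kind of throwaway parallel download script I mean. This is not the script I actually used; the host list, log path, worker count, and retry policy are all made up for illustration. It's just a thread pool around <code>scp</code> with per-host retries and a report of which hosts never succeeded:</p> <pre><code>#!/usr/bin/env python3
# Hypothetical parallel log downloader: pull one log file from many hosts,
# retrying failures and reporting hosts that never succeeded.
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

HOSTS_FILE = 'hosts.txt'                 # one hostname per line (made up)
REMOTE_PATH = '/var/log/myservice.log'   # made-up log location
LOCAL_DIR = 'logs'
RETRIES = 3

def fetch(host):
    # Copy the remote log to logs/HOSTNAME.log, retrying a few times.
    dest = os.path.join(LOCAL_DIR, host + '.log')
    for _ in range(RETRIES):
        result = subprocess.run(
            ['scp', '-o', 'ConnectTimeout=10', f'{host}:{REMOTE_PATH}', dest],
            capture_output=True,
        )
        if result.returncode == 0:
            return host, True
    return host, False

def main():
    os.makedirs(LOCAL_DIR, exist_ok=True)
    hosts = [line.strip() for line in open(HOSTS_FILE) if line.strip()]
    failed = []
    # scp is network-bound rather than CPU-bound, so a couple hundred
    # threads is a reasonable amount of parallelism for this.
    with ThreadPoolExecutor(max_workers=200) as pool:
        futures = [pool.submit(fetch, host) for host in hosts]
        for future in as_completed(futures):
            host, ok = future.result()
            if not ok:
                failed.append(host)
    print(f'{len(failed)} of {len(hosts)} hosts failed', file=sys.stderr)
    for host in failed:
        print(host, file=sys.stderr)

if __name__ == '__main__':
    main()
</code></pre> <p>Nothing about this is clever, and that's the point: with a modern scripting language and its standard library, &quot;parallelism with decent error handling&quot; is a few minutes of work rather than a project.</p> <p>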
Today, we have our choice of high-performance languages where it's easy to write fast, safe code and harness the power of libraries (e.g., a regexp library<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">5</a></sup>) that make it easy to write a quick and dirty script to parse and classify logs, farming out the work to all of the cores on my computer (I think Zig would've also made this easy, but I used Rust because my team has a critical mass of Rust programmers).</p> <p>In 1986, there would have been no comparable language, but more importantly, I wouldn't have been able to trivially find, download, and compile the appropriate libraries and would've had to write all of the parsing code by hand, turning a task that took a few minutes into a task that I'd be lucky to get done in an hour. Also, today, if I didn't know how to use the library or didn't know that I could use a library, I could easily find out how I should solve the problem on StackOverflow, which would massively reduce accidental complexity. Needless to say, there was no real equivalent to Googling for StackOverflow solutions in 1986.</p> <p>Moreover, even today, this task, a pretty standard programmer devops/SRE task, after at least an order of magnitude speedup over the analogous task in 1986, is still nearly entirely accidental complexity.</p> <p>If the data were exported into our metrics stack or if our centralized logging worked a bit differently, the entire task would be trivial. And if neither of those were true, but the log format were more uniform, I wouldn't have had to write any code after getting the logs; <a href="https://github.com/BurntSushi/ripgrep">rg</a> or <a href="https://github.com/ggreer/the_silver_searcher">ag</a> would have been sufficient. If I look for how much time I spent on the essential conceptual core of the task, it's so small that it's hard to estimate.</p> <h4 id="query-metrics">Query metrics</h4> <p>We really only need one counter-example, but I think it's illustrative to look at a more complex task to see how Brooks' argument scales. 
If you'd like to skip this lengthy example, <a href="#summary">click here to skip to the next section</a>.</p> <p>We can view my metrics querying task as being made up of the following sub-tasks:</p> <ul> <li>Write a set of <a href="https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)">Presto SQL</a> queries that effectively scan on the order of 100 TB of data each, from a data set that would be on the order of 100 PB of data if I didn't <a href="metrics-analytics/">maintain tables that only contain a subset of data that's relevant</a> <ul> <li>Maybe 30 seconds to write the first query and a few minutes for queries to finish, using on the order of 1 CPU-year of CPU time</li> </ul></li> <li>Write some ggplot code to plot the various properties that I'm curious about <ul> <li>Not sure how long this took; less time than the queries took to complete, so this didn't add to the total time of this task</li> </ul></li> </ul> <p>The first of these tasks is so many orders of magnitude quicker to accomplish today that I'm not even able to hazard a guess as to how much quicker it is today, even to within one or two orders of magnitude, but let's break down the first task into component parts to get some idea about the ways in which the task has gotten easier.</p> <p>It's not fair to port absolute numbers like 100 PB into 1986, but just the idea of having a pipeline that collects and persists comprehensive data analogous to the data I was looking at for a consumer software company (various data on the resource usage and efficiency of our software) would have been considered absurd in 1986. Here we see one fatal flaw in the idea that the ratio of essential to accidental complexity provides an upper bound on productivity improvements: tasks with too much accidental complexity wouldn't have even been considered possible. The limit on how much accidental complexity Brooks sees is really a limit of his imagination, not something fundamental.</p> <p>Brooks explicitly dismisses increased computational power as something that will not improve productivity (&quot;Well, how many MIPS can one use fruitfully?&quot;, more on this later), but both storage and CPU power (not to mention network speed and RAM) were sources of accidental complexity so large that they bounded the space of problems Brooks was able to conceive of.</p> <p>In this example, let's say that we somehow had enough storage to keep the data we want to query in 1986. The next part would be to marshal on the order of 1 CPU-year's worth of resources and have the query complete in minutes. As with the storage problem, this would have also been absurd in 1986<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">6</a></sup>, so we've run into a second piece of non-essential complexity so large that it would stop a person from 1986 from thinking of this problem at all.</p> <p>Next up would be writing the query. If I were writing for the Cray-2 and wanted to be productive, I probably would have written the queries in Cray's dialect of Fortran 77. Could I do that in less than 300 seconds per query? Not a chance; I couldn't even come close with Scala/Scalding and I think it would be a near thing even with Python/PySpark. This is the aspect where I think we see the smallest gain and we're still well above one order of magnitude here.</p> <p>After we have the data processed, we have to generate the plots. Even with today's technology, I think not using ggplot would cost me at least 2x in terms of productivity. 
I've tried every major plotting library that's supposedly equivalent (in any language) and every library I've tried either has multiple show-stopping bugs rendering plots that I consider to be basic in ggplot or is so low-level that I lose more than 2x productivity by being forced to do stuff manually that would be trivial in ggplot. In 2020, the existence of a single library already saves me 2x on this one step. If we go back to 1986, before the concept of <a href="https://amzn.to/3r9Mvzw">the grammar of graphics</a> and any reasonable implementation, there's no way that I wouldn't lose at least two orders of magnitude of time on plotting even assuming some magical workstation hardware that was capable of doing the plotting operations I do in a reasonable amount of time (my machine is painfully slow at rendering the plots; a Cray-2 would not be able to do the rendering in anything resembling a reasonable timeframe).</p> <p>The number of orders of magnitude of accidental complexity reduction for this problem from 1986 to today is so large I can't even estimate it and yet this problem still contains such a large fraction of accidental complexity that it's once again difficult to even guess at what fraction of complexity is essential. To write down all of the accidental complexity I can think of would require at least 20k words, but just to provide a bit of the flavor of the complexity, let me write down a few things.</p> <ul> <li>SQL; this is one of those things that's superficially simple <a href="https://scattered-thoughts.net/writing/select-wat-from-sql/">but actually extremely complex</a> <ul> <li>Also, Presto SQL</li> </ul></li> <li>Arbitrary Presto limits, some of which are from Presto and some of which are from the specific ways we operate Presto and the version we're using <ul> <li>There's an internal Presto data structure assert fail that gets triggered when I use both <code>numeric_histogram</code> and <code>cross join unnest</code> in a particular way. Because it's a waste of time to write the bug-exposing query, wait for it to fail, and then re-write it, I have a mental heuristic I use to guess, for any query that uses both constructs, whether or not I'll hit the bug and I apply it to avoid having to write two queries. 
If the heuristic applies, I'll instead write a more verbose query that's slower to execute instead of the more straightforward query</li> <li>We partition data by date, but Presto throws this away when I join tables, resulting in very large and therefore expensive joins when I join data across a long period of time even though, in principle, this could be a series of cheap joins; if the join is large enough to cause my query to blow up, I'll write what's essentially a little query compiler to execute day-by-day queries and then post-process the data as necessary instead of writing the naive query (a rough sketch of what I mean appears below) <ul> <li>There are a bunch of cases where some kind of optimization in the query will make the query feasible without having to break the query across days (e.g., if I want to join host-level metrics data with the table that contains what cluster a host is in, that's a very slow join across years of data, but I also know what kinds of hosts are in which clusters, which, in some cases, lets me use attributes that are already in the host-level metrics data, like core count and total memory, to filter out hosts, which can make the larger input to this join small enough that the query can succeed without manually partitioning the query)</li> </ul></li> <li>We have a Presto cluster that's &quot;fast&quot; but has &quot;low&quot; memory limits and a cluster that's &quot;slow&quot; but has &quot;high&quot; memory limits, so I mentally estimate how much per-node memory a query will need so that I can schedule it to the right cluster</li> <li>etc.</li> </ul></li> <li>When, for performance reasons, I should compute the CDF or histogram in Presto vs. leaving it to the end for ggplot to compute</li> <li>How much I need to downsample the data, if at all, for ggplot to be able to handle it, and how that may impact analyses</li> <li>Arbitrary ggplot stuff <ul> <li>roughly how many points I need to put in a scatterplot before I should stop using <code>size = [number]</code> and should switch to single-pixel plotting because plotting points as circles is too slow</li> <li>what the minimum allowable opacity for points is</li> <li>If I exceed the maximum density where you can see a gradient in a scatterplot due to this limit, how large I need to make the image to reduce the density appropriately (when I would do this instead of using a heatmap deserves its own post)</li> <li>etc.</li> </ul></li> <li>All of the above is about tools that I use to write and examine queries, but there's also the mental model of all of the data issues that must be taken into account when writing the query in order to generate a valid result, which includes things like clock skew, Linux accounting bugs, issues with our metrics pipeline, issues with data due to problems in the underlying data sources, etc.</li> <li>etc.</li> </ul> <p>For each of Presto and ggplot I implicitly hold over a hundred things in my head to be able to get my queries and plots to work and I choose to use these because these are the lowest overhead tools that I know of that are available to me. If someone asked me to name the percentage of complexity I had to deal with that was essential, I'd say that it was so low that there's no way to even estimate it. For some queries, it's arguably zero — my work was necessary only because of some arbitrary quirk and there would be no work to do without the quirk. 
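</p> <p>To make the &quot;little query compiler&quot; workaround from the list above concrete, here's a minimal sketch of the idea in Python. The table names, columns, and date range are hypothetical, <code>run_query</code> is a stub for whatever Presto client you happen to use, and the real version has to do more post-processing; the point is just that working around the partitioning limitation means generating and merging many small per-day queries instead of writing the one query you'd naturally want to write:</p> <pre><code># Hypothetical day-by-day query splitter: instead of one join across the
# whole date range (which blows past per-node memory limits), issue one
# cheap query per day and combine the results afterwards.
from datetime import date, timedelta

# Made-up tables/columns; approx_percentile is a standard Presto function.
QUERY_TEMPLATE = '''
SELECT cluster, approx_percentile(cpu_util, 0.99) AS p99_cpu
FROM host_metrics
JOIN host_to_cluster USING (hostname)
WHERE ds = '{day}'
GROUP BY cluster
'''

def run_query(sql):
    # Stub: send sql to Presto with your client of choice and return rows
    # as (cluster, p99_cpu) tuples.
    raise NotImplementedError

def days_between(start, end):
    # Inclusive list of days from start to end.
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]

def p99_cpu_by_cluster(start, end):
    # One small query per day, merged in Python; here the post-processing
    # step just keeps the worst day seen for each cluster.
    worst = {}
    for day in days_between(start, end):
        for cluster, p99 in run_query(QUERY_TEMPLATE.format(day=day.isoformat())):
            worst[cluster] = max(worst.get(cluster, 0.0), p99)
    return worst

if __name__ == '__main__':
    print(p99_cpu_by_cluster(date(2018, 1, 1), date(2019, 12, 31)))
</code></pre> <p>None of this is conceptually hard, but it's pure accidental complexity: if Presto kept the date partitioning through the join, the whole thing would collapse back into a single query.</p> <p>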
But even in cases where some kind of query seems necessary, I think it's unbelievable that essential complexity could have been more than 1% of the complexity I had to deal with.</p> <p>Revisiting Brooks on computer performance, even though I deal with complexity due to the limitations of hardware performance in 2020 and would love to have faster computers today, Brooks wrote off faster hardware as pretty much not improving developer productivity in 1986:</p> <blockquote> <p>What gains are to be expected for the software art from the certain and rapid increase in the power and memory capacity of the individual workstation? Well, how many MIPS can one use fruitfully? The composition and editing of programs and documents is fully supported by today’s speeds. Compiling could stand a boost, but a factor of 10 in machine speed would surely . . .</p> </blockquote> <p>But this is wrong on at least two levels. First, if I had access to faster computers, a huge amount of my accidental complexity would go away (if computers were powerful enough, I wouldn't need complex tools like Presto; I could just run a query on my local computer). We have much faster computers now, but it's still true that having faster computers would make many involved engineering tasks trivial. As James Hague notes, in the mid-80s, <a href="https://prog21.dadgum.com/29.html">writing a spellchecker was a serious engineering problem due to performance constraints</a>.</p> <p>Second, ggplot (just for example) only exists because computers are so fast. A common complaint from people who work on performance is that tool X has somewhere between two and ten orders of magnitude of inefficiency when you look at the fundamental operations it does vs. the speed of hardware today<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">7</a></sup>. But what fraction of programmers can realize even one half of the potential performance of a modern multi-socket machine? I would guess fewer than one in a thousand and I would say certainly fewer than one in a hundred. And performance knowledge isn't independent of other knowledge — controlling for age and experience, it's negatively correlated with knowledge of non-&quot;systems&quot; domains since time spent learning about the esoteric accidental complexity necessary to realize half of the potential of a computer is time spent not learning about &quot;directly&quot; applicable domain knowledge. When we look at software that requires a significant amount of domain knowledge (e.g., ggplot) or that's large enough that it requires a large team to implement (e.g., IntelliJ<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">8</a></sup>), the vast majority of it wouldn't exist if machines were orders of magnitude slower and writing usable software required wringing most of the performance out of the machine. Luckily for us, hardware has gotten much faster, allowing the vast majority of developers to ignore performance-related accidental complexity and instead focus on all of the other accidental complexity necessary to be productive today.</p> <p>Faster computers both reduce the amount of accidental complexity tool users run into as well as the amount of accidental complexity that tool creators need to deal with, allowing more productive tools to come into existence.</p> <h3 id="2022-update">2022 Update</h3> <p>A lot of people have said that this post is wrong because Brooks was obviously saying X and Brooks did not mean the things I quoted in this post. 
But people state all sorts of different Xs for what Brooks really meant, so, in aggregate, these counterarguments are self-refuting: each assumes that Brooks &quot;obviously&quot; meant one specific thing, but, if it were so obvious, people wouldn't have so many different ideas of what Brooks meant.</p> <p>This is, of course, inevitable when it comes to a Rorschach test essay like Brooks's essay, which states a wide variety of different and contradictory things.</p> <p>Thanks to Peter Bhat Harkins, Ben Kuhn, Yuri Vishnevsky, Chris Granger, Wesley Aptekar-Cassels, Sophia Wisdom, Lifan Zeng, Scott Wolchok, Martin Horenovsky, @realcmb, Kevin Burke, Aaron Brown, @up_lurk, and Saul Pwanson for comments/corrections/discussion.</p> <p><link rel="prefetch" href="cli-complexity/"> <link rel="prefetch" href="metrics-analytics/"> <link rel="prefetch" href=""> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:C"><blockquote> <p>The accidents I discuss in the next section. First let us consider the essence</p> <p>The essence of a software entity is a construct of interlocking concepts: data sets, relationships among data items, algorithms, and invocations of functions. This essence is abstract, in that the conceptual construct is the same under many different representations. It is nonetheless highly precise and richly detailed.</p> <p>I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation. We still make syntax errors, to be sure; but they are fuzz compared to the conceptual errors in most systems.</p> </blockquote> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:T"><p>Curiously, he also claims, in the same essay, that no individual improvement can yield a 10x improvement within one decade. While this technically doesn't contradict his Amdahl's law argument plus the claim that &quot;most&quot; (i.e., at least half) of complexity is essential/conceptual, it's unclear why he would include this claim as well.</p> <p>When Brooks revisited his essay in 1995 in No Silver Bullet Refired, he claimed that he was correct by using the weakest form of the three claims he made in 1986, that within one decade, no single improvement would result in an order of magnitude improvement. However, he did then re-state the strongest form of the claim he made in 1986 and made it again in 1995, saying that this time, no set of technological improvements could improve productivity more than 2x, for real:</p> <blockquote> <p>It is my opinion, and that is all, that the accidental or representational part of the work is now down to about half or less of the total. Since this fraction is a question of fact, its value could in principle be settled by measurement. Failing that, my estimate of it can be corrected by better informed and more current estimates. Significantly, no one who has written publicly or privately has asserted that the accidental part is as large as 9/10.</p> </blockquote> <p>By the way, I find it interesting that he says that no one disputed this 9/10ths figure. Per the body of this post, I would put it at far above 9/10th for my day-to-day work and, if I were to try to solve the same problems in 1986, the fraction would have been so high that people wouldn't have even conceived of the problem. 
As a side effect of having worked in hardware for a decade, I've also done work that's not too different from what some people faced in 1986 (microcode, assembly &amp; C written for DOS) and I would put that work as easily above 9/10th as well.</p> <p>Another part of his follow-up that I find interesting is that he quotes Harel's &quot;Biting the Silver Bullet&quot; from 1992, which, among other things, argues that that decade deadline for an order of magnitude improvement is arbitrary. Brooks' response to this is</p> <blockquote> <p>There are other reasons for the decade limit: the claims made for candidate bullets all have had a certain immediacy about them . . . We will surely make substantial progress over the next 40 years; an order of magnitude over 40 years is hardly magical.</p> </blockquote> <p>But by Brooks' own words when he revisits the argument in 1995, if 9/10th of complexity is essential, it would be impossible to get more than an order of magnitude improvement from reducing it, with no caveat on the timespan:</p> <blockquote> <p>&quot;NSB&quot; argues, indisputably, that if the accidental part of the work is less than 9/10 of the total, shrinking it to zero (which would take magic) will not give an order of magnitude productivity improvement.</p> </blockquote> <p>Both his original essay and the 1995 follow-up are charismatically written and contain a sort of local logic, where each piece of the essay sounds somewhat reasonable if you don't think about it too hard and you forget everything else the essay says. As with the original, a pedant could argue that this is technically not incoherent — after all, Brooks could be saying:</p> <ul> <li>at most 9/10th of complexity is accidental (if we ignore the later 1/2 claim, which is the kind of suspension of memory/disbelief one must do to read the essay)</li> <li>it would not be surprising for us to eliminate 100% of accidental complexity after 40 years</li> </ul> <p>While this is technically consistent (again, if we ignore the part that's inconsistent) and is a set of claims one could make, this would imply that 40 years from 1986, i.e., in 2026, it wouldn't be implausible for there to be literally zero room for any sort of productivity improvement from tooling, languages, or any other potential source of improvement. But this is absurd. If we look at other sections of Brooks' essay and combine their reasoning, we see other inconsistencies and absurdities.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:L"><p>Another issue that we see here is Brooks' insistence on bright-line distinctions between categories. Essential vs. accidental complexity. &quot;Types&quot; of solutions, such as languages vs. &quot;build vs. buy&quot;, etc.</p> <p>Brooks admits that &quot;build vs. buy&quot; is one avenue of attack on essential complexity. Perhaps he would agree that buying a regexp package would reduce the essential complexity since that would allow me to avoid keeping all of the concepts associated with writing a parser in my head for simple tasks. But what if, instead of buying regexes, I used a language where they're bundled into the standard library or is otherwise distributed with the language? Or what if, instead of having to write my own concurrency primitives, those are bundled into the language? Or for that matter, what about <a href="https://golang.org/pkg/net/http/">an entire HTTP server</a>? 
There is no bright-line distinction between what's in a library one can &quot;buy&quot; (for free in many cases nowadays) and one that's bundled into the language, so there cannot be a bright-line distinction between what gains a language provides and what gains can be &quot;bought&quot;. But if there's no bright-line distinction here, then it's not possible to say that one of these can reduce essential complexity and the other can't and maintain a bright-line distinction between essential and accidental complexity (in a response to Brooks, Harel argued against there being a clear distinction, and Brooks' reply was to say that there is, in fact, a bright-line distinction, although he provided no new argument).</p> <p>Brooks' repeated insistence on these false distinctions means that the reasoning in the essay isn't composable. As we've already seen in another footnote, if you take reasoning from one part of the essay and apply it alongside reasoning from another part of the essay, it's easy to create absurd outcomes and sometimes outright contradictions.</p> <p>I suspect this is one reason discussions about essential vs. accidental complexity are so muddled. It's not just that <a href="https://twitter.com/hillelogram/status/1211433465956196352">Brooks is being vague and handwave-y</a>; he's actually not self-consistent, so there isn't and cannot be a coherent takeaway. Michael Feathers has noted <a href="https://twitter.com/mfeathers/status/1259295515532865543">that people are generally not able to correctly identify essential complexity</a>; as he says, <a href="https://twitter.com/mfeathers/status/1256995176959971329">One person’s essential complexity is another person’s accidental complexity.</a> This is exactly what we should expect from the essay, since people who have different parts of it in mind will end up with incompatible views.</p> <p>This is also a problem when criticizing Brooks. Inevitably, someone will say that what Brooks really meant was something completely different. And that will be true. But Brooks will have meant something completely different while also having meant the things he said that I mention. In defense of the view I'm presenting in the body of the text here, it's a coherent view that one could have had in 1986. Many of Brooks' statements don't make sense even when considered as standalone statements, let alone when cross-referenced with the rest of his essay. For example, the statement that no single development will result in an order of magnitude improvement in the next decade. This statement is meaningless, as Brooks does not define, and no one can definitively say, what a &quot;single improvement&quot; is. And, as mentioned above, Brooks' essay reads quite oddly and basically does not make sense if that's what he's trying to claim. Another issue with most other readings of Brooks is that those are positions that would also be meaningless even if Brooks had done the work to make them well defined. Why does it matter if one single improvement or two result in an order of magnitude improvement? If it's two improvements, we'll use them both.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:A"><p>And by the way, this didn't only happen in 1955. I've worked with people who, this century, told me that assembly is basically as productive as any high level language.
This probably sounds ridiculous to almost every reader of this blog, but if you talk to people who spend all day writing microcode or assembly, you'll occasionally meet somebody who believes this.</p> <p>Thinking that the tools you personally use are as good as it gets is an easy trap to fall into.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:B">Another quirk is that, while Brooks acknowledges that code re-use and libraries can increase productivity, he claims that gains from languages and tools are pretty much over, but these claims can't both hold because there isn't a bright-line distinction between libraries and languages/tools. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:F"><p>Let's arbitrarily use a Motorola 68k processor with an FP co-processor that could do 200 kFLOPS as a reference for how much power we might have in a consumer CPU (FLOPS is a bad metric for multiple reasons, but this is just to get an idea of what it would take to get 1 CPU-year of computational resources, and Brooks himself uses MIPS as a term as if it's meaningful). By comparison, the Cray-2 could achieve 1.9 GFLOPS, or roughly 10000x the performance (I think actually less if we were to do a comparable comparison instead of using non-comparable GFLOPS numbers, but let's be generous here). There are 525600 / 5 = 105120 five minute periods in a year, so to get 1 CPU year's worth of computation in five minutes we'd need 105120 / 10000 =~ 10 Cray-2s per query, not including the overhead of aggregating results across Cray-2s.</p> <p>It's unreasonable to think that a consumer software company in 1986 would have enough Cray-2s lying around to allow for any random programmer to quickly run CPU years worth of queries whenever they wanted to do some data analysis. One source claims that only 27 Cray-2s were made over the production lifetime of the machine (1985 to 1990). Even if my employer owned all of them and they were all created by 1986, that still wouldn't be sufficient to allow the kind of ad hoc querying capacity that I have access to in 2020.</p> <p>Today, someone at a startup can even make an analogous argument when comparing to a decade ago. You used to have to operate a cluster that would be prohibitively annoying for a startup to operate unless the startup is very specialized, but you can now just use Snowflake and basically get Presto but only pay for the computational power you use (plus a healthy markup) instead of paying to own a cluster and for all of the employees necessary to make sure the cluster is operable.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:O">I actually run into one of these every time I publish a new post. I write my posts in Google docs and then copy them into emacs running inside tmux running inside Alacritty. My posts are small enough to fit inside L2 cache, so I could have 64B/3.5 cycle write bandwidth. And yet, the copy+paste operation can take ~1 minute and is so slow I can watch the text get pasted in. Since my chip is working super hard to make sure the copy+paste happens, it's running at its full non-turbo frequency of 4.2GHz, giving it 76.8GB/s of write bandwidth. For a 40kB post, 1 minute = 666B/s. 76.8G/666 =~ 8 orders of magnitude left on the table.
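<p><small>A minimal Python sketch of the back-of-the-envelope arithmetic above, assuming the same numbers used in this footnote (a 4.2GHz non-turbo clock, 64 bytes per 3.5 cycles of write bandwidth, and a 40kB post pasted over roughly a minute):</small></p> <pre><code>import math

# Assumed numbers from this footnote (not measured here).
clock_hz = 4.2e9                            # non-turbo clock frequency
bytes_per_cycle = 64 / 3.5                  # L2 write bandwidth per cycle
peak_write_bw = clock_hz * bytes_per_cycle  # ~7.68e10 B/s, i.e. ~76.8 GB/s

post_bytes = 40e3                           # ~40kB post
paste_seconds = 60                          # ~1 minute to paste
observed_bw = post_bytes / paste_seconds    # ~666 B/s

gap = peak_write_bw / observed_bw           # ~1.2e8
# prints peak B/s, observed B/s, and the gap in orders of magnitude (~8.1)
print(round(peak_write_bw), round(observed_bw), round(math.log10(gap), 1))
</code></pre>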
<a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:V">In this specific case, I'm sure somebody will argue that Visual Studio was quite nice in 2000 and ran on much slower computers (and the debugger was arguably better than it is in the current version). But there was no comparable tool on Linux, nor was there anything comparable to today's options in the VSCode-like space of easy-to-learn programming editor that provides programming-specific facilities (as opposed to being a souped up version of notepad) without being a full-fledged IDE. <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> How do cars do in out-of-sample crash testing? car-safety/ Tue, 30 Jun 2020 00:06:34 -0700 car-safety/ <p>Any time you have a benchmark that gets taken seriously, some people will start gaming the benchmark. Some famous examples in computing are the CPU benchmark <a href="https://spec.org/benchmarks.html">specfp</a> and video game benchmarks. With specfp, Sun managed to increase its score on <a href="https://www.spec.org/osg/cpu2000/CFP2000/179.art/docs/179.art.html">179.art</a> (a sub-benchmark of specfp) by 12x with a compiler tweak that essentially re-wrote the benchmark kernel, which increased the Sun <a href="https://en.wikipedia.org/wiki/UltraSPARC_III">UltraSPARC</a>’s overall specfp score by 20%. At times, GPU vendors have added specialized benchmark-detecting code to their drivers that lowers image quality during benchmarking to produce higher benchmark scores. Of course, gaming the benchmark isn't unique to computing and we see people do this <a href="discontinuities/">in other fields</a>. It’s not surprising that we see this kind of behavior since improving benchmark scores by cheating on benchmarks is much cheaper (and therefore higher ROI) than improving benchmark scores by actually improving the product.</p> <p>As a result, I'm generally suspicious when people take highly specific and well-known benchmarks too seriously. Without other data, you don't know what happens when conditions aren't identical to the conditions in the benchmark. With GPU and CPU benchmarks, it’s possible for most people to run the standard benchmarks with slightly tweaked conditions. If the results change dramatically for small changes to the conditions, that’s evidence that the vendor is, if not cheating, at least shading the truth.</p> <p>Benchmarks of physical devices can be more difficult to reproduce. Vehicle crash tests are a prime example of this -- they're highly specific and well-known benchmarks that use up a car for some test runs.</p> <p>While there are multiple organizations that do crash tests, they each have particular protocols that they follow. Car manufacturers, if so inclined, could optimize their cars for crash test scores instead of actual safety. Checking to see if crash tests are being gamed with hyper-specific optimizations isn't really feasible for someone who isn't a billionaire. The easiest way we can check is by looking at what happens when new tests are added since that lets us see a crash test result that manufacturers weren't optimizing for just to get a good score.</p> <p>While having car crash test results is obviously better than not having them, the results themselves don't tell us what happens when we get into an accident that doesn't exactly match a benchmark. 
Unfortunately, if we get into a car accident, we don't get to ask the driver of the vehicle we're colliding with to change their location, angle of impact, and speed, in order for the collision to comply with an <a href="https://en.wikipedia.org/wiki/Insurance_Institute_for_Highway_Safety">IIHS</a>, <a href="https://en.wikipedia.org/wiki/National_Highway_Traffic_Safety_Administration">NHTSA</a>, or <a href="https://en.wikipedia.org/wiki/New_Car_Assessment_Program">*NCAP</a> test protocol.</p> <p>For this post, we're going to look at <a href="https://en.wikipedia.org/wiki/Insurance_Institute_for_Highway_Safety">IIHS</a> test scores when they added the (driver side) small overlap and passenger side small overlap tests, which were added in 2012 and 2018, respectively. We'll start with a summary of the results and then discuss what those results mean and other factors to consider when evaluating car safety, followed by details of the methodology.</p> <h3 id="results">Results</h3> <p>The ranking below is mainly based on how well vehicles scored when the driver-side small overlap test was added in 2012 and how well models scored when they were modified to improve test results.</p> <ul> <li><strong>Tier 1</strong>: good without modifications <ul> <li>Volvo</li> </ul></li> <li><strong>Tier 2</strong>: mediocre without modifications; good with modifications <ul> <li>None</li> </ul></li> <li><strong>Tier 3</strong>: poor without modifications; good with modifications <ul> <li>Mercedes</li> <li>BMW</li> </ul></li> <li><strong>Tier 4</strong>: poor without modifications; mediocre with modifications <ul> <li>Honda</li> <li>Toyota</li> <li>Subaru</li> <li>Chevrolet</li> <li>Tesla</li> <li>Ford</li> </ul></li> <li><strong>Tier 5</strong>: poor with modifications or modifications not made <ul> <li>Hyundai</li> <li>Dodge</li> <li>Nissan</li> <li>Jeep</li> <li>Volkswagen</li> </ul></li> </ul> <p>These descriptions are approximations. Honda, Ford, and Tesla are the poorest fits for these descriptions: Ford arguably sits halfway between Tier 4 and Tier 5, but is also arguably better than Tier 4 and doesn't fit the classification at all, and Honda and Tesla don't properly fit into any category (the listed category is the closest fit). Some other placements are also imperfect. Details below.</p> <h3 id="general-commentary">General commentary</h3> <p>If we look at overall mortality in the U.S., there's a pretty large age range for which car accidents are the leading cause of death. Although the numbers will vary depending on what data set we look at, when the driver-side small overlap test was added, the IIHS estimated that 25% of vehicle fatalities came from small overlap crashes. It's also worth noting that small overlap crashes were thought to be implicated in a significant fraction of vehicle fatalities at least since the 90s; this was not a novel concept in 2012.</p> <p>Despite the importance of small overlap crashes, from looking at the results when the IIHS added the driver-side and passenger-side small overlap tests in 2012 and 2018, it looks like almost all car manufacturers were optimizing for the benchmarks and not overall safety. Except for Volvo, all carmakers examined produced cars that fared poorly on driver-side small overlap crashes until the driver-side small overlap test was added.</p> <p>When the driver-side small overlap test was added in 2012, most manufacturers modified their vehicles to improve driver-side small overlap test scores.
However, until the IIHS added a passenger-side small overlap test in 2018, most manufacturers skimped on the passenger side. When the new test was added, they beefed up passenger safety as well. To be fair to car manufacturers, some of them got the hint about small overlap crashes when the driver-side test was added in 2012 and did not need to make further modifications to score well on the passenger-side test, including Mercedes, BMW, and Tesla (and arguably a couple of others, but the data is thinner in the other cases; Volvo didn't need a hint).</p> <h3 id="other-benchmark-limitations">Other benchmark limitations</h3> <p>There are a number of other areas where we can observe that most car makers are optimizing for benchmarks at the expense of safety.</p> <h4 id="gender-weight-and-height">Gender, weight, and height</h4> <p>Another issue is crash test dummy overfitting. For a long time, adult NHTSA and IIHS tests used a 1970s 50%-ile male dummy, which is 5'9&quot; and 171lbs. Regulators called for a female dummy in 1980 but, due to budget cutbacks during the Reagan era, initial plans were shelved and the NHTSA didn't put one in a car until 2003. The female dummy is a scaled down version of the male dummy, scaled down to 5%-ile 1970s height and weight (4'11&quot;, 108lbs; another model is 4'11&quot;, 97lbs). In frontal crash tests, when a female dummy is used, it's always a passenger (a 5%-ile woman is in the driver's seat in one NHTSA side crash test and the IIHS side crash test). For reference, in 2019, the average weight of a U.S. adult male was 198 lbs and the average weight of a U.S. adult female was 171 lbs.</p> <p>Using a 1970s U.S. adult male crash test dummy causes a degree of overfitting for 1970s 50%-ile men. For example, starting in the 90s, manufacturers started adding systems to protect against whiplash. Volvo and Toyota use a kind of system that reduces whiplash in men and women and appears to have slightly more benefit for women. Most car makers use a kind of system that reduces whiplash in men but, on average, has little impact on whiplash injuries in women.</p> <p>It appears that we also see a similar kind of optimization for crashes in general and not just whiplash. We don't have crash test data on this, and looking at real-world safety data is beyond the scope of this post, but I'll note that, until around the time the NHTSA put the 5%-ile female dummy into some crash tests, most car manufacturers not named Volvo had a significant fatality rate differential in side crashes based on gender (with men dying at a lower rate and women dying at a higher rate).</p> <p>Volvo claims to have been using computer models for decades to simulate what would happen if women (including pregnant women) are involved in a car accident.</p> <h4 id="other-crashes">Other crashes</h4> <p>Volvo is said to have a crash test facility where they do a number of other crash tests that aren't done by testing agencies.
A reason that they scored well on the small overlap tests when they were added is that they were already doing small overlap crash tests before the IIHS started doing small overlap crash tests.</p> <p>Volvo also says that they test rollovers (the IIHS tests roof strength and the NHTSA computes how difficult a car is to roll based on properties of the car, but neither tests what happens in a real rollover accident), rear collisions (Volvo claims these are especially important to test if there are children in the 3rd row of a 3-row SUV), and driving off the road (Volvo has a &quot;standard&quot; ditch they use; they claim this test is important because running off the road is implicated in a large fraction of vehicle fatalities).</p> <p>If other car makers do similar tests, I couldn't find out much about the details. Based on crash test scores, it seems like they weren't doing or even considering small overlap crash tests before 2012. Based on how many car makers had poor scores when the passenger side small overlap test was added in 2018, I think it would be surprising if other car makers had a large suite of crash tests they ran that aren't being run by testing agencies, but it's theoretically possible that they do and just didn't include a passenger side small overlap test.</p> <h3 id="caveats">Caveats</h3> <p>We shouldn't overgeneralize from these test results. As we noted above, crash test results test very specific conditions. As a result, what we can conclude when a couple of new crash tests are added is also very specific. Additionally, there are a number of other things we should keep in mind when interpreting these results.</p> <h4 id="limited-sample-size">Limited sample size</h4> <p>One limitation of this data is that we don't have results for a large number of copies of the same model, so we're unable to observe intra-model variation, which could occur due to minor, effectively random, differences in test conditions as well as manufacturing variations between different copies of the same model. We can observe that these do matter since some cars will see different results when two copies of the same model are tested. For example, here's a quote from the IIHS report on the Dodge Dart:</p> <blockquote> <p>The Dodge Dart was introduced in the 2013 model year. Two tests of the Dart were conducted because electrical power to the onboard (car interior) cameras was interrupted during the first test. In the second Dart test, the driver door opened when the hinges tore away from the door frame. In the first test, the hinges were severely damaged and the lower one tore away, but the door stayed shut. In each test, the Dart’s safety belt and front and side curtain airbags appeared to adequately protect the dummy’s head and upper body, and measures from the dummy showed little risk of head and chest injuries.</p> </blockquote> <p>It looks like, had electrical power to the interior car cameras not been disconnected, there would have been only one test and it wouldn't have become known that there's a risk of the door coming off due to the hinges tearing away. In general, we have no direct information on what would happen if another copy of the same model were tested.</p> <p>Using IIHS data alone, one thing we might do here is to also consider results from different models made by the same manufacturer (or built on the same platform).
Although this isn't as good as having multiple tests for the same model, test results between different models from the same manufacturer are correlated, and knowing that, for example, a 2nd test of a model that happened by chance showed significantly worse results should probably reduce our confidence in other test scores from the same manufacturer. There are some things that complicate this, e.g., when looking at Toyota, the Yaris is actually a re-branded Mazda2, so perhaps that shouldn't be considered as part of a pooled test result, and doing this kind of statistical analysis is beyond the scope of this post.</p> <h4 id="actual-vehicle-tested-may-be-different">Actual vehicle tested may be different</h4> <p>Although I don't think this should impact the results in this post, another issue to consider when looking at crash test results is how results are shared between models. As we just saw, different copies of the same model can have different results. Vehicles that are somewhat similar are often considered the same for crash test purposes and will share the same score (only one of the models will be tested).</p> <p>For example, this is true of the Kia Stinger and the Genesis G70. The Kia Stinger is 6&quot; longer than the G70 and a fully loaded AWD Stinger is about 500 lbs heavier than a base-model G70. The G70 is the model that IIHS tested -- if you look up a Kia Stinger, you'll get scores for a Stinger with a note that a base model G70 was tested. That's a pretty big difference considering that cars that are nominally identical (such as the Dodge Darts mentioned above) can get different scores.</p> <h4 id="quality-may-change-over-time">Quality may change over time</h4> <p>We should also be careful not to overgeneralize temporally. If we look at crash test scores of recent Volvos (vehicles on the Volvo P3 and Volvo SPA platforms), crash test scores are outstanding. However, if we look at Volvo models based on the older Ford C1 platform<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">1</a></sup>, crash test scores for some of these aren't as good (in particular, while the S40 doesn't score poorly, it scores Acceptable in some categories instead of Good across the board). Although Volvo has had stellar crash test scores recently, this doesn't mean that they have always had or will always have stellar crash test scores.</p> <h4 id="models-may-vary-across-markets">Models may vary across markets</h4> <p>We also can't generalize across cars sold in different markets, even for vehicles that sound like they might be identical. For example, see <a href="https://www.youtube.com/watch?v=UL_2MdSTM7g">this crash test of a Nissan NP300 manufactured for sale in Europe vs. a Nissan NP300 manufactured for sale in Africa</a>. Since European cars undergo EuroNCAP testing (similar to how U.S. cars undergo NHTSA and IIHS testing), vehicles sold in Europe are optimized to score well on EuroNCAP tests. Crash testing cars sold in Africa has only been done relatively recently, so car manufacturers haven't had PR pressure to optimize their cars for benchmarks and they'll produce cheaper models or cheaper variants of what superficially appear to be the same model. This appears to be no different from what most car manufacturers do in the U.S. or Europe -- they're optimizing for cost as long as they can do that without scoring poorly on benchmarks.
It's just that, since there wasn't an African crash test benchmark, that meant they could go all-in on the cost side of the cost-safety tradeoff<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>.</p> <p><a href="https://deepblue.lib.umich.edu/bitstream/handle/2027.42/112977/103199.pdf">This report</a> compared U.S. and European car models and found differences in safety due to differences in regulations. They found that European models had lower injury risk in frontal/side crashes and that driver-side mirrors were designed in a way that reduced the risk of lane-change crashes relative to U.S. designs and that U.S. vehicles were safer in rollovers and had headlamps that made pedestrians more visible.</p> <h4 id="non-crash-tests">Non-crash tests</h4> <p>Over time, more and more of the &quot;low hanging fruit&quot; from crash safety has been picked, making crash avoidance relatively more important. Tests of crash mitigation are relatively primitive compared to crash tests and we've seen that crash tests had and have major holes. One might expect, based on what we've seen with crash tests, that Volvo has a particularly good set of tests they use for their crash avoidance technology (traction control, stability control, automatic braking, etc.), but &quot;bar room&quot; discussion with folks who are familiar with what vehicle safety tests are being done on automated systems seems to indicate that's not the case. There was a relatively recent recall of quite a few Volvo vehicles due to the safety systems incorrectly not triggering. I'm not going to tell the story about that one here, but I'll say that it's fairly horrifying and indicative of serious systemic issues. From other backchannel discussions, it sounds like BMW is relatively serious about the software side of safety, for a car company, but the lack of rigor in this kind of testing would be horrifying to someone who's seen a release process for something like a mainstream CPU.</p> <p>Crash avoidance becoming more important might also favor companies that have more user-friendly driver assistance systems, e.g., in multiple generations of tests, Consumer Reports has given GM's Super Cruise system the highest rating while they've repeatedly noted that Tesla's Autopilot system facilitates unsafe behavior.</p> <h4 id="scores-of-vehicles-of-different-weights-aren-t-comparable">Scores of vehicles of different weights aren't comparable</h4> <p>A 2700lb subcompact vehicle that scores Good may fare worse than a 5000lb SUV that scores Acceptable. This is because the small overlap tests involve driving the vehicle into a fixed obstacle, as opposed to a reference vehicle or vehicle-like obstacle of a specific weight. This is, in some sense, equivalent to crashing the vehicle into a vehicle of the same weight, so it's as if the 2700lb subcompact was tested by running it into a 2700lb subcompact and the 5000lb SUV was tested by running it into another 5000 lb SUV.</p> <h4 id="how-to-increase-confidence">How to increase confidence</h4> <p>We've discussed some reasons we should reduce our confidence in crash test scores. If we wanted to increase our confidence in results, we could look at test results from other test agencies and aggregate them and also look at public crash fatality data (more on this later). 
I haven't looked at the terms and conditions of scores from other agencies, but one complication is that the IIHS does not allow you to display the result of any kind of aggregation if you use their API or data dumps (I, time consumingly, did not use their API for this post because of that).</p> <h4 id="using-real-life-crash-data">Using real life crash data</h4> <p>Public crash fatality data is complex and deserves its own post. In this post, I'll note that, if you look at the easiest relevant data for people in the U.S., this data does not show that Volvos are particularly safe (or unsafe). For example, if we look at <a href="https://www.iihs.org/api/datastoredocument/status-report/pdf/52/3">this report from 2017, which covers models from 2014</a>, two Volvo models made it into the report and both score roughly middle of the pack for their class. In the previous report, one Volvo model is included and it's among the best in its class; in the next, one Volvo model is included and it's among the worst in its class. We can observe this kind of variance for other models, as well. For example, among 2014 models, the Volkswagen Golf had one of the highest fatality rates for all vehicles (not just in its class). But among 2017 vehicles, it had among the lowest fatality rates for all vehicles. It's unclear how much of that change is from random variation and how much is because of differences between a 2014 and 2017 Volkswagen Golf.</p> <p>Overall, it seems like noise is a pretty important factor in results. And if we look at the information that's provided, we can see a few things that are odd. First, there are a number of vehicles where the 95% confidence interval for the fatality rate runs from 0 to N. We should have pretty strong priors that there was no 2014 model vehicle that was so safe that the probability of being killed in a car accident was zero. If we were taking a Bayesian approach (though I believe the authors of the report are not), and someone told us that the uncertainty interval for the true fatality rate of a vehicle had a &gt;= 5% chance of including zero, we would say that either we should use a more informative prior or we should use a model that can incorporate more data (in this case, perhaps we could try to understand the variance between fatality rates of different models in the same class and then use the base rate of fatalities for the class as a prior, or we could incorporate information from other models under the same make if those are believed to be correlated).</p> <p>Some people object to using informative priors as a form of bias laundering, but we should note that the prior that's used for the IIHS analysis is not completely uninformative. All of the intervals reported stop at zero because they're using the fact that a vehicle cannot create life to bound the interval at zero. But we have information that's nearly as strong that no 2014 vehicle is so safe that the expected fatality rate is zero; using that information is not fundamentally different from capping the interval at zero and not reporting negative numbers for the uncertainty interval of the fatality rate.</p> <p>Also, the IIHS data only includes driver fatalities.
This is understandable since that's the easiest way to normalize for the number of passengers in the car, but it means that we can't possibly see the impact of car makers not improving passenger small-overlap safety until the passenger-side small overlap test was added in 2018, or the result of the lack of rear crash testing for the case Volvo considers important (kids in the back row of a 3-row SUV). This also means that we cannot observe the impact of a number of things Volvo has done, e.g., being very early on pedestrian and then cyclist detection in their automatic braking system, adding a crumple zone to reduce back injuries in run-off-road accidents, which they observed often cause life-changing spinal injuries due to the impact from the vehicle dropping, etc.</p> <p>We can also observe that, in the IIHS analysis, many factors that one might want to control for aren't (e.g., miles driven isn't controlled for, which will make trucks look relatively worse and luxury vehicles look relatively better; rural vs. urban miles driven also isn't controlled for, which will have the same directional impact). One way to see that the numbers are heavily influenced by confounding factors is by looking at AWD or 4WD vs. 2WD versions of cars. They often have wildly different fatality rates even though the safety differences are not very large (and the difference is often in favor of the 2WD vehicle). Some plausible causes of that are random noise, differences in who buys different versions of the same vehicle, and differences in how the vehicles are used.</p> <p>If we'd like to answer the question &quot;which car makes or models are more or less safe&quot;, I don't find any of the aggregations that are publicly available to be satisfying and I think we need to look at the source data and do our own analysis to see if the data are consistent with what we see in crash test results.</p> <h3 id="conclusion">Conclusion</h3> <p>We looked at 12 different car makes and how they fared when the IIHS added small overlap tests. We saw that only Volvo was taking this kind of accident seriously before companies were publicly shamed for having poor small overlap safety by the IIHS even though small overlap crashes were known to be a significant source of fatalities at least since the 90s.</p> <p>Although I don't have the budget to do other tests, such as a rear crash test in a fully occupied vehicle, it appears plausible and perhaps even likely that most car makers that aren't Volvo would have mediocre or poor test scores if a testing agency decided to add another kind of crash test.</p> <h3 id="bonus-real-engineering-vs-programming">Bonus: &quot;real engineering&quot; vs. programming</h3> <p>As Hillel Wayne has noted, although <a href="https://twitter.com/danluu/status/1162469763374673920">programmers often have an idealized view of what &quot;real engineers&quot; do</a>, when you <a href="https://youtu.be/3018ABlET1Y">compare what &quot;real engineers&quot; do with what programmers do, it's frequently not all that different</a>.
In particular, a common lament of programmers is that we're not held liable for our mistakes or poor designs, even in cases where that costs lives.</p> <p>Although automotive companies can, in some cases, be held liable for unsafe designs, just optimizing for a small set of benchmarks, which must've resulted in extra deaths compared to optimizing for safety instead of benchmark scores, isn't something that engineers or corporations were, in general, held liable for.</p> <h3 id="bonus-reputation">Bonus: reputation</h3> <p>If I look at what people in my extended social circles think about vehicle safety, Tesla has the best reputation by far. If you look at broad-based consumer polls, that's a different story, and Volvo usually wins there, with other manufacturers fighting for a distant second.</p> <p>I find the Tesla thing interesting since <a href="https://news.ycombinator.com/item?id=33512915">their responses are basically the opposite of what you'd expect from a company that was serious about safety</a>. When serious problems have occurred (with respect to safety or otherwise), they often have a very quick response that's basically &quot;everything is fine&quot;. <a href="https://twitter.com/danluu/status/760652055446827009">I would expect an organization that's serious about safety or improvement to respond with &quot;we're investigating&quot;, followed by a detailed postmortem explaining what went wrong, but that doesn't appear to be Tesla's style</a>.</p> <p>For example, on the driver-side small overlap test, Tesla had one model with a relevant score and it scored Acceptable (below Good, but above Poor and Marginal) even after modifications were made to improve the score. <a href="https://www.businessinsider.com/tesla-responds-to-model-s-crash-test-findings-iihs-2017-7">Tesla disputed the results, saying they make &quot;the safest cars in history&quot;</a> and implying that IIHS should be ignored because they have ulterior motives, in favor of crash test scores from an agency that is objective and doesn't have ulterior motives, i.e., the agency that gave Tesla a good score:</p> <blockquote> <p>While IIHS and dozens of other private industry groups around the world have methods and motivations that suit their own subjective purposes, the most objective and accurate independent testing of vehicle safety is currently done by the U.S. Government which found Model S and Model X to be the two cars with the lowest probability of injury of any cars that it has ever tested, making them the safest cars in history.</p> </blockquote> <p>As we've seen, Tesla isn't unusual for optimizing for a specific set of crash tests and achieving a mediocre score when an unexpected type of crash occurs, but their response is unusual. However, it makes sense from a cynical PR perspective. As we've seen over the past few years, loudly proclaiming something, regardless of whether or not it's true, even when there's incontrovertible evidence that it's untrue, not only seems to work, but that kind of bombastic rhetoric also appears to attract superfans who will aggressively defend the brand.
If you watch car reviewers on youtube, they'll sometimes mention that they get hate mail for reviewing Teslas just like they review any other car and that they don't see anything like it for any other make.</p> <p>Apple also used this playbook to good effect in the 90s and early '00s, when they were rapidly falling behind in performance and responded not by improving performance, but by running a series of ad campaigns saying that they had the best performance in the world and that they were shipping &quot;supercomputers&quot; on the desktop.</p> <p>Another reputational quirk is that I know a decent number of people who believe that the safest cars they can buy are &quot;American Cars from the 60's and 70's that aren't made of plastic&quot;. We don't have directly relevant small overlap crash test scores for old cars, but the test data we do have on old cars indicates that they fare extremely poorly in overall safety compared to modern cars. For a visually dramatic example, <a href="https://www.youtube.com/watch?v=fPF4fBGNK0U">see this crash test of a 1959 Chevrolet Bel Air vs. a 2009 Chevrolet Malibu</a>.</p> <h3 id="appendix-methodology-summary">Appendix: methodology summary</h3> <p>The top-line results section uses scores for the driver-side small overlap test both because it's the one where I think it's the most difficult to justify skimping on safety as measured by the test and because it's been around for long enough that we can see the impact of modifications to existing models and changes to subsequent models, which isn't true of the passenger side small overlap test (where many models are still untested).</p> <p>For the passenger side small overlap test, someone might argue that the driver side is more important because you virtually always have a driver in a car accident and may or may not have a front passenger. Also, for small overlap collisions (which simulate a head-on collision where the vehicles only overlap by 25%), driver's side collisions are more likely than passenger side collisions.</p> <p>Except to check Volvo's scores, I didn't look at roof crash test scores (which were added in 2009). I'm not going to describe the roof test in detail, but for the roof test, someone might argue that the roof test score should be used in conjunction with scoring the car for rollover probability since the roof test just tests roof strength, which is only relevant when a car has rolled over. I think, given what the data show, this objection doesn't hold in many cases (the vehicles with the worst roof test scores are often vehicles that have relatively high rollover rates), but it does in some cases, which would complicate the analysis.</p> <p>In most cases, we only get one reported test result for a model. However, there can be multiple versions of a model -- including before and after making safety changes intended to improve the test score. If changes were made to the model to improve safety, the test score is usually from after the changes were made and we usually don't get to see the score from before the model was changed. However, there are many exceptions to this, which are noted in the detailed results section.</p> <p>For this post, scores only count if the model was introduced before or near when the new test was introduced, since models introduced later could have design changes that optimize for the test.</p> <h3 id="appendix-detailed-results">Appendix: detailed results</h3> <p>On each test, IIHS gives an overall rating (from worst to best) of Poor, Marginal, Acceptable, or Good.
The tests have sub-scores, but we're not going to use those for this analysis. In each sub-section, we'll look at how many models got each score when the small overlap tests were added.</p> <h4 id="volvo">Volvo</h4> <p>All Volvo models examined scored Good (the highest possible score) on the new tests when they were added (roof, driver-side small overlap, and passenger-side small overlap). One model, the 2008-2017 XC60, had a change made in 2013 to trigger its side curtain airbag during a small overlap collision. Other models were tested without modifications.</p> <h4 id="mercedes">Mercedes</h4> <p>Of three pre-existing models with test results for driver-side small overlap, one scored Marginal without modifications and two scored Good after structural modifications. The model where we only have unmodified test scores (Mercedes C-Class) was fully re-designed after 2014, shortly after the driver-side small overlap test was introduced.</p> <p>As mentioned above, we often only get to see public results for models without modifications to improve results xor with modifications to improve results, so, for the models that scored Good, we don't actually know how they would've scored if you bought a vehicle before Mercedes updated the design, but the Marginal score from the one unmodified model we have is a negative signal.</p> <p>Also, when the passenger side small overlap test was added, the Mercedes vehicles generally scored Good, indicating that Mercedes didn't only increase protection on the driver's side in order to improve test scores.</p> <h4 id="bmw">BMW</h4> <p>Of the two models where we have relevant test scores, both scored Marginal before modifications. In one of the cases, there's also a score after structural changes were made in the 2017 model (recall that the driver-side small overlap test was introduced in 2012) and the model scored Good afterwards. The other model was fully redesigned after 2016.</p> <p>For the five models where we have relevant passenger-side small overlap scores, all scored Good, indicating that the changes made to improve driver-side small overlap test scores weren't only made on the driver's side.</p> <h4 id="honda">Honda</h4> <p>Of the five Honda models where we have relevant driver-side small overlap test scores, two scored Good, one scored Marginal, and two scored Poor. The model that scored Marginal had structural changes plus a seatbelt change in 2015 that changed its score to Good; other models weren't updated or don't have updated IIHS scores.</p> <p>Of the six Honda models where we have passenger-side small overlap test scores, two scored Good without modifications, two scored Acceptable without modifications, and one scored Good with modifications to the bumper.</p> <p>All of those models scored Good on the driver side small overlap test, indicating that when Honda increased the safety on the driver's side to score Good on the driver's side test, they didn't apply the same changes to the passenger side.</p> <h4 id="toyota">Toyota</h4> <p>Of the six Toyota models where we have relevant driver-side small overlap test scores for unmodified models, one scored Acceptable, four scored Marginal, and one scored Poor.</p> <p>The model that scored Acceptable had structural changes made to improve its score to Good, but on the driver's side only. The model was later tested in the passenger-side small overlap test and scored Acceptable.
Of the four models that scored Marginal, one had structural modifications made in 2017 that improved its score to Good and another had airbag and seatbelt changes that improved its score to Acceptable. The vehicle that scored Poor had structural changes made that improved its score to Acceptable in 2014, followed by later changes that improved its score to Good.</p> <p>There are four additional models where we only have scores from after modifications were made. Of those, one scored Good, one scored Acceptable, one scored Marginal, and one scored Poor.</p> <p>In general, changes appear to have been made to the driver's side only and, on introduction of the passenger side small overlap test, vehicles had passenger side small overlap scores that were the same as the driver's side score before modifications.</p> <h4 id="ford">Ford</h4> <p>Of the two models with relevant driver-side small overlap test scores for unmodified models, one scored Marginal and one scored Poor. Both of those models were produced into 2019 and neither has an updated test result. Of the three models where we have relevant results for modified vehicles, two scored Acceptable and one scored Marginal. Also, one model was released the year the small overlap test was introduced and one the year after; both of those scored Acceptable. It's unclear if those should be considered modified or not since the design may have had last-minute changes before release.</p> <p>We only have three relevant passenger-side small overlap tests. One is Good (for a model released in 2015) and the other two are Poor; these are the two models mentioned above as having scored Marginal and Poor, respectively, on the driver-side small overlap test. It appears that the models continued to be produced into 2019 without safety changes. Both of these unmodified models were trucks; this isn't very unusual for a truck and is one of a number of reasons that fatality rates are generally higher in trucks -- until recently, many of them were based on old platforms that hadn't been updated for a long time.</p> <h4 id="chevrolet">Chevrolet</h4> <p>Of the three Chevrolet models where we have relevant driver-side small overlap test scores before modifications, one scored Acceptable and two scored Marginal. One of the Marginal models had structural changes plus a change that caused side curtain airbags to deploy sooner in 2015, which improved its score to Good.</p> <p>Of the four Chevrolet models where we only have relevant driver-side small overlap test scores after the model was modified (all had structural modifications), two scored Good and two scored Acceptable.</p> <p>We only have one relevant score for the passenger-side small overlap test; that score is Marginal. That's on the model that was modified to improve its driver-side small overlap test score from Marginal to Good, indicating that the changes were made to improve the driver-side test score and not to improve passenger safety.</p> <h4 id="subaru">Subaru</h4> <p>We don't have any models where we have relevant passenger-side small overlap test scores for models before they were modified.</p> <p>One model had a change to cause its airbag to deploy during small overlap tests; it scored Acceptable.
Two models had some kind of structural changes, one of which scored Good and one of which scored Acceptable.</p> <p>The model that had airbag changes had structural changes made in 2015 that improved its score from Acceptable to Good.</p> <p>For the one model where we have relevant passenger-side small overlap test scores, the score was Marginal. Also, for one of the models with structural changes, it was indicated that the changes included changes to the left part of the firewall, indicating that changes were made to improve the driver's side test score without improving safety for a passenger in a passenger-side small overlap crash.</p> <h4 id="tesla">Tesla</h4> <p>There's only one model with relevant results for the driver-side small overlap test. That model scored Acceptable before and after modifications were made to improve test scores.</p> <h4 id="hyundai">Hyundai</h4> <p>Of the five vehicles where we have relevant driver-side small overlap test scores, one scored Acceptable, three scored Marginal, and one scored Poor. We don't have any indication that models were modified to improve their test scores.</p> <p>Of the two vehicles where we have relevant passenger-side small overlap test scores for unmodified models, one scored Good and one scored Acceptable.</p> <p>We also have one score for a model that had structural modifications to score Acceptable, which later had further modifications that allowed it to score Good. That model was introduced in 2017 and had a Good score on the driver-side small overlap test without modifications, indicating that it was designed to achieve a good test score on the driver's side test without similar consideration for a passenger-side impact.</p> <h4 id="dodge">Dodge</h4> <p>Of the five models where we have relevant driver-side small overlap test scores for unmodified models, two scored Acceptable, one scored Marginal, and two scored Poor. There are also two models where we have test scores after structural changes were made for safety in 2015; both of those models scored Marginal.</p> <p>We don't have relevant passenger-side small overlap test scores for any model, but even if we did, the dismal scores on the modified models mean that we might not be able to tell if similar changes were made to the passenger side.</p> <h4 id="nissan">Nissan</h4> <p>Of the seven models where we have relevant driver-side small overlap test scores for unmodified models, two scored Acceptable and five scored Poor.</p> <p>We have one model that only has test scores for a modified model; the frontal airbags and seatbelts were modified in 2013 and the side curtain airbags were modified in 2017. The score after modifications was Marginal.</p> <p>One of the models that scored Poor had structural changes made in 2015 that improved its score to Good.</p> <p>Of the four models where we have relevant passenger-side small overlap test scores, two scored Good, one scored Acceptable (that model scored Good on the driver-side test), and one scored Marginal (that model also scored Marginal on the driver-side test).</p> <h4 id="jeep">Jeep</h4> <p>Of the two models where we have relevant driver-side small overlap test scores for unmodified models, one scored Marginal and one scored Poor.</p> <p>There's one model where we only have test scores after modifications; that model had changes to its airbags and seatbelts and it scored Marginal after the changes.
This model was also later tested on the passenger-side small overlap test and scored Poor.</p> <p>One other model has a relevant passenger-side small overlap test score; it scored Good.</p> <h4 id="volkswagen">Volkswagen</h4> <p>The two models where we have relevant driver-side small overlap test scores for unmodified models both scored Marginal.</p> <p>Of the two models where we only have scores after modifications, one was modified in 2013 and scored Marginal after modifications. It was then modified again in 2015 and scored Good after modifications. That model was later tested on the passenger side small-overlap test, where it scored Acceptable, indicating that the modifications differentially favored the driver's side. The other scored Acceptable after changes made in 2015 and then scored Good after further changes made in 2016. The 2016 model was later tested on the passenger-side small overlap test and scored Marginal, once again indicating that changes differentially favored the driver's side.</p> <p>We have passenger-side small overlap test scores for two other models, both of which scored Acceptable. These were models introduced in 2015 (well after the introduction of the driver-side small overlap test) and scored Good on the driver-side small overlap test.</p> <h3 id="2021-update">2021 update</h3> <p><a href="https://www.iihs.org/news/detail/small-suvs-struggle-in-new-tougher-side-test">The IIHS has released the first set of results for their new &quot;upgraded&quot; side-impact tests</a>. They've been making noises about doing this for quite a while and have mentioned that in real-world data on (some) bad crashes, they've observed intrusion into the cabin that's significantly greater than is seen on their tests. They've mentioned that some vehicles do relatively well on the new tests and some less well but haven't released official scores until now.</p> <p>The results in the new side-impact tests are different from the results described in the post above. So far, only small SUVs have had their results released and only the Mazda CX-5 has a result of &quot;Good&quot;. Of the three manufacturers that did well on the tests described in this post, only Volvo has public results and they scored &quot;Acceptable&quot;. Some questions I have are:</p> <ul> <li>Will Volvo score better for their other vehicles (most of their vehicles are built on a different platform from the vehicle that has public results)?</li> <li>Will Volvo quickly update their vehicles to achieve the highest score on the test? Unlike a lot of other manufacturers, we don't have recent data from Volvo on how they responded to something like this because they didn't need to update their vehicles to achieve the highest score on the last two new tests.</li> <li>Will BMW and Mercedes either score well on the new test or quickly update their vehicles to score well once again?</li> <li>Will other Mazda vehicles also score well without updates?</li> </ul> <h3 id="appendix-miscellania">Appendix: miscellania</h3> <p>A number of name brand car makes weren't included. Some because their sales in the U.S. are low and/or declining rapidly (Mitsubishi, Fiat, Alfa Romeo, etc.), some because there's very high overlap in what vehicles are tested (Kia, Mazda, Audi), and some because there aren't relevant models with driver-side small overlap test scores (Lexus).
When a corporation owns an umbrella of makes, like FCA with Jeep, Dodge, Chrysler, Ram, etc., these weren't pooled since most people who aren't car nerds aren't going to recognize FCA, but may recognize Jeep, Dodge, and Chrysler.</p> <p>If the terms of service of the API allowed you to use IIHS data however you wanted, I would've included smaller makes, but since the API comes with very restrictive terms on how you can display or discuss the data, which aren't compatible with exploratory data analysis, and since I couldn't know how I would want to display or discuss the data before looking at the data, I pulled all of these results by hand (and didn't click through any EULAs, etc.), which was fairly time consuming, so there was a trade-off between more comprehensive coverage and the rest of my life.</p> <h3 id="appendix-what-car-should-i-buy">Appendix: what car should I buy?</h3> <p>That depends on what you're looking for; there's no way to make a blanket recommendation. For practical information about particular vehicles, <a href="https://www.youtube.com/user/TTACVideo">Alex on Autos</a> is the best source that I know of. I don't generally like videos as a source of practical information, but car magazines tend to be much less informative than youtube car reviewers. There are car reviewers that are much more popular, but their popularity appears to come from having witty banter between charismatic co-hosts or other things that not only aren't directly related to providing information, they actually detract from providing information. If you just want to know about how cars work, <a href="https://www.youtube.com/user/EngineeringExplained">Engineering Explained</a> is also quite good, but the information there is generally not practical.</p> <p>For reliability information, Consumer Reports is probably your best bet (you can also look at J.D. Power, but the way they aggregate information makes it much less useful to consumers).</p> <p><small>Thanks to Leah Hanson, Travis Downs, Prabin Paudel, Jeshua Smith, and Justin Blank for comments/corrections/discussion</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:F">this includes the 2004-2012 Volvo S40/V50, 2006-2013 Volvo C70, and 2007-2013 Volvo C30, which were designed during the period when Ford owned Volvo. Although the C1 platform was a joint venture between Ford, Volvo, and Mazda engineers, the work was done under a Ford VP at a Ford facility. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:A"><p>to be fair, as we saw with the IIHS small overlap tests, not every manufacturer did terribly. In 2017 and 2018, 8 vehicles sold in Africa were crash tested. One got what we would consider a mediocre to bad score in the U.S. or Europe, five got what we would consider to be a bad score, and &quot;only&quot; three got what we would consider to be an atrocious score. The Nissan NP300, Datsun Go, and Chery QQ3 were the three vehicles that scored the worst. Datsun is a sub-brand of Nissan and Chery is a Chinese brand, also known as Qirui.</p> <p>We see the same thing if we look at cars sold in India. Recently, some tests have been run on cars sent to the Indian market and a number of vehicles from Datsun, Renault, Chevrolet, Tata, Honda, Hyundai, Suzuki, Mahindra, and Volkswagen came in with atrocious scores that would be considered impossibly bad in the U.S.
or Europe.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Finding the Story voyager-story/ Tue, 02 Jun 2020 00:05:34 -0700 voyager-story/ <p><em>This is an archive of an old pseudonymously written post from the 90s from someone whose former pseudonym seems to have disappeared from the internet.</em></p> <p>I see that <cite>Star Trek: Voyager</cite> has added a new character, a Borg. (From the photos, I also see that they're still breeding women for breast size in the 24th century.) What ticked me off was the producer's comment (I'm paraphrasing), &quot;The addition of Seven of Nine will give us limitless story possibilities.&quot;</p> <p>Uh-huh. Riiiiiight.</p> <p>Look, they did't recognize the stories they <i>had</i>. I watched the first few episodes of <cite>Voyager</cite> and quit when my bullshit meter when off the scale. (Maybe that's not fair, to judge them by only a few episodes. But it's not fair to subject me to crap like the holographic lungs, either.)</p> <p>For those of you who don't watch <cite>Star Trek: Voyager</cite>, the premise is that the <i>Voyager</i>, sort of a space corvette, gets transported umpteen zillions of light years from where it should be. It will take over seventy years at top speed for them to get home to their loved ones. For reasons we needn't go into here, the crew consists of a mix of loyal Federation members and rebels.</p> <p>On paper, this looks good. There's an uneasy alliance in the crew, there's exploration as they try to get home, there's the whole &quot;island in space&quot; routine. And the <cite>Voyager</cite> is nowhere near as big as the Enterprise — it's not mentally healthy for people to stay aboard for that long.</p> <p>But can this idea actually sustain a whole series? Would it be interesting to watch five years of &quot;the crew bickers&quot; or &quot;they find a new clue to faster interstellar travel but it falls through&quot;? I don't think so.</p> <p>(And, in fact, the crew settled down <i>awfully</i> quickly.)</p> <p>The demands of series television subvert the premise. The basic demand of series television is that our regular characters are people we come to know and to care about — we want them to come into our living rooms every week. We must care about their changes, their needs, their desires. We must worry when they're put in jeopardy. But we know it's a series, so it's hard to make us worry. We know that the characters will be back next week.</p> <p>The demands of a <i>story</i> require someone to <i>change</i> of their own accord, to recognize some difference. The need to change can be imposed from without, but the actual change must be self-motivated. (This is the fundamental paradox of series television: the only character allowed to change is a guest, but the instrument of that change has to be a series regular, therefore depriving both characters of the chance to do something interesting.)</p> <p>Series with strict continuity of episodes (episode 2 must follow episode 1) allow change — but they're harder to sell in syndication after the show goes off the air. Economics favour unchanging regular characters.</p> <p>Some series — such as Hill Street Blues — get around the jeopardy problem by actually making characters disposable. Some characters show up for a few episodes and then die, reminding us that it could happen to the regulars, too. Sometimes it does happen to the regulars.</p> <p>(When the characters change in the pilot, there may be a problem. 
A writer who was approached to work on Mary Tyler Moore's last series saw from the premise that it would be brilliant for six episodes and then had noplace to go. The first Fox series starring Tea Leoni, <cite>Flying Blind</cite>, had a very funny pilot and set up an untenable situation.)</p> <p>I'm told the only interesting character on <cite>Voyager</cite> has been the doctor, who can change. He's the only character allowed to grow.</p> <p>The first problem with Voyager, then, is that characters aren't allowed to change — or the change is imposed from outside. (By the way, an imposed change is a great way to <i>start</i> a story. The character then fights it, and that's interesting. It's a terrible way to <i>end</i> a story.)</p> <p>The second problem is that they don't make use of the elements they have. Let's go back to the first season. There was an episode in which there's a traitor on board who is as smart as Janeway herself. (How psychiatric testing missed this, I don't know, but the Trek universe has never had really good luck with psychiatry.) After leading Janeway by the nose for fifty minutes, she figures out who it is, and confronts him. He says yes — <i>and beams off the ship</i>, having conveniently made a deal with the locals.</p> <p>Perfect for series television. We've got a supposedly intelligent villain out there who could come back and Janeway's been given a run for her money — except that I felt cheated. Where's the story? Where's the resolution?</p> <p>Here's what I think they should have done. It's not traditional series television, but I think it would have been better stories.</p> <p>First of all, the episode ends when Janeway confronts the bad guy and arrests him. He's put in the brig — <i>and stays there</i>. The viewer gets some sense of victory here.</p> <p>But now there's someone as smart as Janeway in the brig. Suddenly we've set up <cite>Silence of the Lambs</cite>. (I don't mind stealing if I steal from good sources.) Whenever a problem is <i>big enough</i>, Janeway has this option: she can go to the brig and try and make a deal with the bad guy. &quot;The ship dies, you die.&quot; Not only that, here's someone on board ship with whom she has a unique relationship — one not formally bounded by rank. What does the bad guy really want?</p> <p>And whenever Janeway's feeling low, he can taunt her. &quot;By the way, I thought of a way to get everyone home in one-tenth the time. Have you, Captain?&quot;</p> <p>You wouldn't put him in every episode. But any time you need that extra push, he's there. Remember, we can have him escape any time we want, through the same sleight used in the original episode.</p> <p>Furthermore, it's one thing to catch him; it's another thing to keep him there. You can generate another entire episode out of an escape attempt by the prisoner. But that would be an intermediate thing. Let's talk about the finish I would have liked to have seen.</p> <p>Let's invent a crisis. The balonium generator explodes; we're deep in warp space; our crack engineering crew has jury-rigged a repair to the sensors and found a Class M planet that might do for the repairs. Except it's just too far away. The margin is tight — but can't be done. There are two too many people on board ship. Each requires a certain amount of food, air, water, etc. Under pressure, Neelix admits that his people can go into suspended animation, so he does. The doctor tries heroically but the engineer who was tending the balonium generator dies. (Hmmm. Power's low. 
The doctor can only be revived at certain critical moments.) Looks good — but they were using air until they died; one more crew member <i>must</i> die for the rest to live.</p> <p>And somebody remembers the guy in the brig. &quot;The question of his guilt,&quot; says Tuvok, &quot;is resolved. The authority of the Captain is absolute. You are within your rights to hold a summary court martial and sentence him to death.&quot;</p> <p>And Janeway says no. &quot;The Federation doesn't do that.&quot;</p> <p>Except that everyone will die if she doesn't. The pressure is on Janeway, now. Janeway being Janeway, she's looking for a technological fix. &quot;Find an answer, dammit!&quot; And the deadline is coming up. After a certain point, the prisoner has to die, along with someone else.</p> <p>A crewmember volunteers to die (a regular). Before Janeway can accept, yet another (regular) crewmember volunteers, and Janeway is forced to decide. — And Tuvok points out that while morally it's defensible if that member volunteered to die, the ship cannot continue without either of those crewmembers. It <i>can</i> continue without the prisoner. Clearly the prisoner is not worth as much as those crewmembers, but she is the captain. She <i>must</i> make this decision.</p> <p>Our fearless engineering crew thinks they might have a solution, but it will use nearly everything they've got, and they need another six hours to work on the feasibility. Someone in the crew tries to resolve the problem for her by offing the prisoner — the failure uses up more valuable power. Now the deadline moves up closer, past the six hours deadline. The engineering crew's idea is no longer feasible.</p> <p>For his part, the prisoner is now bargaining. He says he's got ideas to help. Does he? He's tried to destroy the ship before. And he won't reveal them until he gets a full pardon.</p> <p>(This is all basic plotting: keep piling on difficulties. Put a carrot in front of the characters, keep jerking it away.)</p> <p>The tricky part is the ending. It's a requirement that the ending derive logically from what has gone before. If you're going to invoke a technological fix, you have to set the groundwork for it in the first half of the show. Otherwise it's technobabble. It's deus ex machina. (Any time someone says just after the last commercial break, &quot;Of course! If we vorpalize the antibogon flow, we're okay!&quot; I want to smack a writer in the head.)</p> <p>Given the situation set up here, we have three possible endings:</p> <ul> <li>Some member of the crew tries to solve the problem by sacrificing themselves. (Remember, McCoy and Spock did this.) This is a weak solution (unless Janeway does it) because it takes the focus off Janeway's decision.</li> <li>Janeway strikes a deal with the prisoner, and together they come up with a solution (which doesn't involve the antibogon flow). This has the interesting repercussions of granting the prisoner his freedom — while everyone else on ship hates his guts. Grist for another episode, anyway.</li> <li>Janeway kills the prisoner but refuses to hold the court martial. She may luck out — the prisoner might survive; that million-to-one-shot they've been praying for but couldn't rely on comes through — but she has decided to kill the prisoner rather than her crew.</li> </ul> <p>My preferred ending is the third one, even though the prisoner need not die. The decision we've set up is a difficult one, and it is meaningful. It is a command decision. 
Whether she ends up killing the prisoner is not relevant; what is relevant is that she <i>decides</i> to do it.</p> <p>John Gallishaw once categorized all stories as either stories of <i>achievement</i> or of <i>decision</i>. A decision story is much harder to write, because both choices have to matter.</p> A simple way to get more value from tracing tracing-analytics/ Sun, 31 May 2020 00:06:34 -0700 tracing-analytics/ <p>A lot of people seem to think that distributed tracing isn't useful, or at least not without extreme effort that isn't worth it for companies smaller than FB. For example, <a href="https://twitter.com/theatrus/status/1220205677672452097">here are</a> <a href="https://twitter.com/mattklein123/status/1049813546077323264">a couple of</a> public conversations that sound like a number of private conversations I've had. Sure, there's value somewhere, <a href="https://twitter.com/mattklein123/status/1220150248741326855">but it costs too much to unlock</a>.</p> <p>I think this overestimates how much work it is to get a lot of value from tracing. At Twitter, Rebecca Isaacs was able to lay out a vision for how to get value from tracing and executed on it (with help from a number other folks, including Jonathan Simms, Yuri Vishnevsky, Ruben Oanta, Dave Rusek, Hamdi Allam, and many others<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">1</a></sup>) such that the work easily paid for itself. This post is going to describe the tracing &quot;infrastructure&quot; we've built and describe some use cases where we've found it to be valuable. Before we get to that, let's start with some background about the situation before Rebecca's vision came to fruition.</p> <p>At a high level, we could say that we had a trace-view oriented system and ran into all of the issues that one might expect from that. Those issues are discussed in more detail in <a href="https://medium.com/@copyconstruct/distributed-tracing-weve-been-doing-it-wrong-39fc92a857df">this article by Cindy Sridharan</a>. However, I'd like to discuss the particular issues we had in more detail since I think it's useful to look at what specific things were causing problems.</p> <p>Taken together, the issues were problematic enough that tracing was underowned and arguably unowned for years. Some individuals did work in their spare time to keep the lights on or improve things, but the lack of obvious value from tracing led to a vicious cycle where the high barrier to getting value out of tracing made it hard to fund organizationally, which made it hard to make tracing more usable.</p> <p>Some of the issues that made tracing low ROI included:</p> <ul> <li>Schema made it impossible to run simple queries &quot;in place&quot;</li> <li>No real way to aggregate info <ul> <li>No way to find interesting or representative traces</li> </ul></li> <li>Impossible to know actual sampling rate, sampling highly non-representative</li> <li>Time</li> </ul> <h3 id="schema">Schema</h3> <p>The schema was effectively a set of traces, where each trace was a set of spans and each span was a set of annotations. Each span that wasn't a root span had a pointer to its parent, so that the graph structure of a trace could be determined.</p> <p>For the purposes of this post, we can think of each trace as either an external request including all sub-RPCs or a subset of a request, rooted downstream instead of at the top of the request. 
We also trace some things that aren't requests, like builds and git operations, but for simplicity we're going to ignore those for this post even though the techniques we'll discuss also apply to those.</p> <p>Each span corresponds to an RPC and each annotation is data that a developer chose to record on a span (e.g., the size of the RPC payload, queue depth of various queues in the system at the time of the span, or GC pause time for GC pauses that interrupted the RPC).</p> <p>Some issues that came out of having a schema that was a set of sets (of bags) included:</p> <ul> <li>Executing any query that used information about the graph structure inherent in a trace required reading every span in the trace and reconstructing the graph</li> <li>Because there was no index or summary information of per-trace information, any query on a trace required reading every span in a trace</li> <li>Practically speaking, because the two items above are too expensive to do at query time in an ad hoc fashion, the only query people ran was some variant of &quot;give me a few spans matching a simple filter&quot;</li> </ul> <h3 id="aggregation">Aggregation</h3> <p>Until about a year and a half ago, the only supported way to look at traces was to go to the UI, filter by a service name from a combination search box + dropdown, and then look at a list of recent traces, where you could click on any trace to get a &quot;trace view&quot;. Each search returned the N most recent results, which wouldn't necessarily be representative of all recent results (for reasons mentioned below in the Sampling section), let alone representative of all results over any other time span.</p> <p>Per the problems discussed above in the schema section, since it was too expensive to run queries across a non-trivial number of traces, it was impossible to ask questions like &quot;are any of the traces I'm looking at representative of common traces or am I looking at weird edge cases?&quot; or &quot;show me traces of specific tail events, e.g., when a request from service A to service B times out or when write amplification from service A to some backing database is &gt; 3x&quot;, or even &quot;only show me complete traces, i.e., traces where we haven't dropped spans from the trace&quot;.</p> <p>Also, if you clicked on a trace that was &quot;too large&quot;, the query would time out and you wouldn't be able to view the trace -- this was another common side effect of the lack of any kind of rate limiting logic plus the schema.</p> <h3 id="sampling">Sampling</h3> <p>There were multiple places where a decision was made to sample or not. There was no document that listed all of these places, making it impossible to even guess at the sampling rate without auditing all code to figure out where sampling decisions were being made.</p> <p>Moreover, there were multiple places where an unintentional sampling decision would be made due to the implementation. Spans were sent from services that had tracing enabled to a local agent, then to a &quot;collector&quot; service, and then from the collector service to our backing DB. 
Spans could be dropped at any of these points: in the local agent; in the collector, which would have nodes fall over and lose all of their data regularly; and at the backing DB, which would reject writes due to hot keys or high load in general.</p> <p>This design where the trace id is the database key, with no intervening logic to pace out writes, meant that a 1M span trace (which we have) would cause 1M writes to the same key over a period of a few seconds. Another problem would be requests with a fanout of thousands (which exists at every tech company I've worked for), which could cause thousands of writes with the same key over a period of a few milliseconds.</p> <p>Another sampling quirk was that, in order to avoid missing traces that didn't start at our internal front end, there was logic that caused an independent sampling decision in every RPC. If you do the math on this, if you have a service-oriented architecture like ours and you sample at what naively might sound like a moderately low rate, you'll end up with the vast majority of your spans starting at a leaf RPC, resulting in a single span trace. Of the non-leaf RPCs, the vast majority will start at the 2nd level from the leaf, and so on. The vast majority of our load and our storage costs were from these virtually useless traces that started at or near a leaf, and if you wanted to do any kind of analysis across spans to understand the behavior of the entire system, you'd have to account for this sampling bias on top of accounting for all of the other independent sampling decisions.</p> <h3 id="time">Time</h3> <p>There wasn't really any kind of adjustment for clock skew (there was something, but it attempted to do a local pairwise adjustment, which didn't really improve things and actually made it more difficult to reasonably account for clock skew).</p> <p>If you just naively computed how long a span took, even using timestamps from a single host, which removes many sources of possible clock skew, you'd get a lot of negative duration spans, which is of course impossible because a result can't get returned before the request for the result is created. And if you compared times across different hosts, the results were even worse.</p> <h3 id="solutions">Solutions</h3> <p>The solutions to these problems fall into what I think of as two buckets. For problems like dropped spans due to collector nodes falling over or the backing DB dropping requests, there's some straightforward engineering solution using well understood and widely used techniques. For that particular pair of problems, the short term bandaid was to do some GC tuning that reduced the rate of collector nodes falling over by about a factor of 100. That took all of two minutes, and then we replaced the collector nodes with a real queue that could absorb larger bursts in traffic and pace out writes to the DB. 
For the issue where we oversampled leaf-level spans due to rolling the sampling dice on every RPC, that's <a href="algorithms-interviews/">one of these little questions that most people would get right in an interview but that can sometimes get lost as part of a larger system</a>, and it has a number of solutions, e.g., since each span has a parent pointer, we must be able to know whether an RPC has a parent in a relevant place, so we can make a sampling decision and create a trace id iff a span has no parent pointer, which results in a uniform probability of each span being sampled, with each sampled trace being a complete trace.</p> <p>The other bucket is building up datasets and tools (and adding annotations) that allow users to answer questions they might have. This isn't a new idea; <a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf">section 5 of the Dapper paper</a> discussed this, and it was published in 2010.</p> <p>Of course, one major difference is that Google has probably put at least two orders of magnitude more effort into building tools on top of Dapper than we've put into building tools on top of our tracing infra, so a lot of our tooling is much rougher, e.g., figure 6 from the Dapper paper shows a trace view that displays a set of relevant histograms, which makes it easy to understand the context of a trace. We haven't done the UI work for that yet, so the analogous view requires running a simple SQL query. While that's not hard, presenting the user with the data would be a better user experience than making the user query for the data.</p> <p>Of the work that's been done, the simplest obviously high ROI thing we've done is build a set of tables that contain information people might want to query, structured such that common queries that don't inherently have to do a lot of work don't have to do a lot of work.</p> <p>We have, partitioned by day, the following tables:</p> <ul> <li>trace_index <ul> <li>high-level trace-level information, e.g., does the trace have a root; what is the root; if relevant, what request endpoint was hit, etc.</li> </ul></li> <li>span_index <ul> <li>information on the client and server</li> </ul></li> <li>anno_index <ul> <li>&quot;standard&quot; annotations that people often want to query, e.g., request and response payload sizes, client/server send/recv timestamps, etc.</li> </ul></li> <li>span_metrics <ul> <li>computed metrics, e.g., span durations</li> </ul></li> <li>flat_annotation <ul> <li>All annotations, in case you want to query something not in anno_index</li> </ul></li> <li>trace_graph <ul> <li>For each trace, contains a graph representation of the trace, for use with queries that need the graph structure</li> </ul></li> </ul> <p>Just having this set of tables, queryable with SQL queries (or a Scalding or Spark job in cases where Presto SQL isn't ideal, like when doing some graph queries), is enough for tracing to pay for itself, to go from being difficult to justify to being something that's obviously high value.</p> <p>Some of the questions we've been able to answer with this set of tables include:</p> <ul> <li>For this service that's having problems, give me a representative set of traces</li> <li>For this service that has elevated load, show me which upstream service is causing the load</li> <li>Give me the list of all services that have unusual write amplification to downstream service X <ul> <li>Is traffic from a particular service or for a particular endpoint causing unusual write amplification? 
For example, in some cases, we see nothing unusual about the total write amplification from B -&gt; C, but we see very high amplification from B -&gt; C when B is called by A.</li> </ul></li> <li>Show me how much time we spend on <a href="https://en.wikipedia.org/wiki/Serialization">serdes</a> vs. &quot;actual work&quot; for various requests</li> <li>Show me how much different kinds of requests cost in terms of backend work</li> <li>For requests that have high latency, as determined by mobile client instrumentation, show me what happened on the backend</li> <li>Show me the set of latency critical paths for this request endpoint (with the annotations we currently have, this has a number of issues that probably deserve their own post)</li> <li>Show me the CDF of services that this service depends on <ul> <li>This is a distribution because whether or not a particular service calls another service is data dependent; it's not uncommon to have a service that will only call another one every 1000 calls (on average)</li> </ul></li> </ul> <p>We have built and are building other tooling, but just being able to run queries and aggregations against trace data, both recent and historical, easily pays for all of the other work we'd like to do. <a href="metrics-analytics/">This is analogous to what we saw when we looked at metrics data: taking data we already had and exposing it in a way that lets people run arbitrary queries immediately paid dividends</a>. Doing that for tracing is less straightforward than doing that for metrics because the data is richer, but it's not a fundamentally different idea.</p> <p>I think that having something to look at other than the raw data is also more important for tracing than it is for metrics since the metrics equivalent of a raw &quot;trace view&quot; of traces, a &quot;dashboard view&quot; of metrics where you just look at graphs, is obviously and intuitively useful. If that's all you have for metrics, people aren't going to say that it's not worth funding your metrics infra because dashboards are really useful! However, it's a lot harder to see how to get value out of a raw view of traces, which is where a lot of the comments about tracing not being valuable come from. This difference between the complexity of metrics data and tracing data makes the value add for higher-level views of tracing larger than it is for metrics.</p> <p>Having our data in a format that's not just blobs in a NoSQL DB has also allowed us to more easily build tooling on top of trace data that lets users who don't want to run SQL queries get value out of our trace data. An example of this is the Service Dependency Explorer (SDE), which was primarily built by Yuri Vishnevsky, Rebecca Isaacs, and Jonathan Simms, with help from Yihong Chen. If we try to look at the RPC call graph for a single request, we get something that's pretty large. In some cases, the call tree can be hundreds of levels deep and it's also not uncommon to see a fanout of 20 or more at some levels, which makes a naive visualization difficult to interpret.</p> <p>In order to see how SDE works, let's look at a smaller example where it's relatively easy to understand what's going on. 
Imagine we have 8 services, <code>A</code> through <code>H</code>, and they call each other as shown in the tree below, where we have service <code>A</code> called 10 times, which calls service <code>B</code> a total of 10 times, which calls <code>D</code>, <code>D</code>, and <code>E</code> 50, 20, and 10 times respectively, where the two <code>D</code>s are distinguished by being different RPC endpoints (calls) even though they're the same service, and so on, shown below:</p> <p><img src="images/tracing-analytics/rpc-tree.png" alt="Diagram of RPC call graph; this will be implicitly described in the relevant sections, although the entire SDE section is about showing off a visual tool and will probably be unsatisfying if you're just reading the alt text; the tables described in the previous section are more likely to be what you want if you want a non-visual interpretation of the data, since the SDE is a kind of visualization" width="610" height="488"></p> <p>If we look at SDE from the standpoint of node E, we'll see the following: <img src="images/tracing-analytics/e.png" alt="SDE centered on service E, showing callers and callees, direct and indirect" width="1266" height="576"></p> <p>We can see the direct callers and callees: 100% of calls of E are from C, 100% of calls of E also call C, and we have 20x load amplification when calling C (200/10 = 20), the same as we see if we look at the RPC tree above. If we look at indirect callees, we can see that D has a 4x load amplification (40 / 10 = 4).</p> <p>If we want to see what's directly called by C downstream of E, we can select it and we'll get arrows to the direct descendants of C, which in this case is every indirect callee of E.</p> <p><img src="images/tracing-analytics/e.png" alt="SDE centered on service E, with callee C highlighted" width="1263" height="239"></p> <p>For a more complicated example, we can look at service D, which shows up in orange in our original tree, above.</p> <p>In this case, our summary box reads:</p> <ul> <li>On May 28, 2020 there were... <ul> <li>10 total <a href="https://courses.cs.washington.edu/courses/cse551/15sp/notes/talk-tsa.pdf">TFE</a>-rooted traces</li> <li>110 total traced RPCs to D</li> <li>2.1 thousand total traced RPCs caused by D</li> <li>3 unique call paths from TFE endpoints to D endpoints</li> </ul></li> </ul> <p>The fact that we see D three times in the tree is indicated in the summary box, where it says we have 3 unique call paths from our front end, <a href="https://courses.cs.washington.edu/courses/cse551/15sp/notes/talk-tsa.pdf">TFE</a>, to D.</p> <p>We can expand out the calls to D and, in this case, see both of the calls and what fraction of traffic is to each call.</p> <p><img src="images/tracing-analytics/d.png" alt="SDE centered on service D, with different calls to D expanded by having clicked on D" width="1262" height="364"></p> <p>If we click on one of the calls, we can see which nodes are upstream and downstream dependencies of a particular call; <code>call4</code> is shown below, and we can see that it never hits services <code>C</code>, <code>H</code>, and <code>G</code> downstream even though service <code>D</code> does for <code>call3</code>. 
Similarly, we can see that its upstream dependencies consist of being called directly by C, and indirectly by B and E but not A and C:</p> <p><img src="images/tracing-analytics/d-4.png" alt="SDE centered on service D, with call4 of D highlighted by clicking on call 4; shows only upstream and downstream load that are relevant to call4" width="1263" height="343"></p> <p>Some things we can easily see from SDE are:</p> <ul> <li>What load a service or RPC call causes <ul> <li>Where we have unusual load amplification, whether that's generally true for a service or if it only occurs on some call paths</li> </ul></li> <li>What causes load to a service or RPC call</li> <li>Where and why we get cycles (very common for <a href="https://www.thestrangeloop.com/2018/leverage-vs-autonomy-in-a-large-software-system.html">Strato</a>, among other things)</li> <li>What's causing weird super deep traces</li> </ul> <p>These are all things a user could get out of queries to the data we store, but having a tool with a UI that lets you click around in real time to explore things lowers the barrier to finding these things out.</p> <p>In the example shown above, there are a small number of services, so you could get similar information out of the more commonly used sea of nodes view, where each node is a service, with some annotations on the visualization, but when we've looked at real traces, a global view showing thousands of services makes it very difficult to see what's going on. Some of Rebecca's early analyses used a view like that, but we've found that you need to have a lot of implicit knowledge to make good use of a view like that; a view that discards a lot more information and highlights a few things makes it easier for users who don't happen to have the right implicit knowledge to get value out of looking at traces.</p> <p>Although we've demo'd a view of RPC count / load here, we could also display other things, like latency, errors, payload sizes, etc.</p> <h3 id="conclusion">Conclusion</h3> <p>More generally, this is just a brief description of a few of the things we've built on top of the data you get if you have basic distributed tracing set up. You probably don't want to do exactly what we've done since you probably have somewhat different problems and you're very unlikely to encounter the exact set of problems that our tracing infra had. From backchannel chatter with folks at other companies, I don't think the level of problems we had was unique; if anything, our tracing infra was in a better state than at many or most peer companies (which excludes behemoths like FB/Google/Amazon) since it basically worked and people could and did use the trace view we had to debug real production issues. But, as they say, unhappy systems are unhappy in their own way.</p> <p>Like our previous look at <a href="metrics-analytics/">metrics analytics</a>, this work was done incrementally. 
Since trace data is much richer than metrics data, a lot more time was spent doing ad hoc analyses of the data before writing the Scalding (MapReduce) jobs that produce the tables mentioned in this post, but the individual analyses were valuable enough that there wasn't really a time when this set of projects didn't pay for itself after the first few weeks it took to clean up some of the worst data quality issues and run an (extremely painful) ad hoc analysis with the existing infra.</p> <p>Looking back at discussions on whether or not it makes sense to work on tracing infra, people often point to the numerous failures at various companies to justify a buy (instead of build) decision. I don't think that's exactly unreasonable; the base rate of failure of similar projects shouldn't be ignored. But, on the other hand, most of the work described wasn't super tricky, beyond getting organizational buy-in and having a clear picture of the value that tracing can bring.</p> <p>One thing that's a bit beyond the scope of this post that probably deserves its own post is that tracing and metrics, while not fully orthogonal, are complementary and having only one or the other leaves you blind to a lot of problems. You're going to pay a high cost for that in a variety of ways: unnecessary incidents, extra time spent debugging incidents, generally higher monetary costs due to running infra inefficiently, etc. Also, while having metrics and tracing individually gives you much better visibility than having either alone, some problems require looking at both together; some of the most interesting analyses I've done involve joining (often with a literal SQL join) trace data and <a href="metrics-analytics/">metrics data</a>.</p> <p>To make it concrete, an example of something that's easy to see with tracing but annoying to see with logging unless you add logging to try to find this in particular (which you can do for any individual case, but probably don't want to do for the thousands of things tracing makes visible), is something we looked at above: &quot;show me cases where a specific call path from the load balancer to <code>A</code> causes high load amplification on some service <code>B</code>, which may be multiple hops away from <code>A</code> in the call graph&quot;. In some cases, this will be apparent because <code>A</code> generally causes high load amplification on <code>B</code>, but if it only happens in some cases, that's still easy to handle with tracing but it's very annoying if you're just looking at metrics.</p> <p>An example of something where you want to join tracing and metrics data is when looking at the performance impact of something like a bad host on latency. You will, in general, not be able to annotate the appropriate spans that pass through the host as bad because, if you knew the host was bad at the time of the span, the host wouldn't be in production. But you can sometimes find, with historical data, a set of hosts that are bad, and then look up latency critical paths that pass through the host to determine the end-to-end impact of the bad host (a rough sketch of what this kind of join can look like is shown below).</p> <p>Everyone has their own biases; with respect to tracing, mine come from generally working on things that try to directly improve cost, reliability, and latency, so the examples are focused on that, but there are also a lot of other uses for tracing. 
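To sketch roughly what the bad-host join above could look like once trace data is sitting in tables like the span_index and span_metrics tables described earlier, here's a hypothetical query. The column names (serverHost, durationMs, ds, and so on) are made-up stand-ins rather than our actual schema, the bad_hosts list is a placeholder for whatever host-level health data you'd join against, and a real analysis would go on to look at which of these spans sit on latency critical paths rather than stopping at per-span durations:</p> <pre><code>-- hypothetical sketch: latency of spans that landed on hosts we later
-- determined were bad vs. spans that didn't
with bad_hosts as (
  -- placeholder; in practice this would come from host-level health metrics,
  -- e.g., hosts that were thermal throttling during the time period
  select 'host1234' as hostId
), span_latency as (
  select
    m.durationMs,
    case when b.hostId is not null then 'bad_host' else 'ok_host' end as host_class
  from span_index s
  join span_metrics m
    on s.traceId = m.traceId and s.spanId = m.spanId
  left join bad_hosts b
    on s.serverHost = b.hostId
  where s.ds &gt;= '2020-05-01' and s.ds &lt;= '2020-05-28'
)
select
  host_class,
  approx_percentile(durationMs, 0.5) as p50_ms,
  approx_percentile(durationMs, 0.99) as p99_ms,
  count(*) as cnt
from span_latency
group by host_class
</code></pre> <p>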
You can check out Distributed Tracing in Practice or Mastering Distributed Tracing for some other perspectives.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p><small>Thanks to Rebecca Isaacs, Leah Hanson, Yao Yue, and Yuri Vishnevsky for comments/corrections/discussion.</small></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="metrics-analytics/"> <link rel="prefetch" href="algorithms-interviews/"> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:O"><p>this will almost certainly be an incomplete list, but some other people who've pitched in include Moses, Tiina, Rich, Rahul, Ben, Mike, Mary, Arash, Feng, Jenny, Andy, Yao, Yihong, Vinu, and myself.</p> <p>Note that this relatively long list of contributors doesn't contradict this work being high ROI. I'd estimate that there's been less than 2 person-years worth of work on everything discussed in this post. Just for example, while I spend a fair amount of time doing analyses that use the tracing infra, I think I've only spent on the order of one week on the infra itself.</p> <p>In case it's not obvious from the above, even though I'm writing this up, I was a pretty minor contributor to this. I'm just writing it up because I sat next to Rebecca as this work was being done and was super impressed by both her process and the outcome.</p> <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> </ol> </div> A simple way to get more value from metrics metrics-analytics/ Sat, 30 May 2020 00:06:34 -0700 metrics-analytics/ <p>We spent one day<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup> building a system that immediately found a mid 7 figure optimization (which ended up shipping). In the first year, we shipped mid 8 figures per year worth of cost savings as a result. The key feature this system introduces is the ability to query metrics data across all hosts and all services and over any period of time (since inception), so we've called it LongTermMetrics (LTM) internally since I like boring, descriptive, names.</p> <p>This got started when I was looking for a starter project that would both help me understand the Twitter infra stack and also have some easily quantifiable value. Andy Wilcox suggested looking at <a href="https://stackoverflow.com/a/37102630/334816">JVM survivor space</a> utilization for some large services. If you're not familiar with what survivor space is, you can think of it as a configurable, fixed-size buffer, in the JVM (at least if you use the GC algorithm that's default at Twitter). At the time, if you looked at a random large services, you'd usually find that either:</p> <ol> <li>The buffer was too small, resulting in poor performance, sometimes catastrophically poor when under high load.</li> <li>The buffer was too large, resulting in wasted memory, i.e., wasted money.</li> </ol> <p>But instead of looking at random services, there's no fundamental reason that we shouldn't be able to query all services and get a list of which services have room for improvement in their configuration, sorted by performance degradation or cost savings. And if we write that query for JVM survivor space, this also goes for other configuration parameters (e.g., other JVM parameters, CPU quota, memory quota, etc.). Writing a query that worked for all the services turned out to be a little more difficult than I was hoping due to a combination of data consistency and performance issues. 
Data consistency issues included things like:</p> <ul> <li>Any given metric can have ~100 names, e.g., I found 94 different names for JVM survivor space <ul> <li>I suspect there are more, these were just the ones I could find via a simple search</li> </ul></li> <li>The same metric name might have a different meaning for different services <ul> <li>Could be a counter or a gauge</li> <li>Could have different units, e.g., bytes vs. MB or microseconds vs. milliseconds</li> </ul></li> <li>Metrics are sometimes tagged with an incorrect service name</li> <li>Zombie shards can continue to operate and report metrics even though the cluster manager has started up a new instance of the shard, resulting in duplicate and inconsistent metrics for a particular shard name</li> </ul> <p>Our metrics database, <a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb.html" rel="noreferrer">MetricsDB</a>, was specialized to handle monitoring, dashboards, alerts, etc. and didn't support general queries. That's totally reasonable, since monitoring and dashboards are lower on Maslow's hierarchy of observability needs than general metrics analytics. In backchannel discussions from folks at other companies, the entire set of systems around MetricsDB seems to have solved a lot of the problems that plauge people at other companies with similar scale, but the specialization meant that we couldn't run arbitrary SQL queries against metrics in MetricsDB.</p> <p>Another way to query the data is to use the copy that gets written to <a href="https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS">HDFS</a> in <a href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a> format, which allows people to run arbitrary SQL queries (as well as write <a href="https://en.wikipedia.org/wiki/Cascading_(software)#cite_ref-27">Scalding</a> (MapReduce) jobs that consume the data).</p> <p>Unfortunately, due to the number of metric names, the data on HDFS can't be stored in a columnar format with one column per name -- <a href="https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)">Presto</a> gets unhappy if you feed it too many columns and we have enough different metrics that we're well beyond that limit. If you don't use a columnar format (and don't apply any other tricks), you end up reading a lot of data for any non-trivial query. The result was that you couldn't run any non-trivial query (or even many trivial queries) across all services or all hosts without having it time out. We don't have similar timeouts for Scalding, but Scalding performance is much worse and a simple Scalding query against a day's worth of metrics will usually take between three and twenty hours, depending on cluster load, making it unreasonable to use Scalding for any kind of exploratory data analysis.</p> <p>Given the data infrastructure that already existed, an easy way to solve both of these problems was to write a Scalding job to store the 0.1% to 0.01% of metrics data that we care about for performance or capacity related queries and re-write it into a columnar format. I would guess that at least 90% of metrics are things that almost no one will want to look at in almost any circumstance, and of the metrics anyone really cares about, the vast majority aren't performance related. A happy side effect of this is that since such a small fraction of the data is relevant, it's cheap to store it indefinitely. 
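To give a flavor of the kind of query this enables, here's a hypothetical sketch of a long-range, capacity-planning style question; the ltm_service table and the ds partitioning convention show up in the real queries later in this post, while cpuUsage is a made-up column name used only for illustration:</p> <pre><code>-- hypothetical sketch: month-by-month p99 CPU utilization per service,
-- the kind of long-range query that's only cheap because the filtered
-- subset of metrics is kept indefinitely in a columnar format
select
  substr(ds, 1, 7) as month,
  servicename,
  approx_percentile(cpuUsage, 0.99) as p99_cpu
from ltm_service
where ds &gt;= '2019-01-01' and ds &lt;= '2019-12-31'
group by substr(ds, 1, 7), servicename
</code></pre> <p>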
The standard metrics data dump is deleted after a few weeks because it's large enough that it would be prohibitively expensive to store it indefinitely; a longer metrics memory will be useful for capacity planning or other analyses that prefer to have historical data.</p> <p>The data we're saving includes (but isn't limited to) the following things for each shard of each service:</p> <ul> <li>utilizations and sizes of various buffers</li> <li>CPU, memory, and other utilization</li> <li>number of threads, context switches, core migrations</li> <li>various queue depths and network stats</li> <li>JVM version, feature flags, etc.</li> <li><a href="https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)">GC</a> stats</li> <li><a href="https://github.com/twitter/finagle">Finagle</a> metrics</li> </ul> <p>And for each host:</p> <ul> <li>various things from <a href="https://en.wikipedia.org/wiki/Procfs">procfs</a>, like <code>iowait</code> time, <code>idle</code>, etc.</li> <li>what cluster the machine is a part of</li> <li>host-level info like NIC speed, number of cores on the host, memory,</li> <li>host-level stats for &quot;health&quot; issues like thermal throttling, machine checks, etc.</li> <li>OS version, host-level software versions, host-level feature flags, etc.</li> <li><a href="https://github.com/twitter/rezolus">Rezolus</a> metrics</li> </ul> <p>For things that we know change very infrequently (like host NIC speed), we store these daily, but most of these are stored at the same frequency and granularity that our other metrics is stored for. In some cases, this is obviously wasteful (e.g., for JVM tenuring threshold, which is typically identical across every shard of a service and rarely changes), but this was the easiest way to handle this given the infra we have around metrics.</p> <p>Although the impetus for this project was figuring out which services were under or over configured for JVM survivor space, it started with GC and container metrics since those were very obvious things to look at and we've been incrementally adding other metrics since then. To get an idea of the kinds of things we can query for and how simple queries are if you know a bit of SQL, here are some examples:</p> <h4 id="very-high-p90-jvm-survivor-space">Very High p90 JVM Survivor Space</h4> <p>This is part of the original goal of finding under/over-provisioned services. Any service with a very high p90 JVM survivor space utilization is probably under-provisioned on survivor space. 
Similarly, anything with a very low p99 or p999 JVM survivor space utilization when under peak load is probably overprovisioned (query not displayed here, but we can scope the query to times of high load).</p> <p>A Presto query for very high p90 survivor space across all services is:</p> <pre><code>with results as (
  select
    servicename,
    approx_distinct(source, 0.1) as approx_sources, -- number of shards for the service
    -- real query uses [coalesce and nullif](https://prestodb.io/docs/current/functions/conditional.html) to handle edge cases, omitted for brevity
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.90) as p90_used,
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.50) as p50_used
  from ltm_service
  where ds &gt;= '2020-02-01'
    and ds &lt;= '2020-02-28'
  group by servicename)
select * from results
where approx_sources &gt; 100
order by p90_used desc
</code></pre> <p>Rather than having to look through a bunch of dashboards, we can just get a list and then send diffs with config changes to the appropriate teams or write a script that takes the output of the query and automatically writes the diff. The above query provides a pattern for any basic utilization numbers or rates; you could look at memory usage, new or old gen GC frequency, etc., with similar queries. In one case, we found a service that was wasting enough RAM to pay my salary for a decade.</p> <p>I've been moving away from using thresholds against simple percentiles to find issues, but I'm presenting this query because this is a thing people commonly want to do that's useful and I can write this without having to spend a lot of space explaining why it's a reasonable thing to do; what I prefer to do instead is out of scope of this post and probably deserves its own post.</p> <h4 id="network-utilization">Network utilization</h4> <p>The above query was over all services, but we can also query across hosts. In addition, we can do queries that join against properties of the host, feature flags, etc.</p> <p>Using one set of queries, we were able to determine that we had a significant number of services running up against network limits even though host-level network utilization was low. 
The compute platform team then did a gradual rollout of a change to network caps, which we monitored with queries like the one below to determine that we weren't seeing any performance degradation (theoretically possible if increasing network caps caused hosts or switches to hit network limits).</p> <p>With the network change, we were able to observe smaller queue depths, smaller queue sizes (in bytes), fewer packet drops, etc.</p> <p>The query below only shows queue depths for brevity; adding all of the quantities mentioned is just a matter of typing more names in.</p> <p>The general thing we can do is, for any particular rollout of a platform or service-level feature, see the impact on real services.</p> <pre><code>with rolled as (
  select
    -- rollout was fixed for all hosts during the time period,
    -- can pick an arbitrary element from the time period
    arbitrary(element_at(misc, 'egress_rate_limit_increase')) as rollout,
    hostId
  from ltm_deploys
  where ds = '2019-10-10'
    and zone = 'foo'
  group by hostId
), host_info as (
  select
    arbitrary(nicSpeed) as nicSpeed,
    hostId
  from ltm_host
  where ds = '2019-10-10'
    and zone = 'foo'
  group by hostId
), host_rolled as (
  select rollout, nicSpeed, rolled.hostId
  from rolled
  join host_info on rolled.hostId = host_info.hostId
), container_metrics as (
  select service, netTxQlen, hostId
  from ltm_container
  where ds &gt;= '2019-10-10'
    and ds &lt;= '2019-10-14'
    and zone = 'foo'
)
select
  service,
  nicSpeed,
  approx_percentile(netTxQlen, 1, 0.999, 0.0001) as p999_qlen,
  approx_percentile(netTxQlen, 1, 0.99, 0.001) as p99_qlen,
  approx_percentile(netTxQlen, 0.9) as p90_qlen,
  approx_percentile(netTxQlen, 0.68) as p68_qlen,
  rollout,
  count(*) as cnt
from container_metrics
join host_rolled on host_rolled.hostId = container_metrics.hostId
group by service, nicSpeed, rollout
</code></pre> <h4 id="other-questions-that-became-easy-to-answer">Other questions that became easy to answer</h4> <ul> <li>What's the latency, CPU usage, CPI, or other performance impact of X? <ul> <li>Increasing or decreasing the number of performance counters we monitor per container</li> <li>Tweaking kernel parameters</li> <li>OS or other releases</li> <li>Increasing or decreasing host-level oversubscription</li> <li>General host-level load</li> <li>Retry budget exhaustion</li> </ul></li> <li>For relevant items above, what's the distribution of X, in general or under certain circumstances?</li> <li>What hosts have unusually poor service-level performance for every service on the host, after controlling for load, etc.? <ul> <li>This has usually turned out to be due to a hardware misconfiguration or fault</li> </ul></li> <li>Which services don't play nicely with other services aside from the general impact on host-level load?</li> <li>What's the latency impact of failover, or other high-load events? 
<ul> <li>What level of load should we expect in the future given a future high-load event plus current growth?</li> <li>Which services see more load during failover, which services see unchanged load, and which fall somewhere in between?</li> </ul></li> <li>What config changes can we make for any fixed-size buffer or allocation that will improve performance without increasing cost or reduce cost without degrading performance?</li> <li>For some particular host-level health problem, what's the probability it recurs if we see it N times?</li> <li>etc., there are a lot of questions that become easy to answer if you can write arbitrary queries against historical metrics data</li> </ul> <h4 id="design-decisions">Design decisions</h4> <p>LTM is about as boring a system as is possible. Every design decision falls out of taking the path of least resistance.</p> <ul> <li>Why use Scalding? <ul> <li>It's standard at Twitter and the integration made everything trivial. I tried Spark, which has some advantages. However, at the time, I would have had to do manual integration work that I got for free with Scalding.</li> </ul></li> <li>Why use Presto and not something that allows for live slice &amp; dice queries like Druid? <ul> <li><a href="tracing-analytics/">Rebecca Isaacs and Jonathan Simms were doing related work on tracing</a> and we knew that we'd want to do joins between LTM and whatever they created. That's trivial with Presto but would have required more planning and work with something like Druid, at least at the time.</li> <li>George Sirois imported a subset of the data into Druid so we could play with it and the facilities it offers are very nice; it's probably worth re-visiting at some point</li> </ul></li> <li>Why not use Postgres or something similar? <ul> <li>The amount of data we want to store makes this infeasible without a massive amount of effort; even though the cost of data storage is quite low, it's still a &quot;big data&quot; problem</li> </ul></li> <li>Why Parquet instead of a more efficient format? <ul> <li>It was the most suitable of the standard supported formats (the other major supported format is raw thrift); introducing a new format would be a much larger project than this project</li> </ul></li> <li>Why is the system not real-time (with delays of at least one hour)? <ul> <li>Twitter's batch job pipeline is easy to build on; all that was necessary was to read some tutorial on how it works and then write something similar, but with different business logic.</li> <li>There was a nicely written proposal to build a real-time analytics pipeline for metrics data written a couple years before I joined Twitter, but that never got built because (I estimate) it would have been one to four quarters of work to produce an MVP and it wasn't clear what team had the right mandate to work on that and also had 4 quarters of headcount available. 
But adding a batch job took one day; you don't need to have roadmap and planning meetings for a day of work, you can just do it and then do follow-on work incrementally.</li> <li>If we're looking for misconfigurations or optimization opportunities, these rarely go away within an hour (and if they did, they must've had small total impact) and, in fact, they often persist for months to years, so we don't lose much by giving up on real-time (we do lose the ability to use the output of this for some monitoring use cases)</li> <li>The real-time version would've been a system with significant operational cost that can't be operated by one person without undue burden. This system has more operational/maintenance burden than I'd like, probably 1-2 days of my time per month on average, which at this point makes that a pretty large fraction of the total cost of the system, but it never pages, and the amount of work can easily be handled by one person.</li> </ul></li> </ul> <h4 id="boring-technology">Boring technology</h4> <p>I think writing about systems like this, which are just boring work, is really underrated. A disproportionate number of posts and talks I read are about systems using hot technologies. I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing. Since posts and talks about boring work are relatively rare, I think writing up something like this is more useful than it has any right to be.</p> <p>For example, a couple years ago, at a local meetup that Matt Singer organizes for companies in our size class to discuss infrastructure (basically, companies that are smaller than FB/Amazon/Google), I asked if anyone was doing something similar to what we'd just done. No one who was there was (or no one who'd admit to it, anyway), and engineers from two different companies expressed shock that we could store so much data, and not just the average per time period, but some histogram information as well. This work is too straightforward and obvious to be novel; I'm sure people have built analogous systems in many places. It's literally just storing metrics data on HDFS (or, if you prefer a more general term, a data lake) indefinitely in a format that allows interactive queries.</p> <p>If you do the math on the cost of metrics data storage for a project like this in a company in our size class, the storage cost is basically a rounding error. We've shipped individual diffs that easily pay for the storage cost for decades. I don't think there's any reason storing a few years or even a decade worth of metrics should be shocking when people deploy analytics and observability tools that cost much more all the time. But it turns out this was surprising, in part because people don't write up work this boring.</p> <p>An unrelated example is that, a while back, I ran into someone at a similarly sized company who wanted to get similar insights out of their metrics data. Instead of starting with something that would take a day, like this project, they started with deep learning. While I think there's value in applying ML and/or stats to infra metrics, they turned a project that could return significant value to the company after a couple of person-days into a project that took person-years. And if you're only going to <em>either</em> apply simple heuristics guided by someone with infra experience and simple statistical models <em>or</em> naively apply deep learning, I think the former has much higher ROI. 
Applying both sophisticated stats/ML <em>and</em> practitioner guided heuristics together can get you better results than either alone, but I think it makes a lot more sense to start with the simple project that takes a day to build out and maybe another day or two to start to apply than to start with a project that takes months or years to build out and start to apply. But there are a lot of biases towards doing the larger project: it makes a better resume item (deep learning!), in many places, it makes a better promo case, and people are more likely to give a talk or write up a blog post on the cool system that uses deep learning.</p> <p>The above discusses why writing up work is valuable for the industry in general. <a href="corp-eng-blogs/">We covered why writing up work is valuable to the company doing the write-up in a previous post</a>, so I'm not going to re-hash that here.</p> <h4 id="appendix-stuff-i-screwed-up">Appendix: stuff I screwed up</h4> <p><a href="https://twitter.com/danluu/status/1220228489522974721">I think it's unfortunate that you don't get to hear about the downsides of systems without backchannel chatter</a>, so here are things I did that are pretty obvious mistakes in retrospect. I'll add to this when something else becomes obvious in retrospect.</p> <ul> <li>Not using a double for almost everything <ul> <li>In an ideal world, some things aren't doubles, but everything in our metrics stack goes through a stage where basically every metric is converted to a double</li> <li>I stored most things that &quot;should&quot; be an integral type as an integral type, but doing the conversion from <code>long -&gt; double -&gt; long</code> is never going to be more precise than just doing the <code>long -&gt; double</code> conversion and it opens the door to other problems (there's a short illustration of the precision loss just after this list)</li> <li>I stored some things that shouldn't be an integral type as an integral type, which causes small values to unnecessarily lose precision <ul> <li>Luckily this hasn't caused serious errors for any actionable analysis I've done, but there are analyses where it could cause problems</li> </ul></li> </ul></li> <li>Using asserts instead of writing bad entries out to some kind of &quot;bad entries&quot; table <ul> <li>For reasons that are out of scope of this post, there isn't really a reasonable way to log errors or warnings in Scalding jobs, so I used asserts to catch things that shouldn't happen, which causes the entire job to die every time something unexpected happens; a better solution would be to write bad input entries out into a table and then have that table emailed out as a soft alert if the table isn't empty <ul> <li>An example of a case where this would've saved some operational overhead is where we had an unusual amount of clock skew (3600 years), which caused a timestamp overflow. If I had a table that was a log of bad entries, the bad entry would've been omitted from the output, which is the correct behavior, and it would've saved an interruption plus having to push a fix and re-deploy the job.</li> </ul></li> </ul></li> <li>Longterm vs. LongTerm in the code <ul> <li>I wasn't sure which way this should be capitalized when I was first writing this and, when I made a decision, I failed to grep for and squash everything that was written the wrong way, so now this pointless inconsistency exists in various places</li> </ul></li> </ul> <p>These are the kinds of things you expect when you crank out something quickly and don't think it through enough.
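To make the first item concrete, here's a minimal illustration of why the <code>long -&gt; double</code> round trip loses information once values exceed 2^53 (the illustration uses awk, which stores numbers as doubles, purely because it's easy to run; the actual jobs are Scalding code):</p> <pre><code># 2^53 + 1 = 9007199254740993 is not representable as an IEEE 754 double,
# so it silently rounds to the nearest representable integer.
awk 'BEGIN { printf &quot;%.0f\n&quot;, 9007199254740993 }'   # prints 9007199254740992
</code></pre> <p>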
The last item is trivial to fix and not much of a problem since the ubiquitous use of IDEs at Twitter means that basically anyone who would be impacted will have their IDE supply the correct capitalization for them.</p> <p>The first item is more problematic, both in that it could actually cause incorrect analyses and in that fixing it will require doing a migration of all the data we have. My guess is that, at this point, this will be half a week to a week of work, which I could've easily avoided by spending thirty more seconds thinking through what I was doing.</p> <p>The second item is somewhere in between. Between the first and second items, I think I've probably signed up for roughly double the amount of direct work on this system (so, not including time spent on data analysis on data in the system, just the time spent to build the system) for essentially no benefit.</p> <p><small>Thanks to Leah Hanson, Andy Wilcox, Lifan Zeng, and Matej Stuchlik for comments/corrections/discussion</small></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="corp-eng-blogs/"> <link rel="prefetch" href="tracing-analytics/"> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D"><p>The actual work involved was about a day's work, but it was done over a week since I had to learn Scala as well as Scalding and the general Twitter stack, the metrics stack, etc.</p> <p>One day is also just an estimate for the work for the initial data sets. Since then, I've done probably a couple more weeks of work and Wesley Aptekar-Cassels and Kunal Trivedi have probably put in another week or two of time. The operational cost is probably something like 1-2 days of my time per month (on average), bringing the total cost to on the order of a month or two.</p> <p>I'm also not counting time spent using the dataset, or time spent debugging issues, which will include a lot of time that I can only roughly guess at, e.g., when the compute platform team changed the network egress limits as a result of some data analysis that took about an hour, which exposed a latent mesos bug that probably cost a day of Ilya Pronin's time; David Mackey has spent a fair amount of time tracking down weird issues where the data shows something odd is going on, but we don't know what it is; etc. If you wanted to fully account for time spent on work that came out of some data analysis on the data sets discussed in the post, I suspect, between service-level teams, plus platform-level teams like our JVM, OS, and HW teams, we're probably at roughly 1 person-year of time.</p> <p>But, because the initial work it took to create a working and useful system was a day plus time spent working on orientation material and the system returned seven figures, it's been very easy to justify all of this additional time spent, which probably wouldn't have been the case if a year of up-front work was required.
Most of the rest of the time isn't the kind of thing that's usually &quot;charged&quot; in roadmap reviews when creating a system (time spent by users, operational overhead), but perhaps the ongoing operational cost should be &quot;charged&quot; when creating the system (I don't think it makes sense to &quot;charge&quot; time spent by users to the system since, the more useful a system is, the more time users will spend using it; that doesn't really seem like a cost).</p> <p>There's also been work to build tools on top of this: Kunal Trivedi has spent a fair amount of time building a layer on top of this to make the presentation more user friendly than SQL queries, which could arguably be charged to this project.</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> </ol> </div> How (some) good corporate engineering blogs are written corp-eng-blogs/ Wed, 11 Mar 2020 00:00:00 +0000 corp-eng-blogs/ <p>I've been comparing notes with people who run corporate engineering blogs and one thing that I think is curious is that it's pretty common for my personal blog to get more traffic than the entire corp eng blog for a company with a nine to ten figure valuation and it's not uncommon for my blog to get an order of magnitude more traffic.</p> <p>I think this is odd because tech companies in that class often have hundreds to thousands of employees. They're overwhelmingly likely to be better equipped to write a compelling blog than I am and companies get a lot more value from having a compelling blog than I do.</p> <p>With respect to the former, employees of the company will have done more interesting engineering work, have more fun stories, and have more in-depth knowledge than any one person who has a personal blog. On the latter, my blog helps me with job searching and it helps companies hire. But I only need one job, so more exposure, at best, gets me a slightly better job, whereas all but one tech company I've worked for is desperate to hire and loses candidates to other companies all the time. Moreover, I'm not really competing against other candidates when I interview (even if we interview for the same job, if the company likes more than one of us, it will usually just make more jobs). The high-order bit on this blog with respect to job searching is whether or not the process can take significant non-interview feedback or if <a href="algorithms-interviews/">I'll fail the interview because they do a conventional interview</a> and the marginal value of an additional post is probably very low with respect to that. On the other hand, companies compete relatively directly when recruiting, so being more compelling relative to another company has value to them; replicating the playbook Cloudflare or Segment has used with their engineering &quot;brands&quot; would be a significant recruiting advantage. The playbook isn't secret: these companies broadcast their output to the world and are generally happy to talk about their blogging process.</p> <p>Despite the seemingly obvious benefits of having a &quot;good&quot; corp eng blog, most corp eng blogs are full of stuff engineers don't want to read.
Vague, high-level fluff about how amazing everything is, content marketing, handwave-y posts about the new hotness (today, that might be using deep learning for inappropriate applications; ten years ago, that might have been using &quot;big data&quot; for inappropriate applications), etc.</p> <p>To try to understand what companies with good corporate engineering blogs have in common, I interviewed folks at three different companies that have compelling corporate engineering blogs (Cloudflare, Heap, and Segment) as well as folks at three different companies that have lame corporate engineering blogs (which I'm not going to name).</p> <p>At a high level, the compelling engineering blogs had processes that shared the following properties:</p> <ul> <li>Easy approval process, not many approvals necessary</li> <li>Few or no non-engineering approvals required</li> <li>Implicit or explicit fast <a href="https://en.wikipedia.org/wiki/Service-level_objective">SLO</a> on approvals</li> <li>Approval/editing process mainly makes posts more compelling to engineers</li> <li>Direct, high-level (co-founder, C-level, or VP-level) support for keeping blog process lightweight</li> </ul> <p>The less compelling engineering blogs had processes that shared the following properties:</p> <ul> <li>Slow approval process</li> <li>Many approvals necessary</li> <li>Significant non-engineering approvals necessary <ul> <li>Non-engineering approvals suggest changes authors find frustrating</li> <li>Back-and-forth can go on for months</li> </ul></li> <li>Approval/editing process mainly de-risks posts, removes references to specifics, makes posts vaguer and less interesting to engineers</li> <li>Effectively no high-level support for blogging <ul> <li>Leadership may agree that blogging is good in the abstract, but it's not a high enough priority to take concrete action</li> <li>Reforming process to make blogging easier very difficult; previous efforts have failed</li> <li>Changing process to reduce overhead requires all &quot;stakeholders&quot; to sign off (14 in one case) <ul> <li>Any single stakeholder can block</li> <li>No single stakeholder can approve</li> </ul></li> <li>Stakeholders wary of approving anything that reduces overhead <ul> <li>Approving involves taking on perceived risk (what if something bad happens) with no perceived benefit to them</li> </ul></li> </ul></li> </ul> <p>One person at a company with a compelling blog noted that a downside of having only one approver and/or one primary approver is that if that person is busy, it can take weeks to get posts approved. That's fair; that's a downside of having centralized approval.
However, when we compare to the alternative processes, at one company, people noted that it's typical for approvals to take three to six months and tail cases can take a year.</p> <p>While a few weeks can seem like a long time for someone used to a fast moving company, people at slower moving companies would be ecstatic to have an approval process that only takes twice that long.</p> <p>Here are the processes, as described to me, for the three companies I interviewed (presented in <code>sha512sum</code> order, which is coincidentally ordered by increasing size of company, from a couple hundred employees to nearly one thousand employees):</p> <h4 id="heap">Heap</h4> <ul> <li>Someone has an idea to write a post</li> <li>Writer (who is an engineer) is paired with a &quot;buddy&quot;, who edits and then approves the post <ul> <li>Buddy is an engineer who has a track record of producing reasonable writing</li> <li>This may take a few rounds, may change thrust of the post</li> </ul></li> <li>CTO reads and approves <ul> <li>Usually only minor feedback</li> <li>May make suggestions like &quot;a designer could make this graph look better&quot;</li> </ul></li> <li>Publish post</li> </ul> <p>The first editing phase used to involve posting a draft to a slack channel where &quot;everyone&quot; would comment on the post. This was an unpleasant experience since &quot;everyone&quot; would make comments and a lot of revision would be required. The current process was designed to avoid getting &quot;too much&quot; feedback.</p> <h4 id="segment">Segment</h4> <ul> <li>Someone has an idea to write a post <ul> <li>Often comes from: internal docs, external talk, shipped project, open source tooling (built by Segment)</li> </ul></li> <li>Writer (who is an engineer) writes a draft <ul> <li>May have a senior eng work with them to write the draft</li> </ul></li> <li>Until recently, no one really owned the feedback process <ul> <li>Calvin French-Owen (co-founder) and Rick (engineering manager) would usually give most feedback</li> <li>Maybe also get feedback from manager and eng leadership</li> <li>Typically, 3rd draft is considered finished</li> <li>Now, have a full-time editor who owns editing posts</li> </ul></li> <li>Also socialize among eng team, get feedback from 15-20 people</li> <li>PR and legal will take a look, lightweight approval</li> </ul> <p>Some changes that have been made include:</p> <ul> <li>At one point, when trying to establish an &quot;engineering brand&quot;, making in-depth technical posts a top-level priority</li> <li>Had a &quot;blogging retreat&quot;, one week spent on writing a post</li> <li>Added writing and speaking as explicit criteria to be rewarded in performance reviews and career ladders</li> </ul> <p>Although there's legal and PR approval, Calvin noted &quot;In general we try to keep it fairly lightweight.
I see the bigger problem with blogging being a lack of posts or vague, high level content which isn't interesting rather than revealing too much.&quot;</p> <h4 id="cloudflare">Cloudflare</h4> <ul> <li>Someone has an idea to write a post <ul> <li>Internal blogging is part of the culture, some posts come from the internal blog</li> </ul></li> <li>John Graham-Cumming (CTO) reads every post, other folks will read and comment <ul> <li>John is approver for posts</li> </ul></li> <li>Matthew Prince (CEO) also generally supportive of blogging</li> <li>&quot;Very quick&quot; legal approval process, SLO of 1 hour <ul> <li>This process is so lightweight that one person didn't really think of it as an approval, another person didn't mention it at all (a third person did mention this step)</li> <li>Comms generally not involved</li> </ul></li> </ul> <p>One thing to note is that this only applies to technical blog posts. Product announcements have a heavier process because they're tied to sales material, press releases, etc.</p> <p>One thing I find interesting is that Marek interviewed at Cloudflare because of their blog (<a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/">this 2013 blog post on their 4th generation servers caught his eye</a>) and he's now both a key engineer for them as well as one of the main sources of compelling Cloudflare blog posts. At this point, the Cloudflare blog has generated at least a few more generations of folks who interviewed because they saw a blog post and now write compelling posts for the blog.</p> <h4 id="negative-example-1">Negative example #1</h4> <ul> <li>Many people suggested I use this company as a positive example because, in the early days, they had a semi-lightweight process like the above</li> <li>The one thing that made the process non-lightweight was that a founder insisted on signing off on posts and would often heavily rewrite them, but the blog was a success and a big driver of recruiting</li> <li>As the company scaled up, founder approval took longer and longer, causing lengthy delays in the blog process</li> <li>At some point, an outsider was hired to take over the blog publishing process because it was considered important to leadership</li> <li>Afterwards, the process became filled with typical anti-patterns, taking months for approval, with many iterations of changes that engineers found frustrating and that made their blog posts less compelling <ul> <li>Multiple people told me that they vowed to never write another blog post for the company after doing one because the process was so painful</li> <li>The good news is that, long after the era of the blog having a reasonable process ended, the memory of the blog having good output still gave many outsiders a positive impression about the company and its engineering</li> </ul></li> </ul> <h4 id="negative-example-2">Negative example #2</h4> <ul> <li>A friend of mine tried to publish a blog post and it took six months for &quot;comms&quot; to approve</li> <li>About a year after the above, due to the reputation of &quot;negative example #1&quot;, &quot;negative example #2&quot; hired the person who ran the process at &quot;negative example #1&quot; to a senior position in PR/comms and to run the blogging process at this company.
At &quot;negative example #1&quot;, this person took over when the blog was something engineers wanted to write for and was the primary driver of the blog process becoming so onerous that engineers vowed to never write a blog post again after writing one post</li> <li>Hiring the person who presided over the decline of &quot;negative example #1&quot; to improve the process at &quot;negative example #2&quot; did not streamline the process or result in more or better output at &quot;negative example #2&quot;</li> </ul> <h3 id="general-comments">General comments</h3> <p>My opinion is that the natural state of a corp eng blog where people <a href="p95-skill/">get a bit of feedback</a> is a pretty interesting blog. There's <a href="https://twitter.com/rakyll/status/1043952902157459456">a dearth of real, in-depth, technical writing</a>, which makes any half decent, honest, public writing about technical work interesting.</p> <p>In order to have a boring blog, the corporation has to actively stop engineers from putting interesting content out there. Unfortunately, it appears that the natural state of large corporations tends towards risk aversion and blocking people from writing, just in case it causes a legal or PR or other problem. Individual contributors (ICs) might have the opinion that it's ridiculous to block engineers from writing low-risk technical posts while, simultaneously, C-level execs and VPs regularly make public comments that turn into PR disasters, but ICs in large companies don't have the authority or don't feel like they have the authority to do something just because it makes sense. And none of the fourteen stakeholders who'd have to sign off on approving a streamlined process care about streamlining the process since that would be good for the company in a way that doesn't really impact them, not when that would mean seemingly taking responsibility for the risk a streamlined process would add, however small. An exec or a senior VP willing to take a risk can take responsibility for the fallout and, if they're interested in engineering recruiting or morale, they may see a reason to do so.</p> <p>One comment I've often heard from people at more bureaucratic companies is something like &quot;every company our size is like this&quot;, but that's not true. Cloudflare, a $6B company approaching 1k employees, is in the same size class as many other companies with a much more onerous blogging process. The corp eng blog situation seems similar to the situation with giving real interview feedback. <a href="http://blog.interviewing.io/no-engineer-has-ever-sued-a-company-because-of-constructive-post-interview-feedback-so-why-dont-employers-do-it/">interviewing.io claims that there's significant upside and very little downside to doing so</a>. Some companies actually do give real feedback and the ones that do generally find that it gives them an easy advantage in recruiting with little downside, but the vast majority of companies don't do this and people at those companies will claim that it's impossible to give feedback since you'll get sued or the company will be &quot;cancelled&quot; even though this generally doesn't happen to companies that give feedback and there are even entire industries where it's common to give interview feedback.
It's easy to handwave that some risk exists and very few people have the authority to dismiss vague handwaving about risk when it's coming from multiple orgs.</p> <p>Although this is a small sample size and it's dangerous to generalize too much from small samples, the idea that you need high-level support to blast through bureaucracy is consistent with what I've seen in other areas where most large companies have a hard time doing something easy that has obvious but diffuse value. While this post happens to be about blogging, I've heard stories that are the same shape on a wide variety of topics.</p> <h3 id="appendix-examples-of-compelling-blog-posts">Appendix: examples of compelling blog posts</h3> <p>Here are some blog posts from the blogs mentioned with a short comment on why I thought the post was compelling. This time, in reverse sha512 hash order.</p> <h4 id="cloudflare-1">Cloudflare</h4> <ul> <li><a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/">https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/</a> <ul> <li>Talks about a real technical problem that impacted a lot of people, reasonably in depth</li> <li>Timely, released only eight hours after the outage, when people were still really interested in hearing about what happened; most companies can't turn around a compelling blog post this quickly or can only do it on a special-case basis, Cloudflare is able to crank out timely posts semi-regularly</li> </ul></li> <li><a href="https://blog.cloudflare.com/the-relative-cost-of-bandwidth-around-the-world/">https://blog.cloudflare.com/the-relative-cost-of-bandwidth-around-the-world/</a> <ul> <li>Exploration of some data</li> </ul></li> <li><a href="https://blog.cloudflare.com/the-story-of-one-latency-spike/">https://blog.cloudflare.com/the-story-of-one-latency-spike/</a> <ul> <li>A debugging story</li> </ul></li> <li><a href="https://blog.cloudflare.com/when-bloom-filters-dont-bloom/">https://blog.cloudflare.com/when-bloom-filters-dont-bloom/</a> <ul> <li>A debugging story, this time in the context of developing a data structure</li> </ul></li> </ul> <h4 id="segment-1">Segment</h4> <ul> <li><a href="https://segment.com/blog/when-aws-autoscale-doesn-t/">https://segment.com/blog/when-aws-autoscale-doesn-t/</a> <ul> <li>Concrete explanation of a gotcha in a widely used service</li> </ul></li> <li><a href="https://segment.com/blog/gotchas-from-two-years-of-node/">https://segment.com/blog/gotchas-from-two-years-of-node/</a> <ul> <li>Concrete example and explanation of a gotcha in a widely used tool</li> </ul></li> <li><a href="https://segment.com/blog/automating-our-infrastructure/">https://segment.com/blog/automating-our-infrastructure/</a> <ul> <li>Post with specific details about how a company operates; in theory, any company could write this, but few do</li> </ul></li> </ul> <h4 id="heap-1">Heap</h4> <ul> <li><a href="https://heap.io/blog/engineering/basic-performance-analysis-saved-us-millions">https://heap.io/blog/engineering/basic-performance-analysis-saved-us-millions</a> <ul> <li>Talks about a real problem and solution</li> </ul></li> <li><a href="https://heap.io/blog/engineering/clocksource-aws-ec2-vdso">https://heap.io/blog/engineering/clocksource-aws-ec2-vdso</a> <ul> <li>Talks about a real problem and solution</li> <li>In HN comments, engineers (malisper, kalmar) have technical responses with real reasons in them and not just the usual dissembling that 
you see in most cases</li> </ul></li> <li><a href="https://heap.io/blog/analysis/migrating-to-typescript">https://heap.io/blog/analysis/migrating-to-typescript</a> <ul> <li>Real talk about how the first attempt at driving a company-wide technical change failed</li> </ul></li> </ul> <p>One thing to note is that these blogs all have different styles. Personally, I prefer the style of Cloudflare's blog, which has a higher proportion of &quot;deep dive&quot; technical posts, but different people will prefer different styles. There are a lot of styles that can work.</p> <p><small>Thanks to Marek Majkowski, Kamal Marhubi, Calvin French-Owen, John Graham-Cumming, Michael Malis, Matthew Prince, Yuri Vishnevsky, Julia Evans, Wesley Aptekar-Cassels, Nathan Reed, Jake Seliger, an anonymous commenter, plus sources from the companies I didn't name for comments/corrections/discussion; none of the people explicitly mentioned in the acknowledgements were sources for information on the less compelling blogs</small></p> <p><link rel="prefetch" href="algorithms-interviews/"> <link rel="prefetch" href="p95-skill/"> <link rel="prefetch" href=""> <link rel="prefetch" href="about/"></p> The growth of command line options, 1979-Present cli-complexity/ Tue, 03 Mar 2020 00:00:00 +0000 cli-complexity/ <p><a href="https://www.xkcd.com/1795/">My hobby</a>: opening up <a href="mcilroy-unix/">McIlroy’s UNIX philosophy</a> on one monitor while reading manpages on the other.</p> <p>The first of McIlroy's dicta is often paraphrased as &quot;do one thing and do it well&quot;, which is <a href="mcilroy-unix/">shortened from</a> &quot;Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new 'features.'&quot;</p> <p>McIlroy's example of this dictum is:</p> <blockquote> <p>Surprising to outsiders is the fact that UNIX compilers produce no listings: printing can be done better and more flexibly by a separate program.</p> </blockquote> <p>If you open up a manpage for <code>ls</code> on mac, you’ll see that it starts with</p> <pre><code>ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...] </code></pre> <p>That is, the one-letter flags to <code>ls</code> include every lowercase letter except for <code>{jvyz}</code>, 14 uppercase letters, plus <code>@</code> and <code>1</code>.
That’s 22 + 14 + 2 = 38 single-character options alone.</p> <p>On ubuntu 17, if you read the manpage for coreutils <code>ls</code>, you don’t get a nice summary of options, but you’ll see that <code>ls</code> has 58 options (including <code>--help</code> and <code>--version</code>).</p> <p>To see if <code>ls</code> is an aberration or if it's normal to have commands that do this much stuff, we can look at some common commands, sorted by frequency of use.</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <tr> <th>command</th><th>1979</th><th>1996</th><th>2015</th><th>2017</th></tr> <tr> <td>ls</td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#2171b5><font color=white>42</font></td><td bgcolor=#08519c><font color=white>58</font></td><td bgcolor=#08519c><font color=white>58</font></td></tr> <tr> <td>rm</td><td bgcolor=#deebf7><font color=black>3</font></td><td bgcolor=#9ecae1><font color=black>7</font></td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#6baed6><font color=black>12</font></td></tr> <tr> <td>mkdir</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#9ecae1><font color=black>7</font></td></tr> <tr> <td>mv</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#9ecae1><font color=black>9</font></td><td bgcolor=#6baed6><font color=black>13</font></td><td bgcolor=#6baed6><font color=black>14</font></td></tr> <tr> <td>cp</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#4292c6><font color=black>18</font></td><td bgcolor=#2171b5><font color=white>30</font></td><td bgcolor=#2171b5><font color=white>32</font></td></tr> <tr> <td>cat</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>12</font></td></tr> <tr> <td>pwd</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#deebf7><font color=black>2</font></td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#c6dbef><font color=black>4</font></td></tr> <tr> <td>chmod</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#9ecae1><font color=black>9</font></td><td bgcolor=#9ecae1><font color=black>9</font></td></tr> <tr> <td>echo</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#c6dbef><font color=black>5</font></td><td bgcolor=#c6dbef><font color=black>5</font></td></tr> <tr> <td>man</td><td bgcolor=#c6dbef><font color=black>5</font></td><td bgcolor=#6baed6><font color=black>16</font></td><td bgcolor=#2171b5><font color=white>39</font></td><td bgcolor=#2171b5><font color=white>40</font></td></tr> <tr> <td>which</td><td bgcolor=silver><font color=black></font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#f7fbff><font color=black>1</font></td></tr> <tr> <td>sudo</td><td bgcolor=silver><font color=black></font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#4292c6><font color=black>23</font></td><td bgcolor=#4292c6><font color=black>25</font></td></tr> <tr> <td>tar</td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#08519c><font color=white>53</font></td><td bgcolor=black><font 
color=white>134</font></td><td bgcolor=black><font color=white>139</font></td></tr> <tr> <td>touch</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#9ecae1><font color=black>9</font></td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#6baed6><font color=black>11</font></td></tr> <tr> <td>clear</td><td bgcolor=silver><font color=black></font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=white><font color=black>0</font></td></tr> <tr> <td>find</td><td bgcolor=#6baed6><font color=black>14</font></td><td bgcolor=#08519c><font color=white>57</font></td><td bgcolor=#08519c><font color=white>82</font></td><td bgcolor=#08519c><font color=white>82</font></td></tr> <tr> <td>ln</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#6baed6><font color=black>15</font></td><td bgcolor=#6baed6><font color=black>16</font></td></tr> <tr> <td>ps</td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#4292c6><font color=black>22</font></td><td bgcolor=#08306b><font color=white>85</font></td><td bgcolor=#08306b><font color=white>85</font></td></tr> <tr> <td>ping</td><td bgcolor=silver><font color=black></font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#2171b5><font color=white>29</font></td></tr> <tr> <td>kill</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#deebf7><font color=black>3</font></td><td bgcolor=#deebf7><font color=black>3</font></td><td bgcolor=#deebf7><font color=black>3</font></td></tr> <tr> <td>ifconfig</td><td bgcolor=silver><font color=black></font></td><td bgcolor=#6baed6><font color=black>16</font></td><td bgcolor=#4292c6><font color=black>25</font></td><td bgcolor=#4292c6><font color=black>25</font></td></tr> <tr> <td>chown</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#6baed6><font color=black>15</font></td><td bgcolor=#6baed6><font color=black>15</font></td></tr> <tr> <td>grep</td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#4292c6><font color=black>22</font></td><td bgcolor=#2171b5><font color=white>45</font></td><td bgcolor=#2171b5><font color=white>45</font></td></tr> <tr> <td>tail</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#9ecae1><font color=black>7</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>13</font></td></tr> <tr> <td>df</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#6baed6><font color=black>10</font></td><td bgcolor=#4292c6><font color=black>17</font></td><td bgcolor=#4292c6><font color=black>18</font></td></tr> <tr> <td>top</td><td bgcolor=silver><font color=black></font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>14</font></td></tr> </table></p> <p>This table has the number of command line options for various commands for v7 Unix (1979), slackware 3.1 (1996), ubuntu 12 (2015), and ubuntu 17 (2017). 
Cells are darker and blue-er when they have more options (log scale) and are greyed out if no command was found.</p> <p>We can see that the number of command line options has dramatically increased over time; entries tend to get darker going to the right (more options) and there are no cases where entries get lighter (fewer options). </p> <p><a href="https://archive.org/details/DougMcIlroy_AncestryOfLinux_DLSLUG">McIlroy has long decried the increase in the number of options, size, and general functionality of commands</a><sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>:</p> <blockquote> <p>Everything was small and my heart sinks for Linux when I see the size [inaudible]. The same utilities that used to fit in eight k[ilobytes] are a meg now. And the manual page, which used to really fit on, which used to really be a manual <em>page</em>, is now a small volume with a thousand options... We used to sit around in the UNIX room saying &quot;what can we throw out? Why is there this option?&quot; It's usually, it's often because there's some deficiency in the basic design — you didn't really hit the right design point. Instead of putting in an option, figure out why, what was forcing you to add that option. This viewpoint, which was imposed partly because there was very small hardware ... has been lost and we're not better off for it.</p> </blockquote> <p>Ironically, one of the reasons for the rise in the number of command line options is another McIlroy dictum, &quot;Write programs to handle text streams, because that is a universal interface&quot; (see <code>ls</code> for one example of this).</p> <p>If structured data or objects were passed around, formatting could be left to a final formatting pass. But, with plain text, the formatting and the content are intermingled; because formatting can only be done by parsing the content out, it's common for commands to add formatting options for convenience. Alternately, formatting can be done when the user leverages their knowledge of the structure of the data and encodes that knowledge into arguments to <code>cut</code>, <code>awk</code>, <code>sed</code>, etc. (also using their knowledge of how those programs handle formatting; it's different for different programs and the user is expected to, for example, <a href="https://unix.stackexchange.com/a/132322/261842">know how <code>cut -f4</code> is different from <code>awk '{ print $4 }'</code></a><sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>). That's a lot more hassle than passing in one or two arguments to the last command in a sequence and it pushes the complexity from the tool to the user.</p> <p>People sometimes say that they don't want to support structured data because they'd have to support multiple formats to make a universal tool, but they already have to support multiple formats to make a universal tool. Some standard commands can't read output from other commands because they use different formats, <code>wc -w</code> doesn't handle Unicode correctly, etc. Saying that &quot;text&quot; is a universal format is like saying that &quot;binary&quot; is a universal format.</p> <p>I've heard people say that there isn't really any alternative to this kind of complexity for command line tools, but people who say that have never really tried the alternative, something like PowerShell. 
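Before getting to PowerShell, it's worth making the earlier point about <code>cut</code> and <code>awk</code> concrete; the two-line example below is made up, but the behavior is standard:</p> <pre><code># Two spaces between the fields: cut treats every space as a delimiter,
# so field 2 is the empty field between the two spaces.
printf 'alice  42\n' | cut -d ' ' -f 2     # prints an empty line
# awk collapses runs of whitespace into a single separator.
printf 'alice  42\n' | awk '{ print $2 }'  # prints 42
</code></pre> <p>The data is the same in both cases; what differs is the formatting knowledge the user has to carry around to get at it.</p> <p>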
I have plenty of complaints about PowerShell, but passing structured data around and easily being able to operate on structured data without having to hold metadata information in my head so that I can pass the appropriate metadata to the right command line tools at the right places in the pipeline isn't among my complaints<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">3</a></sup>.</p> <p>The sleight of hand that's happening when someone says that we can keep software simple and compatible by making everything handle text is the pretense that text data doesn't have a structure that needs to be parsed<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>. In some cases, we can just think of everything as a single space separated line, or maybe a table with some row and column separators that we specify (<a href="https://unix.stackexchange.com/a/132322/261842">with some behavior that isn't consistent across tools, of course</a>). That adds some hassle when it works, and then there are the cases where serializing data to a flat text format adds considerable complexity since the structure of data means that simple flattening requires significant parsing work to re-ingest the data in a meaningful way.</p> <p>Another reason commands now have more options is that people have added convenience flags for functionality that could have been done by cobbling together a series of commands. These go all the way back to v7 unix, where <code>ls</code> has an option to reverse the sort order (which could have been done by passing the output to something like <code>tac</code> had they written <code>tac</code> instead of adding a special-case reverse option).</p> <p>Over time, more convenience options have been added. For example, to pick a command that originally had zero options, <code>mv</code> can move <em>and</em> create a backup (three options; two are different ways to specify a backup, one of which takes an argument and the other of which takes zero explicit arguments and reads an implicit argument from the <code>VERSION_CONTROL</code> environment variable; one option allows overriding the default backup suffix). <code>mv</code> now also has options to never overwrite and to only overwrite if the file is newer.</p> <p><code>mkdir</code> is another program that used to have no options where, excluding security things for SELinux or SMACK as well as help and version options, the added options are convenience flags: setting the permissions of the new directory and making parent directories if they don't exist.</p> <p>If we look at <code>tail</code>, which originally had one option (<code>-number</code>, telling <code>tail</code> where to start), it's added both formatting and convenience options. For formatting, it has <code>-z</code>, which makes the line delimiter <code>null</code> instead of a newline. Some examples of convenience options are <code>-f</code> to print when there are new changes, <code>-s</code> to set the sleep interval between checking for <code>-f</code> changes, and <code>--retry</code> to retry if the file isn't accessible.</p> <p>McIlroy says &quot;we're not better off&quot; for having added all of these options but I'm better off. I've never used some of the options we've discussed and only rarely use others, but that's the beauty of command line options — unlike with a GUI, adding these options doesn't clutter up the interface.
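For a sense of what the convenience flags described above look like in use, here are two short examples in GNU syntax (the file names are made up):</p> <pre><code># Follow a log by name, retrying until it exists, polling every 5 seconds.
tail --follow=name --retry -s 5 /var/log/app.log

# Move a config file into place, keeping a numbered backup of anything overwritten;
# plain -b instead would read the backup style from the VERSION_CONTROL environment variable.
mv --backup=numbered new.conf /etc/app.conf
</code></pre> <p>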
The manpage can get cluttered, but in the age of google and stackoverflow, I suspect many people just search for a solution to what they're trying to do without reading the manpage anyway.</p> <p>This isn't to say there's no cost to adding options — more options means more maintenance burden, but that's a cost that maintainers pay to benefit users, which isn't obviously unreasonable considering the ratio of maintainers to users. This is analogous to Gary Bernhardt's comment that it's reasonable to practice a talk fifty times since, if there's a three hundred person audience, the ratio of time spent watching the talk to time spent practicing will still only be 1:6. In general, this ratio will be even more extreme with commonly used command line tools.</p> <p>Someone might argue that all these extra options create a burden for users. That's not exactly wrong, but that complexity burden was always going to be there, it's just a question of where the burden was going to lie. If you think of the set of command line tools along with a shell as forming a language, a language where anyone can write a new method and it effectively gets added to the standard library if it becomes popular, where standards are defined by dicta like &quot;write programs to handle text streams, because that is a universal interface&quot;, the language was always going to turn into a write-only incoherent mess when taken as a whole. At least with tools that bundle up more functionality and options than is UNIX-y, users can replace a gigantic set of wildly inconsistent tools with a merely large set of tools that, while inconsistent with each other, may have some internal consistency.</p> <p>McIlroy implies that the problem is that people didn't think hard enough, that the old school UNIX mavens would have sat down in the same room and thought longer and harder until they came up with a set of consistent tools that has &quot;unusual simplicity&quot;. But that was never going to scale; the philosophy made the mess we're in inevitable. It's not a matter of not thinking longer or harder; it's a matter of having a philosophy that cannot scale unless you have a relatively small team with a shared cultural understanding, able to sit down in the same room.</p> <p>Many of the main long-term UNIX anti-features and anti-patterns that we're still stuck with today, fifty years later, come from the &quot;we should all act like we're in the same room&quot; design philosophy, which is the opposite of the approach you want if you want to create nice, usable, general interfaces that can adapt to problems that the original designers didn't think of. For example, it's a common complaint that modern shells and terminals lack a bunch of obvious features that anyone designing a modern interface would want. When you talk to people who've written a new shell and a new terminal with modern principles in mind, like Jesse Luehrs, they'll note that a major problem is that the UNIX model doesn't have a good separation of interface and implementation, which works ok if you're going to write a terminal that acts in the same way that a terminal that was created fifty years ago acts, but is immediately and obviously problematic if you want to build a modern terminal.
That design philosophy works fine if everyone's in the same room and the system doesn't need to scale up the number of contributors or over time, but that's simply not the world we live in today.</p> <p>If anyone can write a tool and the main instruction comes from &quot;the unix philosophy&quot;, people will have different opinions about what &quot;<a href="https://twitter.com/hillelogram/status/1174714902151421952">simplicity</a>&quot; or &quot;doing one thing&quot;<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">5</a></sup> means, what the right way to do things is, and inconsistency will bloom, resulting in the kind of complexity you get when dealing with a wildly inconsistent language, like PHP. People make fun of PHP and javascript for having all sorts of warts and weird inconsistencies, but as a language and a standard library, any commonly used shell plus the collection of widely used *nix tools taken together is much worse and contains much more accidental complexity due to inconsistency even within a single Linux distro and there's no other way it could have turned out. If you compare across Linux distros, BSDs, Solaris, AIX, etc., the amount of accidental complexity that users have to hold in their heads when switching systems dwarfs PHP or javascript's incoherence. The most widely mocked programming languages are paragons of great design by comparison.</p> <p><a id="maven"></a>To be clear, I'm not saying that I or anyone else could have done better with the knowledge available in the 70s in terms of making a system that was practically useful at the time that would be elegant today. It's easy to look back and find issues with the benefit of hindsight. What I disagree with are comments from Unix mavens speaking today; comments like McIlroy's, which imply that we just forgot or don't understand the value of simplicity, or <a href="https://twitter.com/danluu/status/885214004649615360">Ken Thompson saying that C is as safe a language as any and if we don't want bugs we should just write bug-free code</a>. These kinds of comments imply that there's not much to learn from hindsight; in the 70s, we were building systems as effectively as anyone can today; five decades of collective experience, tens of millions of person-years, have taught us nothing; if we just go back to building systems like the original Unix mavens did, all will be well. I respectfully disagree.</p> <h3 id="appendix-memory">Appendix: memory</h3> <p>Although addressing McIlroy's complaints about binary size bloat is a bit out of scope for this, I will note that, in 2017, I bought a Chromebook that had 16GB of RAM for $300. A 1 meg binary might have been a serious problem in 1979, when a standard Apple II had 4KB. An Apple II cost $1298 in 1979 dollars, or $4612 in 2020 dollars. You can get a low end Chromebook that costs less than 1/15th as much which has four million times more memory. Complaining that memory usage grew by a factor of one thousand when a (portable!) machine that's more than an order of magnitude cheaper has four million times more memory seems a bit ridiculous.</p> <p>I prefer slimmer software, which is why I optimized my home page down to two packets (it would be a single packet if my CDN served high-level brotli), but that's purely an aesthetic preference, something I do for fun. The bottleneck for command line tools isn't memory usage and spending time optimizing the memory footprint of a tool that takes one meg is like getting a homepage down to a single packet. 
Perhaps a fun hobby, but not something that anyone should prescribe.</p> <h3 id="methodology-for-table">Methodology for table</h3> <p>Command frequencies were sourced from public command history files on github, not necessarily representative of your personal usage. Only &quot;simple&quot; commands were kept, which ruled out things like curl, git, gcc (which has &gt; 1000 options), and wget. What's considered simple is arbitrary. <a href="https://en.wikipedia.org/wiki/Shell_builtin">Shell builtins</a>, like <code>cd</code>, weren't included.</p> <p>Repeated options aren't counted as separate options. For example, <code>git blame -C</code>, <code>git blame -C -C</code>, and <code>git blame -C -C -C</code> have different behavior, but these would all be counted as a single argument even though <code>-C -C</code> is effectively a different argument from <code>-C</code>.</p> <p>The table counts sub-options as a single option. For example, <code>ls</code> has the following:</p> <blockquote> <p>--format=WORD across -x, commas -m,  horizontal  -x,  long  -l,  single-column  -1,  verbose  -l, vertical -C</p> </blockquote> <p>Even though there are seven format options, this is considered to be only one option.</p> <p>Options that are explicitly listed as not doing anything are still counted as options, e.g., <code>ls -g</code>, which reads <code>Ignored; for Unix compatibility.</code> is counted as an option.</p> <p>Multiple versions of the same option are also considered to be one option. For example, with <code>ls</code>, <code>-A</code> and <code>--almost-all</code> are counted as a single option.</p> <p>In cases where the manpage says an option is supposed to exist, but doesn't, the option isn't counted. For example, the v7 <code>mv</code> manpage says</p> <blockquote> <p>BUGS</p> <p>If file1 and file2 lie on different file systems, mv must copy the file and delete the original. In this case the owner name becomes that of the copying process and any linking relationship with other files is lost.</p> <p>Mv should take <strong>-f</strong> flag, like rm, to suppress the question if the target exists and is not writable.</p> </blockquote> <p><code>-f</code> isn't counted as a flag in the table because the option doesn't actually exist.</p> <p>The latest year in the table is 2017 because I wrote the first draft for this post in 2017 and didn't get around to cleaning it up until 2020.</p> <h3 id="related">Related</h3> <p><a href="https://blog.plover.com/Unix/tools.html">mjd on the Unix philosophy, with an aside into the mess of /usr/bin/time vs. built-in time</a>.</p> <p><a href="https://groups.google.com/forum/m/#!topic/rec.humor.funny/Q-HG4LpW564">mjd making a joke about the proliferation of command line options in 1991</a>.</p> <p>On HN:</p> <blockquote> <blockquote> <p>p1mrx:</p> <p><a href="https://unix.stackexchange.com/q/112125/261842">It's strange that ls has grown to 58 options, but still can't output \0-terminated filenames</a></p> <p>As an exercise, try to sort a directory by size or date, and pass the result to xargs, while supporting any valid filename. I eventually just gave up and made my script ignore any filenames containing \n.</p> </blockquote> <p>whelming_wave:</p> <p>Here you go: sort all files in the current directory by modification time, whitespace-in-filenames-safe. The <code>printf (od -&gt; sed)</code> construction converts back out of null-separated characters into newline-separated, though feel free to replace that with anything accepting null-separated input. 
Granted, <code>sort --zero-terminated</code> is a GNU extension and kinda cheating, but it's even available on macOS so it's probably fine.</p> </blockquote> <pre><code> printf '%b' $( find . -maxdepth 1 -exec sh -c ' printf '\''%s %s\0'\'' &quot;$(stat -f '\''%m'\'' &quot;$1&quot;)&quot; &quot;$1&quot; ' sh {} \; | \ sort --zero-terminated | \ od -v -b | \ sed 's/^[^ ]*// s/ *$// s/ */ \\/g s/\\000/\\012/g') </code></pre> <blockquote> <p>If you're running this under zsh, you'll need to prefix it with <code>command</code> to use the system executable: zsh's builtin printf doesn't support printing octal escape codes for normally printable characters, and you may have to assign the output to a variable and explicitly word-split it.</p> <p>This is all POSIX as far as I know, except for the sort.</p> </blockquote> <p><a href="https://en.wikipedia.org/wiki/The_Unix-Haters_Handbook">The Unix haters handbook</a>.</p> <p><a href="http://www.oilshell.org/blog/2018/01/28.html">Why create a new shell</a>?</p> <p><small> Thanks to Leah Hanson, Jesse Luehrs, Hillel Wayne, Wesley Aptekar-Cassels, Mark Jason Dominus, Travis Downs, and Yuri Vishnevsky for comments/corrections/discussion. </small></p> <p><link rel="prefetch" href="mcilroy-unix/"> <link rel="prefetch" href=""> <link rel="prefetch" href="discontinuities/"> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M">This quote is slightly different than the version I've seen everywhere because I watched <a href="https://archive.org/details/DougMcIlroy_AncestryOfLinux_DLSLUG">the source video</a>. AFAICT, every copy of this quote that's on the internet (indexed by Bing, DuckDuckGo, or Google) is a copy of one person's transcription of the quote. There's some ambiguity because the audio is low quality and I hear something a bit different than whoever transcribed that quote heard. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:T">Another example of something where the user absorbs the complexity because different commands handle formatting differently is <a href="https://blog.plover.com/Unix/tools.html">time formatting</a> — the shell builtin <code>time</code> is, of course, inconsistent with <code>/usr/bin/time</code> and the user is expected to know this and know how to handle it. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:W"><p>Just for example, you can use <code>ConvertTo-Json</code> or <code>ConvertTo-CSV</code> on any object, you <a href="https://docs.microsoft.com/en-us/powershell/scripting/samples/using-format-commands-to-change-output-view">can use cmdlets to change how properties are displayed for objects</a>, and you can write formatting configuration files that define how you prefer things to be formatted.</p> <p>Another way to look at this is through the lens of <a href="https://en.wikipedia.org/wiki/Conway's_law">Conway's law</a>. If we have a set of command line tools that are built by different people, often not organizationally connected, the tools are going to be wildly inconsistent unless someone can define a standard and get people to adopt it.
This actually works relatively well on Windows, and not just in PowerShell.</p> <p>A common complaint about Microsoft is that they've created massive API churn, often for non-technical organizational reasons (e.g., a Sinofsky power play, like the one described in the replies to the now-deleted Tweet at <a href="https://twitter.com/stevesi/status/733654590034300929">https://twitter.com/stevesi/status/733654590034300929</a>). It's true. Even so, from the standpoint of a naive user, off-the-shelf Windows software is generally a lot better at passing non-textual data around than *nix. One thing this falls out of is Windows's embracing of non-textual data, which goes back at least to <a href="https://en.wikipedia.org/wiki/Component_Object_Model">COM</a> in 1999 (and arguably OLE and DDE, released in 1990 and 1987, respectively).</p> <p>For example, if you copy from Foo, which supports binary formats <code>A</code> and <code>B</code>, into Bar, which supports formats <code>B</code> and <code>C</code> and you then copy from Bar into Baz, which supports <code>C</code> and <code>D</code>, this will work even though Foo and Baz have no commonly supported formats. </p> <p>When you cut/copy something, the application basically &quot;tells&quot; the clipboard what formats it could provide data in. When you paste into the application, the destination application can request the data in any of the formats in which it's available. If the data is already in the clipboard, &quot;Windows&quot; provides it. If it isn't, Windows gets the data from the source application and then gives to the destination application and a copy is saved for some length of time in Windows. If you &quot;cut&quot; from Excel it will tell &quot;you&quot; that it has the data available in many tens of formats. This kind of system is pretty good for compatibility, although it definitely isn't simple or minimal.</p> <p>In addition to nicely supporting many different formats and doing so for long enough that a lot of software plays nicely with this, Windows also generally has nicer clipboard support out of the box.</p> <p>Let's say you copy and then paste a small amount of text. Most of the time, this will work like you'd expect on both Windows and Linux. But now let's say you copy some text, close the program you copied from, and then paste it. A mental model that a lot of people have is that when they copy, the data is stored in the clipboard, not in the program being copied from. On Windows, software is typically written to conform to this expectation (although, technically, users of the clipboard API don't have to do this). This is less common on Linux with X, where the correct mental model for most software is that copying stores a pointer to the data, which is still owned by the program the data was copied from, which means that paste won't work if the program is closed. When I've (informally) surveyed programmers, they're usually surprised by this if they haven't actually done copy+paste related work for an application. When I've surveyed non-programmers, they tend to find the behavior to be confusing as well as surprising.</p> <p>The downside of having the OS effectively own the contents of the clipboard is that it's expensive to copy large amounts of data. Let's say you copy a really large amount of text, many gigabytes, or some complex object and then never paste it. You don't really want to copy that data from your program into the OS so that it can be available. 
Windows also handles this reasonably: applications can <a href="https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpeclip/fa309d1b-8034-44bf-b927-adfc753e69c1">provide data only on request</a> when that's deemed advantageous. In the case mentioned above, when someone closes the program, the program can decide whether or not it should push that data into the clipboard or discard it. In that circumstance, a lot of software (e.g., Excel) will prompt to &quot;keep&quot; the data in the clipboard or discard it, which is pretty reasonable.</p> <p>It's not impossible to support some of this on Linux. For example, <a href="https://freedesktop.org/wiki/ClipboardManager/">the ClipboardManager spec</a> describes a persistence mechanism and GNOME applications generally kind of sort of support it (although <a href="https://bugzilla.gnome.org/show_bug.cgi?id=510204#c8">there are some bugs</a>) but the situation on *nix is really different from the more pervasive support Windows applications tend to have for nice clipboard behavior.</p> <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:C"><p>Another example of this is the set of tools that are available on top of modern compilers. If we go back and look at McIlroy's canonical example, how proper UNIX compilers are so specialized that listings are a separate tool, we can see that this has changed even if there's still a separate tool you can use for listings. Some commonly used Linux compilers have literally thousands of options and do many things. For example, one of the many things <code>clang</code> now does is static analysis. As of this writing, <a href="https://clang.llvm.org/docs/analyzer/checkers.html#default-checkers">there are 79 normal static analysis checks and 44 experimental checks</a>. If these were separate commands (perhaps individual commands or perhaps a <code>static_analysis</code> command), they'd still rely on the same underlying compiler infrastructure and impose the same maintenance burden — it's not really reasonable to have these static analysis tools operate on plain text and reimplement the entire compiler toolchain necessary to get to the point where they can do static analysis. They could be separate commands instead of bundled into <code>clang</code>, but they'd still take a dependency on the same machinery that's used for the compiler and either impose a maintenance and complexity burden on the compiler (which has to support non-breaking interfaces for the tools built on top) or they'd break all the time.</p> <p>&quot;Just make everything text so that it's simple&quot; makes for a nice soundbite, but in reality the textual representation of the data is often not what you want if you want to do actually useful work.</p> <p>And on clang in particular, whether you make it a monolithic command or thousands of smaller commands, clang simply does more than any compiler that existed in 1979 or even all compilers that existed in 1979 combined. It's easy to say that things were simpler in 1979 and that us modern programmers have lost our way. It's harder to propose a design that's actually much simpler and could really get adopted. It's impossible that such a design could maintain all of the existing functionality and configurability and be as simple as something from 1979.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:P">Since its inception, curl has gone from supporting 3 protocols to 40. 
Does that mean it does 40 things and it would be more &quot;UNIX-y&quot; to split it up into 40 separate commands? Depends on who you ask. If each protocol were its own command, created and maintained by a different person, we'd be in the same situation we are with other commands. Inconsistent command line options, inconsistent output formats despite it all being text streams, etc. Would that be closer to the simplicity McIlroy advocates for? Depends on who you ask. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> Suspicious discontinuities discontinuities/ Tue, 18 Feb 2020 00:00:00 +0000 discontinuities/ <p>If you read any personal finance forums late last year, there's a decent chance you ran across a question from someone who was desperately trying to lose money before the end of the year. There are a number of ways someone could do this; one commonly suggested scheme was to buy <a href="https://en.wikipedia.org/wiki/Put_option">put options that were expected to expire worthless</a>, allowing the buyer to (probably) take a loss.</p> <p>One reason people were looking for ways to lose money was that, in the U.S., there's <a href="https://en.wikipedia.org/wiki/Patient_Protection_and_Affordable_Care_Act#Subsidy_Cliff_at_400%_FPL">a hard income cutoff for a health insurance subsidy</a> at $48,560 for individuals (higher for larger households; $100,400 for a family of four). There are a number of factors that can cause the details to vary (age, location, household size, type of plan), but across all circumstances, it wouldn't have been uncommon for an individual going from one side of the cut-off to the other to have their health insurance cost increase by roughly $7200/yr. That means if an individual buying ACA insurance was going to earn $55k, they'd be better off reducing their income by $6440 and getting under the $48,560 subsidy ceiling than they are earning $55k.</p> <p>Although that's an unusually severe example, <a href="http://www.cbo.gov/sites/default/files/cbofiles/attachments/11-15-2012-MarginalTaxRates.pdf">U.S. tax policy is full of discontinuities that disincentivize increasing earnings and, in some cases, actually incentivize decreasing earnings</a>. Some other discontinuities are the <a href="https://en.wikipedia.org/wiki/Temporary_Assistance_for_Needy_Families">TANF</a> income limit, the <a href="https://en.wikipedia.org/wiki/Medicaid">Medicaid</a> income limit, the <a href="https://en.wikipedia.org/wiki/Children%27s_Health_Insurance_Program">CHIP</a> income limit for free coverage, and the CHIP income limit for reduced-cost coverage. These vary by location and circumstance; the TANF and Medicaid income limits fall into ranges generally considered to be &quot;low income&quot; and the CHIP limits fall into ranges generally considered to be &quot;middle class&quot;. These subsidy discontinuities have the same impact as the ACA subsidy discontinuity -- at certain income levels, people are incentivized to lose money.</p> <blockquote> <p>Anyone may arrange his affairs so that his taxes shall be as low as possible; he is not bound to choose that pattern which best pays the treasury. There is not even a patriotic duty to increase one's taxes. Over and over again the Courts have said that there is nothing sinister in so arranging affairs as to keep taxes as low as possible. 
Everyone does it, rich and poor alike and all do right, for nobody owes any public duty to pay more than the law demands.</p> </blockquote> <p>If you agree with the famous <a href="https://en.wikipedia.org/wiki/Learned_Hand">Learned Hand</a> quote then losing money in order to reduce effective tax rate, increasing disposable income, is completely legitimate behavior at the individual level. However, a tax system that encourages people to lose money, perhaps by funneling it to (on average) much wealthier options traders by buying put options, seems sub-optimal.</p> <p>A simple fix for the problems mentioned above would be to have slow phase-outs instead of sharp thresholds. Slow phase-outs are actually done for some subsidies and, while that can also have problems, they are typically less problematic than introducing a sharp discontinuity in tax/subsidy policy.</p> <p>In this post, we'll look at a variety of discontinuities.</p> <h3 id="hardware-or-software-queues">Hardware or software queues</h3> <p>A naive queue has discontinuous behavior. If the queue is full, new entries are dropped. If the queue isn't full, new entries are not dropped. Depending on your goals, this can often have impacts that are non-ideal. For example, in networking, a naive queue might be considered &quot;unfair&quot; to bursty workloads that have low overall bandwidth utilization because workloads that have low bandwidth utilization &quot;shouldn't&quot; suffer more drops than workloads that are less bursty but use more bandwidth (this is also arguably not unfair, depending on what your goals are).</p> <p>A class of solutions to this problem are <a href="https://en.wikipedia.org/wiki/Random_early_detection">random early drop</a> and its variants, which gives incoming items a probability of being dropped which can be determined by queue fullness (and possibly other factors), smoothing out the discontinuity and mitigating issues caused by having a discontinuous probability of queue drops.</p> <p><a href="//danluu.com/randomize-hn/">This post on voting in link aggregators</a> is fundamentally the same idea although, in some sense, the polarity is reversed. There's a very sharp discontinuity in how much traffic something gets based on whether or not it's on the front page. You could view this as a link getting dropped from a queue if it only receives N-1 votes and not getting dropped if it receives N votes.</p> <h3 id="college-admissions-and-pell-grant-recipients-https-www-insidehighered-com-admissions-article-2019-01-28-study-pressure-enroll-more-pell-eligible-students-has-skewed-colleges"><a href="https://www.insidehighered.com/admissions/article/2019/01/28/study-pressure-enroll-more-pell-eligible-students-has-skewed-colleges">College admissions and Pell Grant recipients</a></h3> <p><a href="https://en.wikipedia.org/wiki/Pell_Grant">Pell Grants</a> started getting used as a proxy for how serious schools are about helping/admitting low-income students. The first order impact is that students above the Pell Grant threshold had a significantly reduced probability of being admitted while students below the Pell Grant threshold had a significantly higher chance of being admitted. Phrased that way, it sounds like things are working as intended.</p> <p>However, when we look at what happens within each group, we see outcomes that are the opposite of what we'd want if the goal is to benefit students from low income families. 
Among people who don't qualify for a Pell Grant, it's those with the lowest income who are the most severely impacted and have the most severely reduced probability of admission. Among people who do qualify, it's those with the highest income who are most likely to benefit, again the opposite of what you'd probably want if your goal is to benefit students from low income families.</p> <p>We can see this in the graphs below, which are histograms of parental income among students at two universities in 2008 (first graph) and 2016 (second graph), where the red line indicates the Pell Grant threshold.</p> <p><img src="images/discontinuities/pell-2008.jpg" alt="Histogram of income distribution of students at two universities in 2008; high incomes are highly overrepresented relative to the general population, but the distribution is smooth" height="816" width="1056"></p> <p><img src="images/discontinuities/pell-2016.jpg" alt="Histogram of income distribution of students at two universities in 2016; high incomes are still highly overrepresented, there's also a sharp discontinuity at the Pell grant threshold; plot looks roughly two upwards sloping piecewise linear functions, with a drop back to nearly 0 at the discontinuity at the Pell grant threshold" height="816" width="1056"></p> <p>A second order effect of universities optimizing for Pell Grant recipients is that savvy parents can do the same thing that some people do to cut their taxable income at the last minute. Someone might put money into a traditional IRA instead of a Roth IRA and, if they're at their IRA contribution limit, they can try to lose money on options, effectively transferring money to options traders who are likely to be wealthier than them, in order to bring their income below the Pell Grant threshold, increasing the probability that their children will be admitted to a selective school.</p> <h3 id="election-statistics-https-arxiv-org-pdf-1410-6059-pdf"><a href="https://arxiv.org/pdf/1410.6059.pdf">Election statistics</a></h3> <p>The following histograms of Russian elections across polling stations show curious spikes in turnout and results at nice, round numbers (e.g., 95%) starting around 2004. This appears to indicate that there's election fraud via fabricated results and that at least some of the people fabricating results don't bother with fabricating results that have a smooth distribution.</p> <p><img src="images/discontinuities/russian-elections.png" height="1418" width="1822"></p> <p>For finding fraudulent numbers, also see <a href="https://en.wikipedia.org/wiki/Benford%27s_law">Benford's law</a>.</p> <h3 id="used-car-sale-prices-https-www-ftc-gov-sites-default-files-documents-public-events-3rd-annual-microeconomics-conference-lacetera-slide-pdf"><a href="https://www.ftc.gov/sites/default/files/documents/public_events/3rd-annual-microeconomics-conference/lacetera_slide.pdf">Used car sale prices</a></h3> <p><a href="https://twitter.com/ainsworld/status/1436418752409706519">Mark Ainsworth points out that there are discontinuities</a> at $10k boundaries in U.S. auto auction sales prices as well as volume of vehicles offered at auction. 
The price graph below adjusts for a number of factors such as model year, but we can see the same discontinuities in the raw unadjusted data.</p> <p><img src="images/discontinuities/car-prices.png" alt="Graph of car sales prices at auction, showing discontinuities described above" height="1670" width="942"></p> <p><img src="images/discontinuities/car-volumes.png" alt="Graph of car volumes at auction, showing discontinuities described above for dealer sales to auction but not fleet sales to auction" height="1710" width="992"></p> <h3 id="p-values-https-en-wikipedia-org-wiki-p-value"><a href="https://en.wikipedia.org/wiki/P-value">p-values</a></h3> <p>Authors of psychology papers are incentivized to produce papers with <a href="https://en.wikipedia.org/wiki/P-value">p values</a> below some threshold, usually 0.05, but sometimes 0.1 or 0.01. <a href="https://www.ncbi.nlm.nih.gov/pubmed/22853650">Masicampo et al. plotted p values from papers published in three psychology journals</a> and found a curiously high number of papers with p values just below 0.05.</p> <p><img src="images/discontinuities/p-value.png" alt="Histogram of published p-values; spike at p=0.05" height="1000" width="1874"></p> <p>The spike at p = 0.05 is consistent with a number of hypotheses that aren't great, such as:</p> <ul> <li>Authors are fudging results to get p = 0.05</li> <li>Journals are much more likely to accept a paper with p = 0.05 than if p = 0.055</li> <li>Authors are much less likely to submit results if p = 0.055 than if p = 0.05</li> </ul> <p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359000/">Head et al. (2015)</a> surveys the evidence across a number of fields.</p> <p>Andrew Gelman and others have been campaigning to get rid of the idea of statistical significance and p-value thresholds for years, <a href="https://stat.columbia.edu/~gelman/research/unpublished/abandon.pdf">see this paper for a short summary of why</a>. Not only would this reduce the incentive for authors to cheat on p values, there are other reasons to not want a bright-line rule to determine if something is &quot;significant&quot; or not.</p> <h3 id="drug-charges-http-econweb-umd-edu-tuttle-files-tuttle-mandatory-minimums-pdf"><a href="http://econweb.umd.edu/~tuttle/files/tuttle_mandatory_minimums.pdf">Drug charges</a></h3> <p>The top two graphs in this set of four show histograms of the amount of cocaine people were charged with possessing before and after the passing of the Fair Sentencing Act in 2010, which raised the amount of cocaine necessary to trigger the 10-year mandatory minimum prison sentence for possession from 50g to 280g. There's a relatively smooth distribution before 2010 and a sharp discontinuity after 2010.</p> <p>The bottom-left graph shows the sharp spike in prosecutions at 280 grams followed by what might be a drop in 2013 after evidentiary standards were changed<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>.</p> <p><img src="images/discontinuities/cocaine-280.png" height="1118" width="1408"></p> <h3 id="high-school-exit-exam-scores-danluu-com-matura-2013-pdf"><a href="//danluu.com/matura-2013.pdf">High school exit exam scores</a></h3> <p>This is a histogram of high school exit exam scores from the Polish language exam. We can see that a curiously high number of students score 30 or just above 30, while a curiously low number of students score from 23-29. 
This is from 2013; other years I've looked at (2010-2012) show a similar discontinuity.</p> <p>Math exit exam scores don't exhibit any unusual discontinuities in the years I've examined (2010-2013).</p> <p><img src="images/discontinuities/matura-polish-2013.png" height="772" width="1416"></p> <p><a href="https://www.reddit.com/r/dataisbeautiful/comments/1bqf9r/unusual_distributions_of_scores_on_final/c994zxt/">An anonymous reddit commenter explains this</a>:</p> <blockquote> <p>When a teacher is grading matura (final HS exam), he/she doesn't know whose test it is. The only things that are known are: the number (code) of the student and the district which matura comes from (it is usually from completely different part of Poland). The system is made to prevent any kind of manipulation, for example from time to time teachers supervisor will come to check if test are graded correctly. I don't wanna talk much about system flaws (and advantages), it is well known in every education system in the world where final tests are made, but you have to keep in mind that there is a key, which teachers follow very strictly when grading.</p> <p>So, when a score of the test is below 30%, exam is failed. However, before making final statement in protocol, a commision of 3 (I don't remember exact number) is checking test again. This is the moment, where difference between humanities and math is shown: teachers often try to find a one (or a few) missing points, so the test won't be failed, because it's a tragedy to this person, his school and somewhat fuss for the grading team. Finding a &quot;missing&quot; point is not that hard when you are grading writing or open questions, which is a case in polish language, but nearly impossible in math. So that's the reason why distribution of scores is so different.</p> </blockquote> <p>As with p values, having a bright-line threshold, causes curious behavior. In this case, scoring below 30 on any subject (a 30 or above is required in every subject) and failing the exam has arbitrary negative effects for people, so teachers usually try to prevent people from failing if there's an easy way to do it, but a deeper root of the problem is the idea that it's necessary to produce a certification that's the discretization of a continuous score.</p> <h3 id="birth-month-and-sports">Birth month and sports</h3> <p>These are scatterplots of football (soccer) players in the <a href="https://en.wikipedia.org/wiki/UEFA_Youth_League">UEFA Youth League</a>. The x-axis on both of these plots is how old players are modulo the year, i.e., their birth month normalized from 0 to 1.</p> <p>The graph on the left is a histogram, which shows that there is a very strong relationship between where a person's birth falls within the year and their odds of making a club at the UEFA Youth League (U19) level. The graph on the right purports to show that birth time is only weakly correlated with actual value provided on the field. The authors use playing time as a proxy for value, presumably because it's easy to measure. 
That's not a great measure, but the result they find (younger-within-the-year players have higher value, conditional on making the U19 league) is consistent with other studies on sports and discrimination, which find (for example) that <a href="tech-discrimination/">black baseball players were significantly better than white baseball players for decades after desegregation in baseball, French-Canadian defensemen are also better than average (French-Canadians are stereotypically afraid to fight, don't work hard enough, and are too focused on offense)</a>.</p> <p>The discontinuity isn't directly shown in the graphs above because the graphs only show birth date for one year. If we were to plot birth date by cohort across multiple years, we'd expect to see a sawtooth pattern in the probability that a player makes it into the UEFA youth league with a 10x difference between someone born one day before vs. after the threshold.</p> <p><img src="images/discontinuities/u19-age.png" height="1030" width="1758"></p> <p>This phenomenon, that birth day or month is a good predictor of participation in higher-level youth sports as well as pro sports, has been studied across a variety of sports.</p> <p>It's generally believed that this is caused by a discontinuity in youth sports:</p> <ol> <li>Kids are bucketed into groups by age in years and compete against people in the same year</li> <li>Within a given year, older kids are stronger, faster, etc., and perform better</li> <li>This causes older-within-year kids to outcompete younger kids, which later results in older-within-year kids having higher levels of participation for a variety of reasons</li> </ol> <p>This is arguably a &quot;bug&quot; in how youth sports works. But <a href="//danluu.com/bad-decisions/">as we've seen in baseball</a> <a href="//danluu.com/tech-discrimination/">as well as a survey of multiple sports</a>, obviously bad decision making that costs individual teams tens or even hundreds of millions of dollars can persist for decades in the face of people publicly discussing how bad the decisions are. In this case, the youth sports teams aren't feeder teams to pro teams, so they don't have a financial incentive to select players who are skilled for their age (as opposed to just taller and faster because they're slightly older), which makes this system-wide non-optimality even more difficult to fix than pro sports teams making locally non-optimal decisions that are completely under their control.</p> <h3 id="procurement-auctions-http-www-keikawai-com-full-0804-pdf"><a href="http://www.keikawai.com/Full_0804.pdf">Procurement auctions</a></h3> <p>Kawai et al. looked at Japanese government procurement in order to find suspicious patterns of bids like the ones described in <a href="https://www.nber.org/papers/w4013">Porter et al. (1993)</a>, which looked at collusion in procurement auctions on Long Island (in New York in the United States). One example that's given is:</p> <blockquote> <p>In February 1983, the New York State Department of Transportation (DoT) held a procurement auction for resurfacing 0.8 miles of road. The lowest bid in the auction was $4 million, and the DoT decided not to award the contract because the bid was deemed too high relative to its own cost estimates. The project was put up for a reauction in May 1983 in which all the bidders from the initial auction participated. The lowest bid in the reauction was 20% higher than in the initial auction, submitted by the previous low bidder. Again, the contract was not awarded. 
The DoT held a third auction in February 1984, with the same set of bidders as in the initial auction. The lowest bid in the third auction was 10% higher than the second time, again submitted by the same bidder. The DoT apparently thought this was suspicious: “It is notable that the same firm submitted the low bid in each of the auctions. Because of the unusual bidding patterns, the contract was not awarded through 1987.”</p> </blockquote> <p>It could be argued that this is expected because different firms have different cost structures, so the lowest bidder in an auction for one particular project should be expected to be the lowest bidder in subsequent auctions for the same project. In order to distinguish between collusion and real structural cost differences between firms, Kawai et al. (2015) looked at auctions where the difference in bid between the first and second place firms was very small, making the winner effectively random.</p> <p>In the auction structure studied, bidders submit a secret bid. If the secret bid is above a secret minimum, then the lowest bidder wins the auction and gets the contract. If not, the lowest bid is revealed to all bidders and another round of bidding is done. Kawai et al. found that, in about 97% of auctions, the bidder who submitted the lowest bid in the first round also submitted the lowest bid in the second round (the probability that the second lowest bidder remains second lowest was 26%).</p> <p>Below is a histogram of the difference in first and second round bids between the first-lowest and second-lowest bidders (left column) and the second-lowest and third-lowest bidders (right column). Each row has a different filtering criterion for how close the auction has to be in order to be included. In the top row, all auctions that reached the third round were included; in the second and third rows, the normalized delta between the first and second bidders was less than 0.05 and 0.01, respectively; in the last row, the normalized delta between the first and the third bidder was less than 0.03. All numbers are normalized because the absolute size of auctions can vary.</p> <p><img src="images/discontinuities/jp-procurement-bidding.png" height="1446" width="1682"></p> <p>We can see that the distributions of deltas between the first and second round are roughly symmetrical when comparing second and third lowest bidders. But when comparing first and second lowest bidders, there's a sharp discontinuity at zero, indicating that the second-lowest bidder almost never lowers their bid by more than the first-lowest bidder did. If you read the paper, you can see that the same structure persists into auctions that go into a third round.</p> <p>I don't mean to pick on Japanese procurement auctions in particular. 
There's an extensive literature on procurement auctions that's found collusion in many cases, often much more blatant than the case presented above (e.g., there are a few firms and they round-robin who wins across auctions, or there are a handful of firms and every firm except for the winner puts in the same losing bid).</p> <h3 id="restaurant-inspection-https-iquantny-tumblr-com-post-76928412519-think-nyc-restaurant-grading-is-flawed-heres-scores-http-datafra-me-blog-calling-out-nyc-restaurant-violations"><a href="https://iquantny.tumblr.com/post/76928412519/think-nyc-restaurant-grading-is-flawed-heres">Restaurant inspection</a> <a href="http://datafra.me/blog/Calling-out-NYC-restaurant-violations">scores</a></h3> <p>The histograms below show a sharp discontinuity between 13 and 14, which is the difference between an A grade and a B grade. It appears that some regions also have a discontinuity between 27 and 28, which is the difference between a B and a C, and <a href="https://iquantny.tumblr.com/post/76928412519/think-nyc-restaurant-grading-is-flawed-heres">this older analysis from 2014</a> found what appears to be a similar discontinuity between B and C grades.</p> <p><img src="images/discontinuities/nyc-restaurant-inspections.png" height="900" width="1260"></p> <p>Inspectors have discretion in what violations are tallied and it appears that there are cases where restaurants are nudged up to the next higher grade.</p> <h3 id="marathon-finishing-times-https-faculty-chicagobooth-edu-devin-pope-research-pdf-website-marathons-pdf"><a href="https://faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf">Marathon finishing times</a></h3> <p>A histogram of marathon finishing times (finish times on the x-axis, count on the y-axis) across 9,789,093 finishes shows noticeable discontinuities at every half hour, as well as at &quot;round&quot; times like :10, :15, and :20.</p> <p><img src="images/discontinuities/marathon-times.png" height="1094" width="1530"></p> <p>An analysis of times within each race (<a href="faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf">see section 4.4, figures 7-9</a>) indicates that this is at least partially because people speed up (or slow down less than usual) towards the end of races if they're close to a &quot;round&quot; time<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>.</p> <h3 id="notes">Notes</h3> <p>This post doesn't really have a goal or a point; it's just a collection of discontinuities that I find fun.</p> <p>One thing that's maybe worth noting is that I've gotten a lot of mileage in my career both out of being suspicious of discontinuities and figuring out where they come from and also out of applying standard techniques to smooth out discontinuities.</p> <p>For finding discontinuities, basic tools like &quot;drawing a scatterplot&quot;, &quot;<a href="//danluu.com/perf-tracing/#histogram">drawing a histogram</a>&quot;, &quot;drawing the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a>&quot; often come in handy. Other kinds of visualizations that add temporality, like <a href="https://github.com/Netflix/flamescope">flamescope</a>, can also be useful.</p> <p>We <a href="#hardware-or-software-queues">noted above that queues create a kind of discontinuity that, in some circumstances, should be smoothed out</a>. 
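</p> <p>As a concrete illustration of what smoothing that queue discontinuity looks like, here's a minimal sketch comparing a hard drop threshold with a probabilistic one. This is illustrative Python, not any particular RED implementation; real variants typically work on a smoothed average queue length with more carefully chosen parameters, and the linear ramp and thresholds below are arbitrary.</p> <pre><code>import random

# Naive queue: discontinuous behavior. An arrival is kept or dropped based
# only on whether the queue is full.
def naive_enqueue(queue, item, capacity):
    if len(queue) &gt;= capacity:
        return False  # drop
    queue.append(item)
    return True

# RED-style queue: the drop probability ramps up smoothly with fullness,
# so arrivals start seeing occasional drops before the queue is completely full.
def red_enqueue(queue, item, capacity, min_fill=0.5):
    fill = len(queue) / capacity
    if fill &gt;= 1.0:
        return False  # still have to drop when actually full
    if fill &gt; min_fill:
        # linear ramp from 0 at min_fill to 1 at completely full
        drop_probability = (fill - min_fill) / (1.0 - min_fill)
        if random.random() &lt; drop_probability:
            return False
    queue.append(item)
    return True
</code></pre> <p>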
We also noted that we see similar behavior for other kinds of thresholds and that randomization can be a useful tool to smooth out discontinuities in thresholds as well. Randomization can also be used to allow for reducing quantization error when reducing precision with ML and in other applications.</p> <p><small>Thanks to Leah Hanson, Omar Rizwan, Dmitry Belenko, Kamal Marhubi, Danny Vilea, Nick Roberts, Lifan Zeng, Mark Ainsworth, Wesley Aptekar-Cassels, Thomas Hauk, @BaudDev, and Michael Sullivan for comments/corrections/discussion.</p> <p>Also, please feel free to <a href="https://twitter.com/danluu/status/1230595642390564866">send me other interesting discontinuities</a>! </small></p> <p><link rel="prefetch" href="//danluu.com/randomize-hn/"> <link rel="prefetch" href="//danluu.com/bad-decisions/"> <link rel="prefetch" href="//danluu.com/tech-discrimination/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R"><p>Most online commentary I've seen about this paper is incorrect. I've seen this paper used as evidence of police malfeasance because the amount of cocaine seized jumped to 280g. This is the opposite of what's described in the paper, where the author notes that, based on drug seizure records, amounts seized do not appear to be the cause of this change. After noting that drug seizures are not the cause, the author notes that prosecutors can charge people for amounts that are not the same as the amount seized and then notes:</p> <blockquote> <p>I do find bunching at 280g after 2010 in case management data from the Executive Office of the US Attorney (EOUSA). I also find that approximately 30% of prosecutors are responsible for the rise in cases with 280g after 2010, and that there is variation in prosecutor-level bunching both within and between districts. Prosecutors who bunch cases at 280g also have a high share of cases right above 28g after 2010 (the 5-year threshold post-2010) and a high share of cases above 50g prior to 2010 (the 10-year threshold pre-2010). Also, bunching above a mandatory minimum threshold persists across districts for prosecutors who switch districts. Moreover, when a “bunching” prosecutor switches into a new district, all other attorneys in that district increase their own bunching at mandatory minimums. These results suggest that the observed bunching at sentencing is specifically due to prosecutorial discretion</p> </blockquote> <p>This is mentioned in the abstract and then expounded on in the introduction (the quoted passage is from the introduction), so I think that most people commenting on this paper can't have read it. I've done a few surveys of comments on papers on blog posts and I generally find that, in cases where it's possible to identify this (e.g., when the post is mistitled), the vast majority of commenters can't have read the paper or post they're commenting on, but that's a topic for another post.</p> <p>There is some evidence that something fishy may be going on in seizures (e.g., see Fig. A8.(c)), but if the analysis in the paper is correct, that impact of that is much smaller than the impact of prosecutorial discretion.</p> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:M">One of the most common comments I've seen online about this graph and/or this paper is that this is due to pace runners provided by the marathon. Section 4.4 of the paper gives multiple explanations for why this cannot be the case, once again indicating that people tend to comment without reading the paper. 
<a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> 95%-ile isn't that good p95-skill/ Fri, 07 Feb 2020 00:00:00 +0000 p95-skill/ <p>Reaching 95%-ile isn't very impressive because it's not that hard to do. I think this is one of my most ridiculable ideas. It doesn't help that, when stated nakedly, that sounds elitist. But I think it's just the opposite: most people can become (relatively) good at most things.</p> <p>Note that when I say 95%-ile, I mean 95%-ile among people who participate, not all people (for many activities, just doing it at all makes you 99%-ile or above across all people). I'm also not referring to 95%-ile among people who practice regularly. The &quot;<a href="https://en.wikipedia.org/wiki/One_weird_trick_advertisements">one weird trick</a>&quot; is that, for a lot of activities, being something like 10%-ile among people who practice can make you something like 90%-ile or 99%-ile among people who participate.</p> <p>This post is going to refer to specifics since <a href="https://twitter.com/danluu/status/926493613097439232">the discussions I've seen about this are all in the abstract</a>, which turns them into Rorschach tests. For example, Scott Adams has a widely cited post claiming that it's better to be a generalist than a specialist because, to become &quot;extraordinary&quot;, you have to either be &quot;the best&quot; at one thing or 75%-ile at two things. If that were strictly true, it would surely be better to be a generalist, but that's of course exaggeration and it's possible to get a lot of value out of a specialized skill without being &quot;the best&quot;; since the precise claim, as written, is obviously silly and the rest of the post is vague handwaving, discussions will inevitably devolve into people stating their prior beliefs and basically ignoring the content of the post.</p> <p>Personally, in every activity I've participated in where it's possible to get a rough percentile ranking, people who are 95%-ile constantly make mistakes that seem like they should be easy to observe and correct. &quot;Real world&quot; activities typically can't be reduced to a percentile rating, but achieving what appears to be a similar level of proficiency seems similarly easy.</p> <p>We'll start by looking at Overwatch (a video game) in detail because it's an activity I'm familiar with where it's easy to get ranking information and observe what's happening, and then we'll look at some &quot;real world&quot; examples where we can observe the same phenomena, although we won't be able to get ranking information for real world examples<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <h3 id="overwatch">Overwatch</h3> <p>At 90%-ile and 95%-ile ranks in Overwatch, the vast majority of players will pretty much constantly make basic game losing mistakes. These are simple mistakes like standing next to the objective instead of on top of the objective while the match timer runs out, turning a probable victory into a certain defeat. 
See the attached footnote if you want enough detail about specific mistakes that you can decide for yourself if a mistake is &quot;basic&quot; or not<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">2</a></sup>.</p> <p>Some reasons we might expect this to happen are:</p> <ol> <li>People don't want to win or don't care about winning</li> <li>People understand their mistakes but haven't put in enough time to fix them</li> <li>People are untalented</li> <li>People don't understand how to spot their mistakes and fix them</li> </ol> <p>In Overwatch, you may see a lot of (1), people who don’t seem to care about winning, at lower ranks, but by the time you get to 30%-ile, it's common to see people indicate their desire to win in various ways, such as yelling at players who are perceived as uncaring about victory or unskilled, complaining about people who they perceive to make mistakes that prevented their team from winning, etc.<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup>. Other than the occasional troll, it's not unreasonable to think that people are generally trying to win when they're severely angered by losing.</p> <p>(2), not having put in time enough to fix their mistakes will, at some point, apply to all players who are improving, but if you look at the median time played at 50%-ile, people who are stably ranked there have put in hundreds of hours (and the median time played at higher ranks is higher). Given how simple the mistakes we're discussing are, not having put in enough time cannot be the case for most players.</p> <p>A common complaint among low-ranked Overwatch players in Overwatch forums is that they're just not talented and can never get better. Most people probably don't have the talent to play in a professional league regardless of their practice regimen, but when you can get to 95%-ile by fixing mistakes like &quot;not realizing that you should stand on the objective&quot;, you don't really need a lot of talent to get to 95%-ile.</p> <p>While (4), people not understanding how to spot and fix their mistakes, isn't the only other possible explanation<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">4</a></sup>, I believe it's the most likely explanation for most players. Most players who express frustration that they're stuck at a rank up to maybe 95%-ile or 99%-ile don't seem to realize that they could drastically improve by observing their own gameplay or having someone else look at their gameplay.</p> <p>One thing that's curious about this is that Overwatch makes it easy to spot basic mistakes (compared to most other activities). After you're killed, the game shows you how you died from the viewpoint of the player who killed you, allowing you to see what led to your death. Overwatch also records the entire game and lets you watch a replay of the game, allowing you to figure out what happened and why the game was won or lost. In many other games, you'd have to set up recording software to be able to view a replay.</p> <p>If you read Overwatch forums, you'll see a regular stream of posts that are basically &quot;I'm SOOOOOO FRUSTRATED! I've played this game for 1200 hours and I'm still ranked 10%-ile, [some Overwatch specific stuff that will vary from player to player]&quot;. Another user will inevitably respond with something like &quot;we can't tell what's wrong from your text, please post a video of your gameplay&quot;. 
In the cases where the original poster responds with a recording of their play, people will post helpful feedback that will immediately make the player much better if they take it seriously. If you follow these people who ask for help, you'll often see them ask for feedback at a much higher rank (e.g., moving from 10%-ile to 40%-ile) shortly afterwards. It's nice to see that the advice works, but it's unfortunate that so many players don't realize that watching their own recordings or posting recordings for feedback could have saved 1198 hours of frustration.</p> <p>It appears to be common for Overwatch players (well into 95%-ile and above) to:</p> <ul> <li>Want to improve</li> <li>Not get feedback</li> <li>Improve slowly when getting feedback would make improving quickly easy</li> </ul> <p>Overwatch provides the tools to make it relatively easy to get feedback, but people who very strongly express a desire to improve don't avail themselves of these tools.</p> <h3 id="real-life">Real life</h3> <p>My experience is that other games are similar and I think that &quot;real life&quot; activities aren't so different, although there are some complications.</p> <p>One complication is that real life activities tend not to have a single, one-dimensional, objective to optimize for. Another is that what makes someone good at a real life activity tends to be poorly understood (by comparison to games and sports) even in relation to a specific, well defined, goal.</p> <p>Games with rating systems are easy to optimize for: your meta-goal can be to get a high rating, which can typically be achieved by increasing your win rate by fixing the kinds of mistakes described above, like not realizing that you should step onto the objective. For any particular mistake, you can even make a reasonable guess at the impact on your win rate and therefore the impact on your rating.</p> <p>In real life, if you want to be (for example) &quot;a good speaker&quot;, that might mean that you want to give informative talks that help people learn or that you want to give entertaining talks that people enjoy or that you want to give keynotes at prestigious conferences or that you want to be asked to give talks for $50k an appearance. Those are all different objectives, with different strategies for achieving them and for some particular mistake (e.g., spending 8 minutes on introducing yourself during a 20 minute talk), it's unclear what that means with respect to your goal.</p> <p>Another thing that makes games, at least mainstream ones, easy to optimize for is that they tend to have a lot of aficionados who have obsessively tried to figure out what's effective. This means that if you want to improve, unless you're trying to be among the top in the world, you can simply figure out what resources have worked for other people, pick one up, read/watch it, and then practice. For example, if you want to be 99%-ile in a trick-taking card game like bridge or spades (among all players, not subgroups like &quot;ACBL players with masterpoints&quot; or &quot;people who regularly attend North American Bridge Championships&quot;), you can do this by:</p> <ul> <li>learning the basics of the game</li> <li>reading <a href="https://amzn.to/2sUtmsk">a beginner book on cardplay</a></li> <li>practicing applying the material</li> </ul> <p>If you want to become a good speaker and you have a specific definition of “a good speaker” in mind, there still isn't an obvious path forward. 
Great speakers will give directly contradictory advice (e.g., avoid focusing on presentation skills vs. practice presentation skills). Relatively few people obsessively try to improve and figure out what works, which results in a lack of rigorous curricula for improving. However, this also means that it's easy to improve in percentile terms since <a href="https://twitter.com/danluu/status/1442945072144678914">relatively few people are trying to improve at all</a>.</p> <p>Despite all of the caveats above, my belief is that it's easier to become relatively good at real life activities than at games or sports because there's so little deliberate practice put into most real life activities. Just for example, if you're a local table tennis hotshot who can beat every rando at a local bar, when you challenge someone to a game and they say &quot;sure, what's your rating?&quot; you know you're in for a shellacking by someone who can probably beat you while playing with a shoe brush (something that actually happened to a friend of mine, BTW). You're probably 99%-ile, but someone with no talent who's put in the time to practice the basics is going to have a serve that you can't return as well as be able to kill any shot a local bar expert is able to consistently hit. In most real life activities, there's almost no one who puts in a level of deliberate practice equivalent to someone who goes down to their local table tennis club and practices two hours a week, let alone someone like a top pro, who might seriously train for four hours a day.</p> <p>To give a couple of concrete examples, I helped <a href="http://blog.leahhanson.us/speaking.html">Leah</a> prepare for talks from 2013 to 2017. The first couple practice talks she gave were about the same as you'd expect if you walked into a random talk at a large tech conference. For the first couple years she was speaking, she did something like 30 or so practice runs for each public talk, of which I watched and gave feedback on half. Her first public talk was (IMO) well above average for a talk at a large, well regarded tech conference and her talks got better from there until she stopped speaking in 2017.</p> <p>As we discussed above, this is more subjective than game ratings and there's no way to really determine a percentile, but if you look at how most people prepare for talks, it's not too surprising that Leah was above average. At one of the first conferences she spoke at, the night before the conference, we talked to another speaker who mentioned that they hadn't finished their talk yet and only had fifteen minutes of material (for a forty minute talk). They were trying to figure out how to fill the rest of the time. That kind of preparation isn't unusual and the vast majority of talks prepared like that aren't great.</p> <p>Most people consider doing 30 practice runs for a talk to be absurd, a totally obsessive amount of practice, but I think Gary Bernhardt has it right when he says that, if you're giving a 30-minute talk to a 300 person audience, that's 150 person-hours watching your talk, so it's not obviously unreasonable to spend 15 hours practicing (and 30 practice runs will probably be less than 15 hours since you can cut a number of the runs short and/or repeatedly practice problem sections). 
One thing to note is that this level of practice, considered obsessive when giving a talk, still pales in comparison to the amount of time a middling table tennis club player will spend practicing.</p> <p>If you've studied pedagogy, you might say that the help I gave to Leah was incredibly lame. It's known that having laypeople try to figure out how to improve among themselves is among the worst possible ways to learn something; good instruction is more effective, and having a skilled coach or teacher give one-on-one instruction is more effective still<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">5</a></sup>. That's 100% true: my help was incredibly lame. However, most people aren't going to practice a talk more than a couple times and many won't even practice a single time (I don't have great data proving this; it's from informally polling speakers at conferences I've attended). This makes Leah's 30 practice runs an extraordinary amount of practice compared to most speakers, which resulted in a relatively good outcome even though we were using one of the worst possible techniques for improvement.</p> <p>My writing is another example. I'm not going to compare myself to anyone else, but my writing improved dramatically the first couple of years I wrote this blog just because I spent a little bit of effort on getting and taking feedback.</p> <p>Leah read one or two drafts of almost every post and gave me feedback. On the first posts, since neither one of us knew anything about writing, we had a hard time identifying what was wrong. If I had some awkward prose or confusing narrative structure, we'd be able to point at it and say &quot;that looks wrong&quot; without being able to describe what was wrong or suggest a fix. It was like, in the era before spellcheck, when you misspelled a word and could tell that something was wrong, but every permutation you came up with was just as wrong.</p> <p>My fix for that was to hire a professional editor whose writing I respected with the instructions &quot;I don't care about spelling and grammar fixes, there are fundamental problems with my writing that I don't understand, please explain to me what they are&quot;<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">6</a></sup>. I think this was more effective than my helping Leah with talks because we got someone who's basically a professional coach involved. An example of something my editor helped us with was giving us a vocabulary we could use to discuss structural problems, the way <a href="https://amzn.to/2GPPmYE">design patterns</a> gave people a vocabulary to talk about OO design.</p> <h3 id="back-to-this-blog-s-regularly-scheduled-topic-programming">Back to this blog's regularly scheduled topic: programming</h3> <p>Programming is similar to the real life examples above in that it's impossible to assign a rating or calculate percentiles or anything like that, but it is still possible to make significant improvements relative to your former self without too much effort by getting feedback on what you're doing.</p> <p>For example, <a href="https://twitter.com/danluu/status/926492239081197569">here's one thing Michael Malis did</a>:</p> <blockquote> <p>One incredibly useful exercise I’ve found is to watch myself program. Throughout the week, I have a program running in the background that records my screen. At the end of the week, I’ll watch a few segments from the previous week. 
Usually I will watch the times that felt like it took a lot longer to complete some task than it should have. While watching them, I’ll pay attention to specifically where the time went and figure out what I could have done better. When I first did this, I was really surprised at where all of my time was going.</p> <p>For example, previously when writing code, I would write all my code for a new feature up front and then test all of the code collectively. When testing code this way, I would have to isolate which function the bug was in and then debug that individual function. After watching a recording of myself writing code, I realized I was spending about a quarter of the total time implementing the feature tracking down which functions the bugs were in! This was completely non-obvious to me and I wouldn’t have found it out without recording myself. Now that I’m aware that I spent so much time isolating which function the bugs are in, I now test each function as I write it to make sure they work. This allows me to write code a lot faster as it dramatically reduces the amount of time it takes to debug my code.</p> </blockquote> <p>In the past, I've spent time figuring out where time is going when I code and basically saw the same thing as in Overwatch, except instead of constantly making game-losing mistakes, I was constantly doing pointlessly time-losing things. Just getting rid of some of those bad habits has probably been at least a 2x productivity increase for me, pretty easy to measure since fixing these is basically just clawing back wasted time. For example, I noticed how I'd get distracted for N minutes if I read something on the internet when I needed to wait for two minutes, so I made sure to keep a queue of useful work to fill dead time (and if I was working on something very latency sensitive where I didn't want to task switch, I'd do nothing until I was done waiting).</p> <p>One thing to note here is that it's important to actually track what you're doing and not just guess at what you're doing. When I've recorded what people do and compared it to what they think they're doing, these are often quite different. It would generally be considered absurd to operate a complex software system without metrics or tracing, but it's normal to operate yourself without metrics or tracing, even though you're much more complex and harder to understand than the software you work on.</p> <p><a href="//danluu.com/hn-comments/#what-makes-engineers-productive-https-news-ycombinator-com-item-id-5496914">Jonathan Tang has noted that choosing the right problem dominates execution speed</a>. I don't disagree with that, <a href="productivity-velocity/">but doubling execution speed is still a decent win that's independent of selecting the right problem to work on</a>. I don't think that how to choose the right problem can be effectively described in the abstract, and the context necessary to give examples would be much longer than the already too long Overwatch examples in this post, so maybe I'll write another post that's just about that.</p> <p>Anyway, this is sort of an odd post for me to write since I think that culturally, we care a bit too much about productivity in the U.S., especially in places I've lived recently (NYC &amp; SF). 
But at a personal level, higher productivity doing work or chores doesn't have to be converted into more work or chores, it can also be converted into more vacation time or more time doing whatever you value.</p> <p>And for games like Overwatch, I don't think improving is a moral imperative; there's nothing wrong with having fun at 50%-ile or 10%-ile or any rank. But in every game I've played with a rating and/or league/tournament system, a lot of people get really upset and unhappy when they lose even when they haven't put much effort into improving. If that's the case, why not put a little bit of effort into improving and spend a little bit less time being upset?</p> <h3 id="some-meta-techniques-for-improving">Some meta-techniques for improving</h3> <ul> <li>Get feedback and practice <ul> <li>Ideally from an expert coach but, if not, this can be from a layperson or even yourself (if you have some way of recording/tracing what you're doing)</li> </ul></li> <li>Guided exercises or exercises with solutions <ul> <li>This is very easy to find in books for &quot;old&quot; games, like chess or Bridge.</li> <li>For particular areas, you can often find series of books that have these, e.g., in math, books in the Springer Undergraduate Mathematics Series (SUMS) tend to have problems with solutions</li> </ul></li> </ul> <p>Of course, these aren't novel ideas, e.g., Kotov's series of books from the 70s, Think like a Grandmaster, Play Like a Grandmaster, Train Like a Grandmaster covered these same ideas because these are some of the most obvious ways to improve.</p> <h3 id="appendix-other-most-ridiculable-ideas">Appendix: other most ridiculable ideas</h3> <p>Here are the ideas I've posted about that were the most widely ridiculed at the time of the post:</p> <ul> <li><a href="//danluu.com/startup-tradeoffs/">It's not uncommon for programmers at trendy tech companies to make $350k/yr or more</a> (2015, stated number was $250k/yr at the time)</li> <li><a href="//danluu.com/monorepo/">Monorepos can be reasonable</a> (2015)</li> <li><a href="//danluu.com/cpu-bugs/">We should expect to see a lot more CPU bugs</a> (2016)</li> <li><a href="//danluu.com/tech-discrimination/">Markets are not incompatible with discrimination</a> (2014)</li> <li><a href="//danluu.com/input-lag/">Computers are getting slower in some ways</a> (2017)</li> <li><a href="//danluu.com/empirical-pl/">Empirical evidence on the benefit of types is almost non-existent</a> (2014)</li> <li>It's reasonable to write technical posts on a subject that avoid domain-specific terminology</li> </ul> <p>My posts on compensation have the dubious distinction of being the posts most frequently called out both for being so obvious that they're pointless as well as for being laughably wrong. I suspect they're also the posts that have had the largest aggregate impact on people -- I've had a double digit number of people tell me one of the compensation posts changed their life and they now make $x00,000/yr more than they used to because they know it's possible to get a much higher paying job and I doubt that I even hear from 10% of the people who make a big change as a result of learning that it's possible to make a lot more money.</p> <p>When I wrote my first post on compensation, in 2015, I got ridiculed more for writing something obviously wrong than for writing something obvious, but the last few years things have flipped around. 
I still get the occasional bit of ridicule for being wrong when some corner of Twitter or a web forum that's well outside the HN/reddit bubble runs across my post, but the ratio of “obviously wrong” to “obvious” has probably gone from 20:1 to 1:5.</p> <p>Opinions on monorepos have also seen a similar change since 2015. Outside of some folks at big companies, monorepos used to be considered obviously stupid among people who keep up with trends, but this has really changed. Not as much as opinions on compensation, but enough that I'm now a little surprised when I meet a hardline anti-monorepo-er.</p> <p>Although it's taken longer for opinions to come around on CPU bugs, that's probably the post that now gets the least ridicule from the list above.</p> <p>That markets don't eliminate all discrimination is the one where opinions have come around the least. Hardline &quot;all markets are efficient&quot; folks aren't really convinced by academic work like <a href="https://amzn.to/2Or7Z9k">Becker's The Economics of Discrimination</a> or summaries like <a href="//danluu.com/tech-discrimination/">the evidence laid out in the post</a>.</p> <p>The posts on computers having higher latency and the lack of empirical evidence of the benefit of types are the posts I've seen pointed to the most often to defend a ridiculable opinion. I didn't know what I'd find when I started doing the work for either post, and they both happen to have turned up evidence that's the opposite of the most common loud claims (that there's very good evidence that advanced type systems improve safety in practice, and that computers are of course faster in every way and people who think they're slower are just indulging in nostalgia). I don't know if this has changed many opinions. However, I haven't gotten much direct ridicule for either post even though both posts directly state a position I see commonly ridiculed online. I suspect that's partially because both posts are empirical, so there's not much to dispute (though the post on discrimination is also empirical and it still gets its share of ridicule).</p> <p>The last idea in the list is more meta: no one directly tells me that I should use more obscure terminology. Instead, I get comments that I must not know much about X because I'm not using terms of art. Using terms of art is a common way to establish credibility or authority, but that's something I don't really believe in. Arguing from authority doesn't tell you anything; adding needless terminology just makes things more difficult for readers who aren't in the field and are reading because they're interested in the topic but don't want to actually get into the field.</p> <p>This is a pretty fundamental disagreement that I have with a lot of people. Just for example, I recently got into a discussion with an authority who insisted that it wasn't possible for me to reasonably disagree with them (I suggested we agree to disagree) because they're an authority on the topic and I'm not. It happens that I worked on the formal verification of a system very similar to the system we were discussing, but I didn't mention that because I don't believe that my status as an authority on the topic matters. If someone has such a weak argument that they have to fall back on an infallible authority, that's usually a sign that they don't have a well-reasoned defense of their position. 
This goes double when they point to themselves as the infallible authority.</p> <p>I have about 20 other posts on stupid sounding ideas queued up in my head, but I mostly try to avoid writing things that are controversial, so I don't know that I'll write many of those up. If I were to write one post a month (much more frequently than my recent rate) and limit myself to 10% posts on ridiculable ideas, it would take 16 years to write up all of the ridiculable ideas I currently have.</p> <h3 id="appendix-commentary-on-improvement">Appendix: commentary on improvement</h3> <ul> <li><a href="https://www.youtube.com/watch?v=h1a-lhjmfPw">Skyline</a>: 99%-ile isn't that good</li> <li><a href="https://twitter.com/JamesClear/status/1292574538912456707">James Clear</a>: 90%-ile isn't that good</li> <li><a href="https://web.archive.org/web/20120428193015/http://jinfiesto.posterous.com/how-to-seem-good-at-everything-stop-doing-stu">Josh Infiesto</a>: 80%-ile isn't that good</li> <li><a href="https://www.newyorker.com/magazine/2011/10/03/personal-best">Atul Gawande</a>: coaching / feedback is powerful and underrated</li> </ul> <p><small>Thanks to Leah Hanson, Hillel Wayne, Robert Schuessler, Michael Malis, Kevin Burke, Jeremie Jost, Pierre-Yves Baccou, Veit Heller, Jeff Fowler, Malte Skarupe, David Turner, Akiva Leffert, Lifan Zeng, John Hergenroder, Wesley Aptekar-Cassels, Chris Lample, Julia Evans, Anja Boskovic, Vaibhav Sagar, Sean Talts, Emil Sit, Ben Kuhn, Valentin Hartmann, Sean Barrett, Kevin Shannon, Enzo Ferey, Andrew McCollum, Yuri Vishnevsky, and an anonymous commenter for comments/corrections/discussion.</small></p> <p><link rel=prefetch href=//danluu.com/hn-comments/> <link rel=prefetch href=//danluu.com/startup-tradeoffs/> <link rel=prefetch href=//danluu.com/monorepo/> <link rel=prefetch href=//danluu.com/cpu-bugs/> <link rel=prefetch href=//danluu.com/tech-discrimination/> <link rel=prefetch href=//danluu.com/input-lag/> <link rel=prefetch href=//danluu.com/empirical-pl/></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S"><p>The choice of Overwatch is arbitrary among activities I'm familiar with where:</p> <ul> <li>I know enough about the activity to comment on it</li> <li>I've observed enough people trying to learn it that I can say if it's &quot;easy&quot; or not to fix some mistake or class of mistake</li> <li>There's a large enough set of rated players to support the argument</li> <li>Many readers will also be familiar with the activity</li> </ul> <p>99% of my gaming background comes from 90s video games, but I'm not going to use those as examples because relatively few readers will be familiar with those games. I could also use &quot;modern&quot; board games like Puerto Rico, Dominion, Terra Mystica, <a href="https://en.wikipedia.org/wiki/Advanced_Squad_Leader">ASL</a> etc., but the set of people who played in rated games is very low, which makes the argument less convincing (perhaps people who play in rated games are much worse than people who don't — unlikely, but difficult to justify without comparing gameplay between rated and unrated games, which is pretty deep into weeds for this post).</p> <p>There are numerous activities that would be better to use than Overwatch, but I'm not familiar enough with them to use them as examples. 
For example, on reading a draft of this post, Kevin Burke noted that he's observed the same thing while coaching youth basketball and multiple readers noted that they've observed the same thing in chess, but I'm not familiar enough with youth basketball or chess to confidently say much about either activity even they'd be better examples because it's likely that more readers are familiar with basketball or chess than with Overwatch.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:O"><p>When I first started playing Overwatch (which is when I did that experiment), I ended up getting rated slightly above 50%-ile (for Overwatch players, that was in Plat -- this post is going to use percentiles and not ranks to avoid making non-Overwatch players have to learn what the ranks mean). It's generally believed and probably true that people who play the main ranked game mode in Overwatch are, on average, better than people who only play unranked modes, so it's likely that my actual percentile was somewhat higher than 50%-ile and that all &quot;true&quot; percentiles listed in this post are higher than the nominal percentiles.</p> <p>Some things you'll regularly see at slightly above 50%-ile are:</p> <ul> <li>Supports (healers) will heal someone who's at full health (which does nothing) while a teammate who's next to them is dying and then dies</li> <li>Players will not notice someone who walks directly behind the team and kills people one at a time until the entire team is killed</li> <li>Players will shoot an enemy until only one more shot is required to kill the enemy and then switch to a different target, letting the 1-health enemy heal back to full health before shooting at that enemy again</li> <li>After dying, players will not wait for their team to respawn and will, instead, run directly into the enemy team to fight them 1v6. This will repeat for the entire game (the game is designed to be 6v6, but in ranks below 95%-ile, it's rare to see a 6v6 engagement after one person on one team dies)</li> <li>Players will clearly have no idea what character abilities do, including for the character they're playing</li> <li>Players go for very high risk but low reward plays (for Overwatch players, a classic example of this is Rein going for a meme pin when the game opens on 2CP defense, very common at 50%-ile, rare at 95%-ile since players who think this move is a good idea tend to have generally poor decision making).</li> <li>People will have terrible aim and will miss four or five shots in a row when all they need to do is hit someone once to kill them</li> <li>If a single flanking enemy threatens a healer who can't escape plus a non-healer with an escape ability, the non-healer will probably use their ability to run away, leaving the healer to die, even though they could easily kill the flanker and save their healer if they just attacked while being healed.</li> </ul> <p>Having just one aspect of your gameplay be merely bad instead of atrocious is enough to get to 50%-ile. For me, that was my teamwork, for others, it's other parts of their gameplay. The reason I'd say that my teamwork was bad and not good or even mediocre was that I basically didn't know how to play the game, didn't know what any of the characters’ strengths, weaknesses, and abilities are, so I couldn't possibly coordinate effectively with my team. I also didn't know how the game modes actually worked (e.g., under what circumstances the game will end in a tie vs. 
going into another round), so I was basically wandering around randomly with a preference towards staying near the largest group of teammates I could find. That's above average.</p> <p>You could say that someone is pretty good at the game since they're above average. But in a non-relative sense, being slightly above average is quite bad -- it's hard to argue that someone who doesn't notice their entire team being killed from behind while two teammates are yelling &quot;[enemy] behind us!&quot; over voice comms isn't bad.</p> <p>After playing a bit more, I ended up with what looks like a &quot;true&quot; rank of about 90%-ile when I'm using a character I know how to use. Due to volatility in ranking as well as matchmaking, I played in games as high as 98%-ile. My aim and dodging were still atrocious. Relative to my rank, my aim was actually worse than when I was playing in 50%-ile games since my opponents were much better and I was only a little bit better. In 90%-ile, two copies of myself would probably lose fighting against most people 2v1 in the open. I would also usually lose a fight if the opponent was in the open and I was behind cover such that only 10% of my character was exposed, so my aim was arguably more than 10x worse than median at my rank.</p> <p>My &quot;trick&quot; for getting to 90%-ile despite being a 1/10th aimer was learning how the game worked and playing in a way that maximized the probability of winning (to the best of my ability), as opposed to playing the game like it's an <a href="https://en.wikipedia.org/wiki/Deathmatch">FFA</a> game where your goal is to get kills as quickly as possible. It takes a bit more context to describe what this means in 90%-ile, so I'll only provide a couple examples, but these are representative of mistakes the vast majority of 90%-ile players are making all of the time (with the exception of a few players who have grossly defective aim, like myself, who make up for their poor aim by playing better than average for the rank in other ways).</p> <p>Within the game, the goal of the game is to win. There are different game modes, but for the mainline ranked game, they all will involve some kind of objective that you have to be on or near. It's very common to get into a situation where the round timer is ticking down to zero and your team is guaranteed to lose if no one on your team touches the objective but your team may win if someone can touch the objective and not die instantly (which will cause the game to go into overtime until shortly after both teams stop touching the objective). A concrete example of this that happens somewhat regularly is, the enemy team has four players on the objective while your team has two players near the objective, one tank and one support/healer. The other four players on your team died and are coming back from spawn. They're close enough that if you can touch the objective and not instantly die, they'll arrive and probably take the objective for the win, but they won't get there in time if you die immediately after taking the objective, in which case you'll lose.</p> <p>If you're playing the support/healer at 90%-ile to 95%-ile, this game will almost always end as follows: the tank will move towards the objective, get shot, decide they don't want to take damage, and then back off from the objective. As a support, you have a small health pool and will die instantly if you touch the objective because the other team will shoot you. 
Since your team is guaranteed to lose if you don't move up to the objective, you're forced to do so to have any chance of winning. After you're killed, the tank will either move onto the objective and die or walk towards the objective but not get there before time runs out. Either way, you'll probably lose.</p> <p>If the tank did their job and moved onto the objective before you died, you could heal the tank for long enough that the rest of your team will arrive and you'll probably win. The enemy team, if they were coordinated, could walk around or through the tank to kill you, but they won't do that -- anyone who knows that doing so will win them the game and can aim well enough to successfully follow through can't help but end up in a higher rank. And the hypothetical tank on your team who knows that it's their job to absorb damage for their support in that situation and not vice versa won't stay at 95%-ile very long because they'll win too many games and move up to a higher rank.</p> <p>Another basic situation that the vast majority of 90%-ile to 95%-ile players will get wrong is when you're on offense, waiting for your team to respawn so you can attack as a group. Even at 90%-ile, maybe 1/4 to 1/3 of players won't do this and will just run directly at the enemy team, but enough players will realize that 1v6 isn't a good idea that you'll often see 5v6 or 6v6 fights instead of the constant 1v6 and 2v6 fights you see at 50%-ile. Anyway, while waiting for the team to respawn in order to get a 5v6, it's very likely one player who realizes that they shouldn't just run into the middle of the enemy team 1v6 will decide they should try to hit the enemy team with long-ranged attacks 1v6. People will do this instead of hiding in safety behind a wall even when the enemy has multiple snipers with instant-kill long range attacks. People will even do this against multiple snipers when they're playing a character that isn't a sniper and needs to hit the enemy 2-3 times to get a kill, making it overwhelmingly likely that they won't get a kill while taking a significant risk of dying themselves. For Overwatch players, people will also do this when they have full ult charge and the other team doesn't, turning a situation that should be to your advantage (your team has ults ready and the other team has used ults) into a neutral situation (both teams have ults) <em>at best</em>, and instantly losing the fight at worst.</p> <p>If you ever read an Overwatch forum, whether that's one of the reddit forums or the official Blizzard forums, a common complaint is &quot;why are my teammates so bad? I'm at [90%-ile to 95%-ile rank], but all my teammates are doing obviously stupid game-losing things all the time, like [an example above]&quot;. The answer is, of course, that the person asking the question is also doing obviously stupid game-losing things all the time because anyone who doesn't constantly make major blunders wins too much to stay at 95%-ile. This also applies to me.</p> <p>People will argue that players at this rank <em>should</em> be good because they're better than 95% of other players, which makes them relatively good. But non-relatively, it's hard to argue that someone who doesn't realize that you should step on the objective to probably win the game instead of not touching the objective for a sure loss is good. 
One of the most basic things about Overwatch is that it's an objective-based game, but the majority of players at 90%-ile to 95%-ile don't play that way.</p> <p>For anyone who isn't well into the 99%-ile, reviewing recorded games will reveal game-losing mistakes all the time. For myself, usually ranked 90%-ile or so, watching a recorded game will reveal tens of game losing mistakes in a close game (which is maybe 30% of losses, the other 70% are blowouts where there isn't a single simple mistake that decides the game).</p> <p>It's generally not too hard to fix these since the mistakes are like the example above: simple enough that once you see that you're making the mistake, the fix is straightforward because the mistake is straightforward.</p> <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:A"><p>There are probably some people who just want to be angry at their teammates. Due to how infrequently you get matched with the same players, it's hard to see this in the main rated game mode, but I think you can sometimes see this when Overwatch sometimes runs mini-rated modes.</p> <p>Mini-rated modes have a much smaller playerbase than the main rated mode, which has two notable side effects: players with a much wider variety of skill levels will be thrown into the same game and you'll see the same players over and over again if you play multiple games.</p> <p>Since you ended up matched with the same players repeatedly, you'll see players make the same mistakes and cause themselves to lose in the same way and then have the same tantrum and blame their teammates in the same way game after game.</p> <p>You'll also see tantrums and teammate blaming in the normal rated game mode, but when you see it, you generally can't tell if the person who's having a tantrum is just having a bad day or if it's some other one-off occurrence since, unless you're ranked very high or very low (where there's a smaller pool of closely rated players), you don't run into the same players all that frequently. But when you see a set of players in 15-20 games over the course of a few weeks and you see them lose the game for the same reason a double digit number of times followed by the exact same tantrum, you might start to suspect that some fraction of those people really want to be angry and that the main thing they're getting out of playing the game is a source of anger. You might also wonder about this from how some people use social media, but that's a topic for another post.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:D"><p>For example, there will also be players who have some kind of disability that prevents them from improving, but at the levels we're talking about, 99%-ile or below, that will be relatively rare (certainly well under 50%, and I think it's not unreasonable to guess that it's well under 10% of people who choose to play the game). IIRC, there's at least one player who's in the top 500 who's deaf (this is severely disadvantageous since sound cues give a lot of fine-grained positional information that cannot be obtained in any other way), at least one legally blind player who's 99%-ile, and multiple players with physical impairments that prevent them from having fine-grained control of a mouse, i.e., who are basically incapable of aiming, who are 99%-ile.</p> <p>There are also other kinds of reasons people might not improve. 
For example, Kevin Burke has noted that when he coaches youth basketball, some children don't want to do drills that they think make them look foolish (e.g., avoiding learning to dribble with their off hand even during drills where everyone is dribbling poorly because they're using their off hand). When I spent a lot of time in a climbing gym with a world class coach who would regularly send a bunch of kids to nationals and some to worlds, I'd observe the same thing in his classes -- kids, even ones who are nationally or internationally competitive, would sometimes avoid doing things because they were afraid it would make them look foolish to their peers. The coach's solution in those cases was to deliberately make the kid look extremely foolish and tell them that it's better to look stupid now than at nationals.</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:C">note that, here, a skilled coach is someone who is skilled at coaching, not necessarily someone who is skilled at the activity. People who are skilled at the activity but who haven't explicitly been taught how to teach or spent a lot of time working on teaching are generally poor coaches. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:F">If you read the acknowledgements section of any of my posts, you can see that I get feedback from more than just two people on most posts (and I really appreciate the feedback), but I think that, by volume, well over 90% of the feedback I've gotten has come from Leah and a professional editor. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> </ol> </div> Algorithms interviews: theory vs. practice algorithms-interviews/ Sun, 05 Jan 2020 00:00:00 +0000 algorithms-interviews/ <p>When I ask people at trendy big tech companies why algorithms quizzes are mandatory, the most common answer I get is something like &quot;we have so much scale, we can't afford to have someone accidentally write an <code>O(n^2)</code> algorithm and bring the site down&quot;<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>. One thing I find funny about this is, even though a decent fraction of the value I've provided for companies has been solving phone-screen level algorithms problems on the job, I can't pass algorithms interviews! When I say that, people often think I mean that I fail half my interviews or something. It's more than half.</p> <p>When I wrote a draft blog post of my interview experiences, draft readers panned it as too boring and repetitive because I'd failed too many interviews. I should summarize my failures as a table because no one's going to want to read a 10k word blog post that's just a series of failures, they said (which is good advice; I'm working on a version with a table). I’ve done maybe 40-ish &quot;real&quot; software interviews and passed maybe one or two of them (arguably zero)<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>.</p> <p>Let's look at a few examples to make it clear what I mean by &quot;phone-screen level algorithms problem&quot;, above.</p> <p>At one big company I worked for, a team wrote a core library that implemented a resizable array for its own purposes. On each resize that overflowed the array's backing store, the implementation added a constant number of elements and then copied the old array to the newly allocated, slightly larger, array. 
This is a classic example of how not to <a href="https://en.wikipedia.org/wiki/Dynamic_array">implement a resizable array</a> since it results in linear time resizing instead of <a href="https://en.wikipedia.org/wiki/Amortized_analysis">amortized constant time</a> resizing. It's such a classic example that it's often used as the canonical example when demonstrating amortized analysis.</p> <p>For people who aren't used to big tech company phone screens, typical phone screens that I've received are one of:</p> <ul> <li>an &quot;easy&quot; coding/algorithms question, maybe with a &quot;very easy&quot; warm-up question in front.</li> <li>a series of &quot;very easy&quot; coding/algorithms questions,</li> <li>a bunch of trivia (rare for generalist roles, but not uncommon for low-level or performance-related roles)</li> </ul> <p>This array implementation problem is considered to be so easy that it falls into the &quot;very easy&quot; category and is either a warm-up for the &quot;real&quot; phone screen question or is bundled up with a bunch of similarly easy questions. And yet, this resizable array was responsible for roughly 1% of all GC pressure across all JVM code at the company (it was the second largest source of allocations across all code) as well as a significant fraction of CPU. Luckily, the resizable array implementation wasn't used as a generic resizable array and it was only instantiated by a semi-special-purpose wrapper, which is what allowed this to &quot;only&quot; be responsible for 1% of all GC pressure at the company. If asked as an interview question, it's overwhelmingly likely that most members of the team would've implemented this correctly in an interview. My fixing this made my employer more money annually than I've made in my life.</p> <p>That was the second largest source of allocations, the number one largest source was converting a pair of <code>long</code> values to byte arrays in the same core library. It appears that this was done because someone wrote or copy pasted a hash function that took a byte array as input, then modified it to take two inputs by taking two byte arrays and operating on them in sequence, which left the hash function interface as <code>(byte[], byte[])</code>. In order to call this function on two longs, they used a handy <code>long</code> to <code>byte[]</code> conversion function in a widely used utility library. That function, in addition to allocating a <code>byte[]</code> and stuffing a <code>long</code> into it, also reverses the endianness of the long (the function appears to have been intended to convert <code>long</code> values to network byte order).</p> <p>Unfortunately, switching to a more appropriate hash function would've been a major change, so my fix for this was to change the hash function interface to take a pair of longs instead of a pair of byte arrays and have the hash function do the endianness reversal instead of doing it as a separate step (since the hash function was already shuffling bytes around, this didn't create additional work). Removing these unnecessary allocations made my employer more money annually than I've made in my life.</p> <p>Finding a constant factor speedup isn't technically an algorithms question, but it's also something you see in algorithms interviews. 
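<p>(As an aside, to make the resizable array example above concrete, here's a minimal sketch of the two growth policies in C. This is purely illustrative and isn't the actual code, which was JVM library code; error handling is omitted.)</p> <pre><code>/* Illustrative sketch, not the real implementation: grow_by_constant
   is the problematic growth policy described above and grow_by_doubling
   is the standard fix. realloc copies the old contents on grow. */
#include &lt;stdlib.h&gt;

#define GROWTH 16

typedef struct { long *data; size_t len, cap; } vec;

static void grow_by_constant(vec *v) {
    v-&gt;cap += GROWTH;                      /* add a constant number of slots */
    v-&gt;data = realloc(v-&gt;data, v-&gt;cap * sizeof *v-&gt;data);
}

static void grow_by_doubling(vec *v) {
    v-&gt;cap = v-&gt;cap ? v-&gt;cap * 2 : GROWTH; /* grow geometrically instead */
    v-&gt;data = realloc(v-&gt;data, v-&gt;cap * sizeof *v-&gt;data);
}

static void append(vec *v, long x, void (*grow)(vec *)) {
    if (v-&gt;len == v-&gt;cap)
        grow(v);
    v-&gt;data[v-&gt;len++] = x;
}
</code></pre> <p>With the constant-increment policy, appending n elements copies roughly n^2 / (2 * GROWTH) elements in total, while doubling copies a total of about n elements, which is where the amortized constant time per append comes from.</p>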
As a follow-up to an algorithms question, I commonly get asked &quot;can you make this faster?&quot; The answer to these often involves doing a simple optimization that will result in a constant factor improvement.</p> <p>A concrete example that I've been asked twice in interviews is: you're storing IDs as ints, but you already have some context in the question that lets you know that the IDs are densely packed, so you can store them as a bitfield instead (there's a sketch of this below). The difference between the bitfield interview question and the real-world superfluous array is that the real-world existing solution is so far afield from the expected answer that you probably wouldn’t be asked to find a constant factor speedup. More likely, you would've failed the interview at that point.</p> <p>To pick an example from another company, the configuration for <a href="bitfunnel-sigir.pdf">BitFunnel</a>, a search index used in Bing, is another example of an interview-level algorithms question<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">3</a></sup>.</p> <p>The full context necessary to describe the solution is a bit much for this blog post, but basically, there's a set of bloom filters that needs to be configured. One way to do this (which I'm told was being done) is to write a black-box optimization function that uses gradient descent to try to find an optimal solution. I'm told this always resulted in some strange properties and the output configuration always resulted in non-idealities which were worked around by making the backing bloom filters less dense, i.e. throwing more resources (and therefore money) at the problem.</p> <p>To create a more optimized solution, you can observe that the fundamental operation in BitFunnel is equivalent to multiplying probabilities together, so, for any particular configuration, you can just multiply some probabilities together to determine how a configuration will perform. Since the configuration space isn't all that large, you can then put this inside a few for loops and iterate over the space of possible configurations and then pick out the best set of configurations. This isn't quite right because multiplying probabilities assumes a kind of independence that doesn't hold in reality, but that seems to work ok for the same reason that <a href="https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering">naive Bayesian spam filtering</a> worked pretty well when it was introduced even though it incorrectly assumes the probability of any two words appearing in an email are independent. And if you want the full solution, you can work out the non-independent details, although that's probably beyond the scope of an interview.</p> <p>Those are just three examples that came to mind; I run into this kind of thing all the time and could come up with tens of examples off the top of my head, perhaps more than a hundred if I sat down and tried to list every example I've worked on, certainly more than a hundred if I list examples I know of that someone else (or no one) has worked on. 
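<p>(Here's the bitfield version of the ID-storage question mentioned above, as a minimal sketch. The interview versions I was asked had more context around them; the names and the fixed ID range here are made up for illustration.)</p> <pre><code>/* Sketch: instead of storing each ID in an int (32 bits per stored ID),
   keep one bit per possible ID. MAX_ID and the function names are made
   up for illustration. */
#include &lt;stdint.h&gt;

#define MAX_ID 1000000

static uint64_t present[(MAX_ID + 63) / 64];

static void id_add(uint32_t id)      { present[id / 64] |= (uint64_t)1 &lt;&lt; (id % 64); }
static void id_remove(uint32_t id)   { present[id / 64] &amp;= ~((uint64_t)1 &lt;&lt; (id % 64)); }
static int  id_contains(uint32_t id) { return (present[id / 64] &gt;&gt; (id % 64)) &amp; 1; }
</code></pre> <p>When most possible IDs are actually present, that's roughly a 32x reduction in memory, which is the kind of constant factor improvement these follow-up questions are usually looking for.</p>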
Both the examples in this post as well as the ones I haven’t included have these properties:</p> <ul> <li>The example could be phrased as an interview question</li> <li>If phrased as an interview question, you'd expect most (and probably all) people on the relevant team to get the right answer in the timeframe of an interview</li> <li>The cost savings from fixing the example is worth more annually than my lifetime earnings to date</li> <li>The example persisted for long enough that it's reasonable to assume that it wouldn't have been discovered otherwise</li> </ul> <p>At the start of this post, we noted that people at big tech companies commonly claim that they have to do algorithms interviews since it's so costly to have inefficiencies at scale. My experience is that these examples are legion at every company I've worked for that does algorithms interviews. Trying to get people to solve algorithms problems on the job by asking algorithms questions in interviews doesn't work.</p> <p>One reason is that even though big companies try to make sure that the people they hire can solve algorithms puzzles, they also incentivize many or most developers to avoid deploying that kind of reasoning to make money.</p> <p>Of the three solutions for the examples above, two are in production and one isn't. That's about my normal hit rate if I go to a random team with a diff and don't persistently follow up (as opposed to a team that I have reason to believe will be receptive, or a team that's asked for help, or if I keep pestering a team until the fix gets taken).</p> <p>If you're very cynical, you could argue that it's surprising the success rate is that high. If I go to a random team, it's overwhelmingly likely that efficiency is in neither the team's objectives nor their org's objectives. The company is likely to have spent a decent amount of effort incentivizing teams to hit their objectives -- what's the point of having objectives otherwise? Accepting my diff will require them to test, integrate, and deploy the change and will create risk (because all deployments have non-zero risk). Basically, I'm asking teams to do some work and take on some risk to do something that's worthless to them. Despite incentives, people will usually take the diff, but they're not very likely to spend a lot of their own spare time trying to find efficiency improvements (and their normal work time will be spent on things that are aligned with the team's objectives)<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup>.</p> <p>Hypothetically, let's say a company didn't try to ensure that its developers could pass algorithms quizzes but did incentivize developers to use relatively efficient algorithms. I don't think any of the three examples above could have survived, undiscovered, for years nor could they have remained unfixed. Some hypothetical developer working at a company where people profile their code would likely have looked at the hottest items in the profile for the most computationally intensive library at the company. The &quot;trick&quot; for the first two examples isn't any kind of algorithms wizardry, it's just looking at the profile at all, which is something incentives can fix. The third example is less inevitable since there isn't a standard tool that will tell you to look at the problem. 
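<p>(To give a sense of why the third example is still only an interview-level problem once you see the idea, here's a heavily simplified sketch of the brute-force search described above. The probability model, the parameter ranges, and the cost model are all made up for illustration; they aren't BitFunnel's actual configuration parameters.)</p> <pre><code>/* Sketch of &quot;multiply probabilities inside a few for loops&quot;: treat each
   row as matching a non-matching document with probability equal to its
   bit density, assume rows are independent, and brute-force the
   configuration space for the cheapest config that meets a false
   positive target. All numbers here are invented for illustration. */
#include &lt;stdio.h&gt;

int main(void) {
    const double fp_target = 1e-4;   /* invented false positive budget */
    double best_cost = 1e30, best_density = 0;
    int best_rows = 0;

    for (int rows = 1; rows &lt;= 16; rows++) {
        for (int d = 1; d &lt;= 50; d++) {
            double density = d / 100.0;

            /* multiply the per-row match probabilities together */
            double fp = 1.0;
            for (int r = 0; r &lt; rows; r++)
                fp *= density;
            if (fp &gt; fp_target)
                continue;            /* config doesn't meet the target */

            /* toy cost model: lower density and more rows cost more memory */
            double cost = rows / density;
            if (cost &lt; best_cost) {
                best_cost = cost;
                best_rows = rows;
                best_density = density;
            }
        }
    }
    printf(&quot;rows=%d density=%.2f cost=%.1f\n&quot;, best_rows, best_density, best_cost);
    return 0;
}
</code></pre> <p>The point isn't the specific numbers; it's that once you have a closed-form estimate for how a configuration performs, exhaustively checking the whole space is a few nested loops rather than a black-box optimizer.</p>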
It would also be easy to try to spin the result as some kind of wizardry -- that example formed the core part of a paper that won &quot;best paper award&quot; at the top conference in its field (IR), but the reality is that the &quot;trick&quot; was applying high school math, which means the real trick was having enough time to look at places where high school math might be applicable to find one.</p> <p>I actually worked at a company that used the strategy of &quot;don't ask algorithms questions in interviews, but do incentivize things that are globally good for the company&quot;. During my time there, I only found one single fix that nearly meets the criteria for the examples above (if the company had more scale, it would've met all of the criteria, but due to the company's size, increases in efficiency were worth much less than at big companies -- much more than I was making at the time, but the annual return was still less than my total lifetime earnings to date).</p> <p>I think the main reason that I only found one near-example is that enough people viewed making the company better as their job, so straightforward high-value fixes tended not exist because systems were usually designed such that they didn't really have easy to spot improvements in the first place. In the rare instances where that wasn't the case, there were enough people who were trying to do the right thing for the company (instead of being forced into obeying local incentives that are quite different from what's globally beneficial to the company) that someone else was probably going to fix the issue before I ever ran into it.</p> <p>The algorithms/coding part of that company's interview (initial screen plus onsite combined) was easier than the phone screen at major tech companies and we basically didn't do a system design interview.</p> <p>For a while, we tried an algorithmic onsite interview question that was on the hard side but in the normal range of what you might see in a BigCo phone screen (but still easier than you'd expect to see at an onsite interview). We stopped asking the question because every new grad we interviewed failed the question (we didn't give experienced candidates that kind of question). We simply weren't prestigious enough to get candidates who can easily answer those questions, so it was impossible to hire using <a href="//danluu.com/programmer-moneyball/">the same trendy hiring filters that everybody else had</a>. In contemporary discussions on interviews, what we did is often called &quot;lowering the bar&quot;, but it's unclear to me why we should care how high of a bar someone can jump over when little (and in some cases none) of the job they're being hired to do involves jumping over bars. And, in the cases where you do want them to jump over bars, they're maybe 2&quot; high and can easily be walked over.</p> <p>When measured on actual productivity, that was the most productive company I've worked for. I believe the reasons for that are cultural and too complex to fully explore in this post, but I think it helped that we didn't filter out perfectly good candidates with algorithms quizzes and assumed people could pick that stuff up on the job if we had a culture of people generally doing the right thing instead of focusing on local objectives.</p> <p>If other companies want people to solve interview-level algorithms problems on the job perhaps they could try incentivizing people to solve algorithms problems (when relevant). 
That could be done in addition to or even instead of filtering for people who can whiteboard algorithms problems.</p> <h3 id="appendix-how-did-we-get-here">Appendix: how did we get here?</h3> <p>Way back in the day, interviews often involved &quot;trivia&quot; questions. Modern versions of these might look like the following:</p> <ul> <li>What's MSI? MESI? MOESI? MESIF? What's the advantage of MESIF over MOESI?</li> <li>What happens when you throw in a destructor? What if it's C++11? What if a sub-object's destructor that's being called by a top-level destructor throws, which other sub-object destructors will execute? What if you throw during stack unwinding? Under what circumstances would that not cause <code>std::terminate</code> to get called?</li> </ul> <p>I heard about this practice back when I was in school and even saw it with some &quot;old school&quot; companies. This was back when Microsoft was the biggest game in town and people who wanted to copy a successful company were likely to copy Microsoft. The most widely read programming blogger at the time (Joel Spolsky) was telling people they need to adopt software practice X because Microsoft was doing it and they couldn't compete without adopting the same practices. For example, in one of the most influential programming blog posts of the era, Joel Spolsky advocates for what he called the Joel test in part by saying that you have to do these things to keep up with companies like Microsoft:</p> <blockquote> <p>A score of 12 is perfect, 11 is tolerable, but 10 or lower and you’ve got serious problems. The truth is that most software organizations are running with a score of 2 or 3, and they need <em>serious</em> help, because companies like Microsoft run at 12 full-time.</p> </blockquote> <p>At the time, popular lore was that Microsoft asked people questions like the following (and I was actually asked one of these brainteasers during my own interview with Microsoft around 2001, along with precisely zero algorithms or coding questions):</p> <ul> <li>how would you escape from a blender if you were half an inch tall?</li> <li>why are manhole covers round?</li> <li>a windowless room has 3 lights, each of which is controlled by a switch outside of the room. You are outside the room. You can only enter the room once. How can you determine which switch controls which lightbulb?</li> </ul> <p>Since I was interviewing during the era when this change was happening, I got asked plenty of trivia questions as well as plenty of brainteasers (including all of the above brainteasers). Some other questions that aren't technically brainteasers that were popular at the time were <a href="https://en.wikipedia.org/wiki/Fermi_problem">Fermi problems</a>. Another trend at the time was for behavioral interviews and a number of companies I interviewed with had 100% behavioral interviews with zero technical interviews.</p> <p>Anyway, back then, people needed a rationalization for copying Microsoft-style interviews. When I asked people why they thought brainteasers or Fermi questions were good, the convenient rationalization people told me was usually that they tell you if a candidate can really think, unlike those silly trivia questions, which only tell you if people have memorized some trivia. 
What we really need to hire are candidates who can really think!</p> <p>Looking back, people now realize that this wasn't effective and cargo culting Microsoft's every decision won't make you as successful as Microsoft because Microsoft's success came down to a few key things plus network effects, so copying how they interview can't possibly turn you into Microsoft. Instead, it's going to turn you into a company that interviews like Microsoft but isn't in a position to take advantage of the network effects that Microsoft was able to take advantage of.</p> <p>For interviewees, the process with brainteasers was basically as it is now with algorithms questions, except that you'd review <a href="https://www.amazon.com/gp/product/0316778494/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0316778494&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=bcc3345d4905164e75b79f2cd2a0f5fb">How Would You Move Mount Fuji</a> before interviews instead of <a href="https://www.amazon.com/gp/product/0984782850/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0984782850&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=381539624231e0e40b66f2695c925536">Cracking the Coding Interview</a> to pick up a bunch of brainteaser knowledge that you'll never use on the job instead of algorithms knowledge you'll never use on the job.</p> <p>Back then, interviewers would learn about questions specifically from interview prep books like &quot;How Would You Move Mount Fuji?&quot; and then ask them to candidates who learned the answers from books like &quot;How Would You Move Mount Fuji?&quot;. When I talk to people who are ten years younger than me, they think this is ridiculous -- those questions obviously have nothing to do with the job and being able to answer them well is much more strongly correlated with having done some interview prep than being competent at the job. Hillel Wayne has discussed how people come up with interview questions today (and I've also seen it firsthand at a few different companies) and, outside of groups that are testing for knowledge that's considered specialized, it doesn't seem all that different today.</p> <p>At this point, we've gone through a few decades of programming interview fads, each one of which looks ridiculous in retrospect. Either we've finally found the real secret to interviewing effectively and have reasoned our way past whatever roadblocks were causing everybody in the past to use obviously bogus fad interview techniques, or we're in the middle of another fad, one which will seem equally ridiculous to people looking back a decade or two from now.</p> <p>Without knowing anything about the effectiveness of interviews, at a meta level, since the way people get interview techniques is the same (crib the high-level technique from the most prestigious company around), I think it would be pretty surprising if this wasn't a fad. I would be less surprised to discover that current techniques were not a fad if people were doing or referring to empirical research or had independently discovered what works.</p> <p>Inspired by <a href="https://twitter.com/danluu/status/925728187375595521">a comment by Wesley Aptekar-Cassels</a>, the last time I was looking for work, I asked some people how they checked the effectiveness of their interview process and how they tried to reduce bias in their process. The answers I got (grouped together when similar, in decreasing order of frequency) were:</p> <ul> <li>Huh? 
We don't do that and/or why would we do that?</li> <li>We don't really know if our process is effective</li> <li>I/we just know that it works</li> <li>I/we aren't biased</li> <li>I/we would notice bias if it existed, which it doesn't</li> <li>Someone looked into it and/or did a study, but no one who tells me this can ever tell me anything concrete about how it was looked into or what the study's methodology was</li> </ul> <h3 id="appendix-training">Appendix: training</h3> <p>As with most real world problems, when trying to figure out why seven, eight, or even nine figure per year interview-level algorithms bugs are lying around waiting to be fixed, there isn't a single &quot;root cause&quot; you can point to. Instead, there's a kind of <a href="https://en.wikipedia.org/wiki/Hedgehog_defence">hedgehog defense</a> of misaligned incentives. Another part of this is that <a href="//danluu.com/programmer-moneyball/#training-mentorship">training is woefully underappreciated</a>.</p> <p>We've discussed that, at all but one company I've worked for, there are incentive systems in place that cause developers to feel like they shouldn't spend time looking at efficiency gains even when a simple calculation shows that there are tens or hundreds of millions of dollars in waste that could easily be fixed. And then because this isn't incentivized, developers tend to not have experience doing this kind of thing, making it unfamiliar, which makes it feel harder than it is. So even when a day of work could return $1m/yr in savings or profit (quite common at large companies, in my experience), people don't realize that it's only a day of work and could be done with only a small compromise to velocity. One way to solve this latter problem is with training, but that's even harder to get credit for than efficiency gains that aren't in your objectives!</p> <p>Just for example, I once wrote a moderate length tutorial (4500 words, shorter than this post by word count, though probably longer if you add images) on how to find various inefficiencies (how to use an allocation or CPU time profiler, how to do service-specific GC tuning for the GCs we use, how to use some tooling I built that will automatically find inefficiencies in your JVM or container configs, etc., basically things that are simple and often high impact that it's easy to write a runbook for; if you're at Twitter, you can read this at <a href="http://go/easy-perf">http://go/easy-perf</a>). I've had a couple people who would've previously come to me for help with an issue tell me that they were able to debug and fix an issue on their own and, secondhand, I heard that a couple other people who I don't know were able to go off and increase the efficiency of their service. I'd be surprised if I’ve heard about even 10% of cases where this tutorial helped someone, so I'd guess that this has helped tens of engineers, and possibly quite a few more.</p> <p>If I'd spent a week doing &quot;real&quot; work instead of writing a tutorial, I'd have something concrete, with quantifiable value, that I could easily put into a promo packet or performance review. Instead, I have this nebulous thing that, at best, counts as a bit of &quot;extra credit&quot;. I'm not complaining about this in particular -- this is exactly the outcome I expected. But, on average, companies get what they incentivize. 
If they expect training to come from developers (as opposed to hiring people to produce training materials, which tends to be very poorly funded compared to engineering) but don't value it as much as they value dev work, then there's going to be a shortage of training.</p> <p>I believe you can also see training under-incentivized in public educational materials due to the relative difficulty of monetizing education and training. If you want to monetize explaining things, there are a few techniques that seem to work very well. If it's something that's directly obviously valuable, selling a video course that's priced &quot;very high&quot; (hundreds or thousands of dollars for a short course) seems to work. Doing corporate training, where companies fly you in to talk to a room of 30 people and you charge $3k per head, also works pretty well.</p> <p>If you want to reach (and potentially help) a lot of people, putting text on the internet and giving it away works pretty well, but monetization for that works poorly. For technical topics, I'm not sure the non-ad-blocking audience is really large enough to monetize via ads (as opposed to a pay wall).</p> <p>Just for example, Julia Evans can support herself from her <a href="https://jvns.ca/blog/2019/10/01/zine-revenue-2019/">zine income</a>, which she's said has brought in roughly $100k/yr for the past two years. Someone who does very well in corporate training can pull that in with a one or two day training course and, from what I've heard of corporate speaking rates, some highly paid tech speakers can pull that in with two engagements. Those are significantly above average rates, especially for speaking engagements, but since we're comparing to Julia Evans, I don't think it's unfair to use an above average rate.</p> <h3 id="appendix-misaligned-incentive-hedgehog-defense-part-3">Appendix: misaligned incentive hedgehog defense, part 3</h3> <p>Of the three examples above, I found one on a team where it was clearly worth zero to me to do anything that was actually valuable to the company and the other two on a team where it was valuable to me to do things that were good for the company, regardless of what they were. In my experience, that's very unusual for a team at a big company, but even on that team, incentive alignment was still quite poor. At one point, after getting a promotion and a raise, I computed the ratio of the amount of money my changes made the company vs. my raise and found that my raise was 0.03% of the money that I made the company, only counting easily quantifiable and totally indisputable impact to the bottom line. The vast majority of my work was related to tooling that had a difficult to quantify value that I suspect was actually larger than the value of the quantifiable impact, so I probably received well under 0.01% of the marginal value I was producing. And that's really an overestimate of how incentivized I was to do the work -- at the margin, I strongly suspect that anything I did was worth zero to me. After the first $10m/yr or maybe $20m/yr, there's basically no difference in terms of performance reviews, promotions, raises, etc. 
Because there was no upside to doing work and there's some downside (could get into a political fight, could bring the site down, etc.), the marginal return to me of doing more than &quot;enough&quot; work was probably negative.</p> <p>Some companies will give very large out-of-band bonuses to people regularly, but that work wasn't at a company that does a lot of that, so there's nothing the company could do to indicate that it valued additional work once someone did &quot;enough&quot; work to get the best possible rating on a performance review. From a <a href="https://en.wikipedia.org/wiki/Mechanism_design">mechanism design</a> point of view, the company was basically asking employees to stop working once they did &quot;enough&quot; work for the year.</p> <p>So even on this team, which was relatively well aligned with the company's success compared to most teams, the company's compensation system imposed a low ceiling on how well the team could be aligned.</p> <p>This also happened in another way. As is common at a lot of companies, managers were given a team-wide budget for raises that was mainly a function of headcount, which was then doled out to team members in a zero-sum way. Unfortunately for each team member (at least in terms of compensation), the team pretty much only had productive engineers, meaning that no one was going to do particularly well in the zero-sum raise game. The team had very low turnover because people like working with good co-workers, but the company was applying one of the biggest levers it has, compensation, to try to get people to leave the team and join less effective teams.</p> <p>Because this is such a common setup, I've heard of managers at multiple companies who try to retain people who are harmless but ineffective to try to work around this problem. If you were to ask someone, abstractly, if the company wants to hire and retain people who are ineffective, I suspect they'd tell you no. But insofar as a company can be said to want anything, it wants what it incentivizes.</p> <h3 id="related">Related</h3> <ul> <li><a href="//danluu.com/programmer-moneyball/">Downsides of cargo-culting trendy hiring practices</a></li> <li><a href="//danluu.com/wat/">Normalization of deviance</a></li> <li><a href="https://thezvi.wordpress.com/2019/05/30/quotes-from-moral-mazes/">Zvi Mowshowitz on Moral Mazes</a>, a book about how corporations have systemic issues that cause misaligned incentives at every level</li> <li><a href="https://www.jefftk.com/p/programmers-should-plan-for-lower-pay#lw-ftocr9cDfYtyQj9x2">&quot;randomsong&quot; on how it's possible to teach almost anybody to program</a>, 
thematically related, the idea being that programming isn't as hard as a lot of programmers would like to believe</li> <li><a href="https://noidea.dog/glue">Tanya Reilly on how &quot;glue work&quot; is poorly incentivized</a>, training being poorly incentivized is arguably a special case of this</li> <li><a href="https://sockpuppet.org/blog/2015/03/06/the-hiring-post/">Thomas Ptacek on using hiring filters that are decently correlated with job performance</a></li> <li><a href="https://mtlynch.io/why-i-quit-google/">Michael Lynch on his personal experience of big company incentives</a></li> <li><a href="https://news.ycombinator.com/item?id=21961560">An anonymous HN commenter on doing almost no work at Google, they say about 10% capacity, for six years and getting promoted</a></li> </ul> <p><small>Thanks to Leah Hanson, Heath Borders, Lifan Zeng, Justin Findlay, Kevin Burke, @chordowl, Peter Alexander, Niels Olson, Kris Shamloo, Chip Thien, Yuri Vishnevsky, and Solomon Boulos for comments/corrections/discussion</small></p> <p><link rel="prefetch" href="//danluu.com"> <link rel="prefetch" href="//danluu.com/programmer-moneyball/"> <link rel="prefetch" href="//danluu.com/wat/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">For one thing, most companies that copy the Google interview don't have that much scale. But even for companies that do, most people don't have jobs where they're designing high-scale algorithms (maybe they did at Google circa 2003, but from what I've seen at three different big tech companies, most people's jobs are pretty light on algorithms work). <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:S"><p>Real is in quotes because I've passed a number of interviews for reasons outside of the interview process. Maybe I had a very strong internal recommendation that could override my interview performance, maybe someone read my blog and assumed that I can do reasonable work based on my writing, maybe someone got a backchannel reference from a former co-worker of mine, or maybe someone read some of my open source code and judged me on that instead of a whiteboard coding question (and as far as I know, that last one has only happened once or twice). I'll usually ask why I got a job offer in cases where I pretty clearly failed the technical interview, so I have a collection of these reasons from folks.</p> <p>The reason it's arguably zero is that the only software interview where I inarguably got a &quot;real&quot; interview and was coming in cold was at Google, but that only happened because the interviewers that were assigned interviewed me for the wrong ladder -- I was interviewing for a hardware position, but I was being interviewed by software folks, so I got what was basically a standard software interview except that one interviewer asked me some questions about state machine and cache coherence (or something like that). After they realized that they'd interviewed me for the wrong ladder, I had a follow-up phone interview from a hardware engineer to make sure I wasn't totally faking having worked at a hardware startup from 2005 to 2013. It's possible that I failed the software part of the interview and was basically hired on the strength of the follow-up phone screen.</p> <p>Note that this refers only to software -- I'm actually pretty good at hardware interviews. 
At this point, I'm pretty out of practice at hardware and would probably need a fair amount of time to ramp up on an actual hardware job, but the interviews are a piece of cake for me. One person who knows me pretty well thinks this is because I &quot;talk like a hardware engineer&quot; and both say things that make hardware folks think I'm legit and say things that sound incredibly stupid to most programmers in a way that's more about shibboleths than actual knowledge or skills.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:P"><p>This one is a bit harder than you'd expect to get in a phone screen, but it wouldn't be out of line in an onsite interview (although a friend of mine once got <a href="https://code.google.com/codejam/contest/2437491/dashboard#s=p2">a Google Code Jam World Finals question</a> in a phone interview with Google, so you might get something this hard or harder, depending on who you draw as an interviewer).</p> <p>BTW, if you're wondering what my friend did when they got that question, it turns out they actually knew the answer because they'd seen and attempted the problem during Google Code Jam. They didn't get the right answer at the time, but they figured it out later just for fun. However, my friend didn't think it was reasonable to give that as a phone screen question and asked the interviewer for another question. The interviewer refused, so my friend failed the phone screen. At the time, I doubt there were more than a few hundred people in the world who would've gotten the right answer to the question in a phone screen and almost all of them probably would've realized that it was an absurd phone screen question. After failing the interview, my friend ended up looking for work for almost six months before passing an interview for a startup where he ended up building a number of core systems (in terms of both business impact and engineering difficulty). My friend is still there after the mid 10-figure IPO -- the company understands how hard it would be to replace this person and treats them very well. None of the other companies that interviewed this person even wanted to hire them at all and they actually had a hard time getting a job.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:E">Outside of egregious architectural issues that will simply cause a service to fall over, the most common way I see teams fix efficiency issues is to ask for more capacity. Some companies try to counterbalance this in some way (e.g., I've heard that at FB, a lot of the teams that work on efficiency improvements report into the capacity org, which gives them the ability to block capacity requests if they observe that a team has extreme inefficiencies that they refuse to fix), but I haven't personally worked in an environment where there's an effective system fix to this. Google had a system that was intended to address this problem that, among other things, involved making headcount fungible with compute resources, but I've heard that was rolled back in favor of a more traditional system for <a href="https://en.wikipedia.org/wiki/Incentive_compatibility">reasons</a>. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Files are fraught with peril deconstruct-files/ Fri, 12 Jul 2019 00:00:00 +0000 deconstruct-files/ <p><small><em>This is a pseudo-transcript for a talk given at Deconstruct 2019. 
<a href="//danluu.com/web-bloat/">To make this accessible for people on slow connections</a> as well as people using screen readers, the slides have been replaced by in-line text (the talk has ~120 slides; at an average of 20 kB per slide, that's 2.4 MB. If you think that's trivial, consider that <a href="https://blogs.microsoft.com/on-the-issues/2019/04/08/its-time-for-a-new-approach-for-mapping-broadband-data-to-better-serve-americans/">half of Americans still aren't on broadband</a> and the situation is much worse in developing countries.</em></small></p> <p>Let's talk about files! Most developers seem to think that files are easy. Just for example, let's take a look at the top reddit r/programming comments from when Dropbox announced that they were only going to support ext4 on Linux (the most widely used Linux filesystem). For people not familiar with reddit r/programming, I suspect r/programming is the most widely read English language programming forum in the world.</p> <p>The top comment reads:</p> <blockquote> <p>I'm a bit confused, why do these applications have to support these file systems directly? Doesn't the kernel itself abstract away from having to know the lower level details of how the files themselves are stored?</p> <p>The only differences I could possibly see between different file systems are file size limitations and permissions, but aren't most modern file systems about on par with each other?</p> </blockquote> <p>The #2 comment (and the top replies going two levels down) are:</p> <blockquote> <p>#2: Why does an application care what the filesystem is?</p> <p>#2: Shouldn't that be abstracted as far as &quot;normal apps&quot; are concerned by the OS?</p> <p>Reply: It's a leaky abstraction. I'm willing to bet each different FS has its own bugs and its own FS specific fixes in the dropbox codebase. More FS's means more testing to make sure everything works right . . .</p> <p>2nd level reply: What are you talking about? This is a dropbox, what the hell does it need from the FS? There are dozenz of fssync tools, data transfer tools, distributed storage software, and everything works fine with inotify. What the hell does not work for dropbox exactly?</p> <p>another 2nd level reply: Sure, but any bugs resulting from should be fixed in the respective abstraction layer, not by re-implementing the whole stack yourself. You shouldn't re-implement unless you don't get the data you need from the abstraction. . . . DropBox implementing FS-specific workarounds and quirks is way overkill. That's like vim providing keyboard-specific workarounds to avoid faulty keypresses. All abstractions are leaky - but if no one those abstractions, nothing will ever get done (and we'd have billions of &quot;operating systems&quot;).</p> </blockquote> <p>In this talk, we're going to look at how file systems differ from each other and other issues we might encounter when writing to files. 
We're going to look at the file &quot;stack&quot; starting at the top with the file API, which we'll see is nearly impossible to use correctly and that supporting multiple filesystems without corrupting data is much harder than supporting a single filesystem; move down to the filesystem, which we'll see has serious bugs that cause data loss and data corruption; and then we'll look at disks and see that disks can easily corrupt data at a rate five million times greater than claimed in vendor datasheets.</p> <h3 id="file-api">File API</h3> <h4 id="writing-one-file">Writing one file</h4> <p>Let's say we want to write a file safely, so that we don't get data corruption. For the purposes of this talk, this means we'd like our write to be &quot;atomic&quot; -- our write should either fully complete, or we should be able to undo the write and end up back where we started. Let's look at an example from Pillai et al., OSDI’14.</p> <p>We have a file that contains the text <code>a foo</code> and we want to overwrite <code>foo</code> with <code>bar</code> so we end up with <code>a bar</code>. We're going to make a number of simplifications. For example, you should probably think of each character we're writing as a sector on disk (or, if you prefer, you can imagine we're using a hypothetical advanced NVM drive). Don't worry if you don't know what that means; I'm just pointing this out to note that this talk is going to contain many simplifications, which I'm not going to call out because we only have twenty-five minutes and the unsimplified version of this talk would probably take about three hours.</p> <p>To write, we might use the <code>pwrite</code> syscall. This is a function provided by the operating system to let us interact with the filesystem. Our invocation of this syscall looks like:</p> <pre><code>pwrite([file],
       &quot;bar&quot;, // data to write
       3,     // write 3 bytes
       2)     // at offset 2
</code></pre> <p><code>pwrite</code> takes the file we're going to write, the data we want to write, <code>bar</code>, the number of bytes we want to write, <code>3</code>, and the offset where we're going to start writing, <code>2</code>. If you're used to using a high-level language, like Python, you might be used to an interface that looks different, but underneath the hood, when you write to a file, it's eventually going to result in a syscall like this one, which is what will actually write the data into a file.</p> <p>If we just call <code>pwrite</code> like this, we might succeed and get <code>a bar</code> in the output, or we might end up doing nothing and getting <code>a foo</code>, or we might end up with something in between, like <code>a boo</code>, <code>a bor</code>, etc.</p> <p>What's happening here is that we might crash or lose power when we write. Since <code>pwrite</code> isn't guaranteed to be atomic, if we crash, we can end up with some fraction of the write completing, causing data corruption. One way to avoid this problem is to store an &quot;undo log&quot; that will let us restore corrupted data. Before we modify the file, we'll make a copy of the data that's going to be modified (into the undo log), then we'll modify the file as normal, and if nothing goes wrong, we'll delete the undo log.</p> <p>If we crash while we're writing the undo log, that's fine -- we'll see that the undo log isn't complete and we know that we won't have to restore because we won't have started modifying the file yet. If we crash while we're modifying the file, that's also ok. 
When we try to restore from the crash, we'll see that the undo log is complete and we can use it to recover from data corruption:</p> <pre><code>creat(/d/log)                 // Create undo log
write(/d/log, &quot;2,3,foo&quot;, 7)   // To undo, at offset 2, write 3 bytes, &quot;foo&quot;
pwrite(/d/orig, &quot;bar&quot;, 3, 2)  // Modify original file as before
unlink(/d/log)                // Delete log file
</code></pre> <p>If we're using <code>ext3</code> or <code>ext4</code>, widely used Linux filesystems, and we're using the mode <code>data=journal</code> (we'll talk about what these modes mean later), here are some possible outcomes we could get:</p> <pre><code>d/log: &quot;2,3,f&quot;
d/orig: &quot;a foo&quot;

d/log: &quot;&quot;
d/orig: &quot;a foo&quot;
</code></pre> <p>It's possible we'll crash while the log file write is in progress and we'll have an incomplete log file. In the first case above, we know that the log file isn't complete because the file says we should start at offset <code>2</code> and write <code>3</code> bytes, but only one byte, <code>f</code>, is specified, so the log file must be incomplete. In the second case above, we can tell the log file is incomplete because the undo log format should start with an offset and a length, but we have neither. Either way, since we know that the log file isn't complete, we know that we don't need to restore.</p> <p>Another possible outcome is something like:</p> <pre><code>d/log: &quot;2,3,foo&quot;
d/orig: &quot;a boo&quot;

d/log: &quot;2,3,foo&quot;
d/orig: &quot;a bar&quot;
</code></pre> <p>In the first case, the log file is complete and we crashed while writing the file. This is fine, since the log file tells us how to restore to a known good state. In the second case, the write completed, but since the log file hasn't been deleted yet, we'll restore from the log file.</p> <p>If we're using <code>ext3</code> or <code>ext4</code> with <code>data=ordered</code>, we might see something like:</p> <pre><code>d/log: &quot;2,3,fo&quot;
d/orig: &quot;a boo&quot;

d/log: &quot;&quot;
d/orig: &quot;a bor&quot;
</code></pre> <p>With <code>data=ordered</code>, there's no guarantee that the <code>write</code> to the log file and the <code>pwrite</code> that modifies the original file will execute in program order. Instead, we could get</p> <pre><code>creat(/d/log)                 // Create undo log
pwrite(/d/orig, &quot;bar&quot;, 3, 2)  // Modify file before writing undo log!
write(/d/log, &quot;2,3,foo&quot;, 7)   // Write undo log
unlink(/d/log)                // Delete log file
</code></pre> <p>To prevent this re-ordering, we can use another syscall, <code>fsync</code>. <code>fsync</code> is a barrier (prevents re-ordering) and it flushes caches (which we'll talk about later).</p> <pre><code>creat(/d/log)
write(/d/log, &quot;2,3,foo&quot;, 7)
fsync(/d/log)                 // Add fsync to prevent re-ordering
pwrite(/d/orig, &quot;bar&quot;, 3, 2)
fsync(/d/orig)                // Add fsync to prevent re-ordering
unlink(/d/log)
</code></pre> <p>This works with <code>ext3</code> or <code>ext4</code> in <code>data=ordered</code> mode, but if we use <code>data=writeback</code>, we might see something like:</p> <pre><code>d/log: &quot;2,3,WAT&quot;
d/orig: &quot;a boo&quot;
</code></pre> <p>Unfortunately, with <code>data=writeback</code>, the <code>write</code> to the log file isn't guaranteed to be atomic and the filesystem metadata that tracks the file length can get updated before we've finished writing the log file, which will make it look like the log file contains whatever bits happened to be on disk where the log file was created. 
Since the log file exists, when we try to restore after a crash, we may end up &quot;restoring&quot; random garbage into the original file. To prevent this, we can add a checksum (a way of making sure the file is actually valid) to the log file.</p> <pre><code>creat(/d/log)
write(/d/log, &quot;…[✓∑],foo&quot;, 7)  // Add checksum to log file to detect incomplete log file
fsync(/d/log)
pwrite(/d/orig, &quot;bar&quot;, 3, 2)
fsync(/d/orig)
unlink(/d/log)
</code></pre> <p>This should work with <code>data=writeback</code>, but we could still see the following:</p> <pre><code>d/orig: &quot;a boo&quot;
</code></pre> <p>There's no log file, even though we created the file, wrote to it, and then fsync'd it! Unfortunately, there's no guarantee that the directory will actually store the location of the file if we crash. In order to make sure we can easily find the file when we restore from a crash, we need to fsync the parent of the newly created log.</p> <pre><code>creat(/d/log)
write(/d/log, &quot;…[✓∑],foo&quot;, 7)
fsync(/d/log)
fsync(/d)                      // fsync parent directory
pwrite(/d/orig, &quot;bar&quot;, 3, 2)
fsync(/d/orig)
unlink(/d/log)
</code></pre> <p>There are a couple more things we should do. We should also fsync after we're done (not shown), and we also need to check for errors. These syscalls can return errors and those errors need to be handled appropriately. There's at least one filesystem issue that makes this very difficult, but since that's not an API usage thing per se, we'll look at this again in the <strong>Filesystems</strong> section.</p>
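<p>To put the whole sequence in one place, here's a rough sketch of the same steps in Python. This mirrors the example above (undo log with a checksum, fsync on the log, on the log's parent directory, and on the original file), but the paths, the log format, and the helper name are just illustrative, error handling and the final fsync after the unlink are omitted, and it shouldn't be read as a guarantee of safety on any particular filesystem:</p> <pre><code>import os, zlib

def safe_pwrite(orig_path, data, offset, log_path='/d/log'):
    # Save the bytes we're about to overwrite, plus a checksum, to an undo log
    fd = os.open(orig_path, os.O_RDWR)
    old = os.pread(fd, len(data), offset)
    record = b'%d,%d,%s' % (offset, len(old), old)
    record += b'|%d' % zlib.crc32(record)

    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(log_fd, record)
    os.fsync(log_fd)                      # make the log contents durable
    os.close(log_fd)

    dir_fd = os.open(os.path.dirname(log_path), os.O_RDONLY)
    os.fsync(dir_fd)                      # make the log's directory entry durable
    os.close(dir_fd)

    os.pwrite(fd, data, offset)           # modify the original file
    os.fsync(fd)                          # flush the modification
    os.close(fd)

    os.unlink(log_path)                   # delete the undo log
</code></pre>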
<p>We've now seen what we have to do to write a file safely. It might be more complicated than we like, but it seems doable -- if someone asks you to write a file in a self-contained way, like an interview question, and you know the appropriate rules, you can probably do it correctly. But what happens if we have to do this as a day-to-day part of our job, where we'd like to write to files safely every time we write to files in a large codebase?</p> <h4 id="api-in-practice">API in practice</h4> <p>Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.</p> <p>When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?</p> <h5 id="concurrent-programming-is-hard">Concurrent programming is hard</h5> <p>There are a number of reasons for this. If you ask people &quot;what are hard problems in programming?&quot;, you'll get answers like distributed systems, concurrent programming, security, aligning things with CSS, dates, etc.</p> <p>And if we look at what mistakes cause bugs when people do concurrent programming, we see bugs come from things like &quot;incorrectly assuming operations are atomic&quot; and &quot;incorrectly assuming operations will execute in program order&quot;. These things that make concurrent programming hard also make writing files safely hard -- we saw examples of both of these kinds of bugs in our first example. More generally, many of the same things that make concurrent programming hard are the same things that make writing to files safely hard, so of course we should expect that writing to files is hard!</p> <p>Another property writing to files safely shares with concurrent programming is that it's easy to write code that has infrequent, non-deterministic failures. With respect to files, people will sometimes say this makes things easier (&quot;I've never noticed data corruption&quot;, &quot;your data is still mostly there most of the time&quot;, etc.), but if you want to write files safely because you're working on software that shouldn't corrupt data, this makes things more difficult by making it more difficult to tell if your code is really correct.</p> <h5 id="api-inconsistent">API inconsistent</h5> <p>As we saw in our first example, even when using one filesystem, different modes may have significantly different behavior. Large parts of the file API look like this, where behavior varies across filesystems or across different modes of the same filesystem. For example, if we look at mainstream filesystems, appends are atomic, except when using <code>ext3</code> or <code>ext4</code> with <code>data=writeback</code>, or <code>ext2</code> in any mode; and directory operations can't be re-ordered w.r.t. any other operations, except on <code>btrfs</code>. In theory, we should all read the POSIX spec carefully and make sure all our code is valid according to POSIX, but, if they check filesystem behavior at all, people tend to code to what their filesystem does and not some abstract spec.</p> <p>If we look at one particular mode of one filesystem (<code>ext4</code> with <code>data=journal</code>), that seems relatively possible to handle safely, but when writing for a variety of filesystems, especially when handling filesystems that are very different from <code>ext3</code> and <code>ext4</code>, like <code>btrfs</code>, it becomes very difficult for people to write correct code.</p> <h4 id="docs-unclear">Docs unclear</h4> <p>In our first example, we saw that we can get different behavior from using different <code>data=</code> modes. If we look at the manpage (manual) on what these modes mean in <code>ext3</code> or <code>ext4</code>, we get:</p> <blockquote> <p>journal: All data is committed into the journal prior to being written into the main filesystem.</p> <p>ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.</p> <p>writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. <strong>This is rumoured to be</strong> the highest-throughput option. 
It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.</p> </blockquote> <p>If you want to know how to use your filesystem safely, and you don't already know what a journaling filesystem is, this definitely isn't going to help you. If you know what a journaling filesystem is, this will give you some hints but it's still not sufficient. It's theoretically possible to figure everything out from reading the source code, but this is pretty impractical for most people who don't already know how the filesystem works.</p> <p>For English-language documentation, there's lwn.net and the Linux kernel mailing list (LKML). LWN is great, but they can't keep up with everything, so LKML is the place to go if you want something comprehensive. Here's an example of an exchange on LKML about filesystems:</p> <p><strong>Dev 1</strong>: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.<br> <strong>Dev 2</strong>: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.<br> <strong>Dev 1</strong>: Where is that documented?<br> <strong>Dev 2</strong>: linux-kernel mailing list archives.<br> <strong>FS dev</strong>: Probably from some 6-8 years ago, in e-mail postings that I made.<br></p> <p>While the filesystem developers tend to be helpful and they write up informative responses, most people probably don't keep up with the past 6-8 years of LKML.</p> <h4 id="performance-correctness-conflict">Performance / correctness conflict</h4> <p>Another issue is that the file API has an inherent conflict between performance and correctness. We noted before that <code>fsync</code> is a barrier (which we can use to enforce ordering) and that it flushes caches. If you've ever worked on the design of a high-performance cache, like a microprocessor cache, you'll probably find the bundling of these two things into a single primitive to be unusual. A reason this is unusual is that flushing caches has a significant performance cost and there are many cases where we want to enforce ordering without paying this performance cost. Bundling these two things into a single primitive forces us to pay the cache flush cost when we only care about ordering.</p> <p>Chidambaram et al., SOSP’13 looked at the performance cost of this by modifying <code>ext4</code> to add a barrier mechanism that doesn't flush caches. They found that, if they modified software appropriately and used their barrier operation where a full <code>fsync</code> wasn't necessary, they were able to achieve performance roughly equivalent to <code>ext4</code> with cache flushing entirely disabled (which is unsafe and can lead to data corruption) without sacrificing safety. However, making your own filesystem and getting it adopted is impractical for most people writing user-level software. Some databases will bypass the filesystem entirely or almost entirely, but this is also impractical for most software.</p> <p>That's the file API. Now that we've seen that it's extraordinarily difficult to use, let's look at filesystems.</p> <h3 id="filesystem">Filesystem</h3> <p>If we want to make sure that filesystems work, one of the most basic tests we could do is to inject errors at the layer below the filesystem to see if the filesystem handles them properly. 
For example, on a write, we could have the disk fail to write the data and return the appropriate error. If the filesystem drops this error or doesn't handle this properly, that means we have data loss or data corruption. This is analogous to the kinds of distributed systems faults Kyle Kingsbury talked about in his distributed systems testing talk yesterday (although these kinds of errors are much more straightforward to test).</p> <p>Prabhakaran et al., SOSP’05 did this and found that, for most filesystems tested, almost all write errors were dropped. The major exception to this was on ReiserFS, which did a pretty good job with all types of errors tested, but ReiserFS isn't really used today for reasons beyond the scope of this talk.</p> <p><a href="//danluu.com/filesystem-errors/">We (Wesley Aptekar-Cassels and I) looked at this again in 2017</a> and found that things had improved significantly. Most filesystems (other than JFS) could pass these very basic tests on error handling.</p> <p>Another way to look for errors is to look at filesystems code to see if it handles internal errors correctly. Gunawi et al., FAST’08 did this and found that internal errors were dropped a significant percentage of the time. The technique they used made it difficult to tell if functions that could return many different errors were correctly handling each error, so they also looked at calls to functions that can only return a single error. In those cases, errors were dropped roughly 2/3 to 3/4 of the time, depending on the function.</p> <p>Wesley and I also looked at this again in 2017 and found significant improvement -- errors for the same functions Gunawi et al. looked at were &quot;only&quot; ignored 1/3 to 2/3 of the time, depending on the function.</p> <p>Gunawi et al. also looked at comments near these dropped errors and found comments like &quot;Just ignore errors at this point. There is nothing we can do except to try to keep going.&quot; (XFS) and &quot;Error, skip block and hope for the best.&quot; (ext3).</p> <p>Now we've seen that while filesystems used to drop even the most basic errors, they now handle them correctly, but there are some code paths where errors can get dropped. For a concrete example of a case where this happens, let's look back at our first example. If we get an <a href="https://lwn.net/Articles/752063/">error on <code>fsync</code></a>, unless we have a pretty recent Linux kernel (Q2 2018-ish), there's a pretty good chance that the error will be dropped and it may even get reported to the wrong process!</p>
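<p>Noticing the failure at all requires actually checking what <code>fsync</code> returns. In Python, that looks roughly like the sketch below; the handler at the bottom is a made-up placeholder, because, as discussed next, there often isn't much that can be done:</p> <pre><code>import os

fd = os.open('/d/orig', os.O_WRONLY)
os.pwrite(fd, b'bar', 2)
try:
    os.fsync(fd)                  # may raise OSError if the write-back failed
except OSError as err:
    # By the time we see this, the kernel may have already dropped the dirty
    # data, so retrying the fsync isn't enough to make the write safe.
    handle_unrecoverable_write_error(err)   # placeholder, not a real function
</code></pre>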
<p>On recent Linux kernels, there's a good chance the error will be reported (to the correct process, even). Wilcox, PGCon’18 notes that an error on <code>fsync</code> is basically unrecoverable. The details depend on the filesystem -- on <code>XFS</code> and <code>btrfs</code>, modified data that's in the filesystem will get thrown away and there's no way to recover. On <code>ext4</code>, the data isn't thrown away, but it's marked as unmodified, so the filesystem won't try to write it back to disk later, and if there's memory pressure, the data can be thrown out at any time. If you're feeling adventurous, you can try to recover the data before it gets thrown out with various tricks (e.g., by forcing the filesystem to mark it as modified again, or by writing it out to another device, which will force the filesystem to write the data out even though it's marked as unmodified), but there's no guarantee you'll be able to recover the data before it's thrown out. On Linux <code>ZFS</code>, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.</p> <p>In general, there isn't a good way to recover from this on Linux. Postgres, MySQL, and MongoDB (widely used databases) will crash themselves and the user is expected to restore from the last checkpoint. Most software will probably just silently lose or corrupt data. And <code>fsync</code> is a relatively good case -- for example, <code>syncfs</code> simply doesn't return errors on Linux at all, leading to silent data loss and data corruption.</p> <p>BTW, when Craig Ringer first proposed that Postgres should crash on <code>fsync</code> error, the <a href="//danluu.com/fsyncgate/">first response on the Postgres dev mailing list</a> was:</p> <blockquote> <p>Surely you jest . . . If [current behavior of fsync] is actually the case, we need to push back on this kernel brain damage</p> </blockquote> <p>But after talking through the details, everyone agreed that crashing was the only good option. One of the many unfortunate things is that most disk errors are transient. Since the filesystem discards critical information that's necessary to proceed without data corruption on any error, transient errors that could be retried instead force software to take drastic measures.</p> <p>And while we've talked about Linux, this isn't unique to Linux. Fsync error handling (and error handling in general) is broken on many different operating systems. At the time Postgres &quot;discovered&quot; the behavior of fsync on Linux, FreeBSD had arguably correct behavior, but OpenBSD and NetBSD behaved the same as Linux (true error status dropped, retrying causes success response, data lost). This has been fixed on OpenBSD and probably some other BSDs, but Linux still basically has the same behavior and you don't have good guarantees that this will work on any random UNIX-like OS.</p> <p>Now that we've seen that, for many years, filesystems failed to handle errors in some of the most straightforward and simple cases and that there are cases that still aren't handled correctly today, let's look at disks.</p> <h3 id="disk">Disk</h3> <h4 id="flushing">Flushing</h4> <p>We've seen that it's easy to not realize we have to call <code>fsync</code> when we have to call <code>fsync</code>, and that even if we call <code>fsync</code> appropriately, bugs may prevent <code>fsync</code> from actually working. Rajimwale et al., DSN’11 looked into whether or not disks actually flush when you ask them to flush, assuming everything above the disk works correctly (their paper is actually mostly about something else; they just discuss this briefly at the beginning). Someone from Microsoft anonymously told them &quot;[Some disks] do not allow the file system to force writes to disk properly&quot; and someone from Seagate, a disk manufacturer, told them &quot;[Some disks (though none from us)] do not allow the file system to force writes to disk properly&quot;. 
Bairavasundaram et al., FAST’07 also found the same thing when they looked into disk reliability.</p> <h4 id="error-rates">Error rates</h4> <p>We've seen that filesystems sometimes don't handle disk errors correctly. If we want to know how serious this issue is, we should look at the rate at which disks emit errors. Disk datasheets will usually list an uncorrectable bit error rate of 1e-14 for consumer HDDs (often called spinning metal or spinning rust disks), 1e-15 for enterprise HDDs, 1e-15 for consumer SSDs, and 1e-16 for enterprise SSDs. This means that, on average, we expect to see one unrecoverable data error every 1e14 bits we read on an HDD.</p> <p>To get an intuition for what this means in practice, 1TB is now a pretty normal disk size. If we read a full drive once, that's 1e12 bytes, or almost 1e13 bits (technically 8e12 bits), which means we should see, in expectation, one unrecoverable error if we buy a 1TB HDD and read the entire disk ten-ish times. Nowadays, we can buy 10TB HDDs, in which case we'd expect to see an error (technically, 8/10ths of an error) on every read of an entire consumer HDD.</p> <p>In practice, observed error rates are significantly higher. Narayanan et al., SYSTOR’16 (Microsoft) observed SSD error rates from 1e-11 to 6e-14, depending on the drive model. Meza et al., SIGMETRICS’15 (FB) observed even worse SSD error rates, 2e-9 to 6e-11 depending on the model of drive. An error rate of 2e-9 is one error every 5e8 bits, or roughly one error per 60 MB read -- depending on the class of drive, these observed rates are 500 thousand to 5 million times worse than stated on datasheets.</p> <p>Bit error rate is arguably a bad metric for disk drives, but this is the metric disk vendors claim, so that's what we have to compare against if we want an apples-to-apples comparison. See Bairavasundaram et al., SIGMETRICS'07, Schroeder et al., FAST'16, and others for other kinds of error rates.</p> <p>One thing to note is that it's often claimed that SSDs don't have problems with corruption because they use error correcting codes (ECC), which can fix data corruption issues. &quot;Flash banishes the specter of the unrecoverable data error&quot;, etc. The thing this misses is that modern high-density flash devices are very unreliable and need ECC to be usable at all. Grupp et al., FAST’12 looked at error rates of the kind of flash that underlies SSDs and found error rates from 1e-1 to 1e-8. 1e-1 is one error every ten bits, 1e-8 is one error every 100 megabits.</p>
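<p>To make the gap between datasheet and observed rates concrete, here's the back-of-the-envelope arithmetic from this section in code form (the rates are the ones quoted above; which rate applies to a given drive depends on the drive):</p> <pre><code>DATASHEET_HDD_BER = 1e-14   # claimed uncorrectable errors per bit read (consumer HDD)
OBSERVED_SSD_BER  = 2e-9    # worst observed rate quoted above

bits_per_full_read = 1e12 * 8           # reading a 1TB drive once

print(bits_per_full_read * DATASHEET_HDD_BER)  # ~0.08 expected errors per full read
print(bits_per_full_read * OBSERVED_SSD_BER)   # ~16,000 expected errors per full read
print(1 / OBSERVED_SSD_BER / 8 / 1e6)          # ~62 MB read, on average, between errors
</code></pre>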
<h4 id="power-loss">Power loss</h4> <p>Another claim you'll hear is that SSDs are safe against power loss and some types of crashes because they now have &quot;power loss protection&quot; -- there's some mechanism in the SSDs that can hold power for long enough during an outage that the internal SSD cache can be written out safely.</p> <p><a href="http://lkcl.net/reports/ssd_analysis.html">Luke Leighton tested this</a> by buying 6 SSDs that claim to have power loss protection and found that four out of the six models of drive he tested failed (every drive that wasn't an Intel drive). If we look at the details of the tests, when drives fail, it appears to be because they were used in a way that the implementor of power loss protection didn't expect (writing &quot;too fast&quot;, although well under the rate at which the drive is capable of writing, or writing &quot;too many&quot; files in parallel). When a drive advertises that it has power loss protection, this appears to mean that someone spent some amount of effort implementing something that will, under some circumstances, prevent data loss or data corruption under power loss. But, as we saw in Kyle's talk yesterday on distributed systems, if you want to make sure that the mechanism actually works, you can't rely on the vendor to do rigorous or perhaps even any semi-serious testing and you have to test it yourself.</p> <h4 id="retention">Retention</h4> <p>If we look at SSD datasheets, a young-ish drive (one with 90% of its write cycles remaining) will usually be specced to hold data for about ten years after a write. If we look at a worn out drive, one very close to end-of-life, it's specced to retain data for one year to three months, depending on the class of drive. I think people are often surprised to find that it's within spec for a drive to lose data three months after the data is written.</p> <p>These numbers all come from datasheets and specs and, as we've seen, datasheets can be a bit optimistic. On many early SSDs, using up most or all of a drive's write cycles would cause the drive to brick itself, so you wouldn't even get the spec'd three month data retention.</p> <h3 id="corollaries">Corollaries</h3> <p>Now that we've seen that there are significant problems at every level of the file stack, let's look at a couple things that follow from this.</p> <h4 id="what-to-do">What to do?</h4> <p>What we should do about this is a big topic. In the time we have left, one thing we can do instead of writing to files is to use databases. If you want something lightweight and simple that you can use in most places you'd use a file, SQLite is pretty good. I'm not saying you should never use files. There is a tradeoff here. But if you have an application where you'd like to reduce the rate of data corruption, consider using a database to store data instead of using files.</p> <h4 id="fs-support">FS support</h4> <p>At the start of this talk, we looked at this Dropbox example, where most people thought that there was no reason to remove support for most Linux filesystems because filesystems are all the same. I believe their hand was forced by the way they want to store/use data, which they can only do with <code>ext</code> given how they're doing things (which is arguably a mis-feature), but even if that wasn't the case, perhaps you can see why software that's attempting to sync data to disk reliably and with decent performance might not want to support every single filesystem in the universe for an OS that, for their product, is relatively niche. Maybe it's worth supporting every filesystem for PR reasons and then going through the contortions necessary to avoid data corruption on a per-filesystem basis (you can try coding straight to your reading of the POSIX spec, but as we've seen, that won't save you on Linux), but the PR problem is caused by a misunderstanding.</p> <p>The other comment we looked at on reddit, and also a common sentiment, is that it's not a program's job to work around bugs in libraries or the OS. But user data gets corrupted regardless of whose &quot;fault&quot; the bug is, and as we've seen, bugs can persist in the filesystem layer for many years. 
In the case of <code>Linux</code>, most filesystems other than <code>ZFS</code> seem to have decided it's correct behavior to throw away data on fsync error and also not report that the data can't be written (as opposed to <code>FreeBSD</code> or <code>OpenBSD</code>, where most filesystems will at least report an error on subsequent <code>fsync</code>s if the error isn't resolved). This is arguably a bug and also arguably correct behavior, but either way, if your software doesn't take this into account, you're going to lose or corrupt data. If you want to take the stance that it's not your fault that the filesystem is corrupting data, your users are going to pay the cost for that.</p> <h3 id="faq">FAQ</h3> <p>While putting this talk together, I read a bunch of different online discussions about how to write to files safely. For discussions outside of specialized communities (e.g., LKML, the Postgres mailing list, etc.), many people will drop by to say something like &quot;why is everyone making this so complicated? You can do this very easily and completely safely with this one weird trick&quot;. Let's look at the most common &quot;one weird trick&quot;s from two thousand internet comments on how to write to disk safely.</p> <h4 id="rename">Rename</h4> <p>The most frequently mentioned trick is to rename instead of overwriting. If you remember our single-file write example, we made a copy of the data that we wanted to overwrite before modifying the file. The trick here is to do the opposite:</p> <ol> <li>Make a copy of the entire file</li> <li>Modify the copy</li> <li>Rename the copy on top of the original file</li> </ol> <p>This trick doesn't work. People seem to think that this is safe because the POSIX spec says that <code>rename</code> is atomic, but that only means <code>rename</code> is atomic with respect to normal operation; that doesn't mean it's atomic on crash. This isn't just a theoretical problem; if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.</p> <p>The most mainstream exception where rename is atomic on crash is probably <code>btrfs</code>, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, <code>rename</code> is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on <code>btrfs</code>, some quite old and some introduced the same year as the paper, so you may not want to rely on this without extensive testing, even if you're writing <code>btrfs</code>-specific code.</p> <p>And even if this worked, the performance of this technique is quite poor.</p> <h4 id="append">Append</h4> <p>The second most frequently mentioned trick is to only ever append (instead of sometimes overwriting). This also doesn't work. As noted in Pillai et al., OSDI’14 and Bornholt et al., ASPLOS’16, appends don't guarantee ordering or atomicity and believing that appends are safe is the cause of some bugs.</p> <h4 id="one-weird-tricks">One weird tricks</h4> <p>We've seen that the most commonly cited simple tricks don't work. 
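In contrast, the suggestion from the &quot;What to do?&quot; section above -- handing the problem to a database -- does mostly work for lightweight use, because the database takes on the journaling and fsync discipline for you. A minimal sketch of what that can look like with SQLite from Python (the database filename, table, and values here are made up for illustration):</p> <pre><code>import sqlite3

conn = sqlite3.connect('app.db')   # 'app.db' is an example path
conn.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')
with conn:                         # commits on success, rolls back on error
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('greeting', 'a bar'))
conn.close()
</code></pre> <p>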
Something I find interesting is that, in these discussions, people will drop into a discussion where it's already been explained, often in great detail, why writing to files is harder than someone might naively think, ignore all of the warnings and explanations, and still proceed with their explanation of why it's, in fact, really easy. Even when warned that files are harder than people think, people still think they're easy!</p> <h3 id="conclusion">Conclusion</h3> <p>In conclusion, computers don't work (but you probably already know this if you're here at Gary-conf). This talk happened to be about files, but there are many areas we could've looked into where we would've seen similar things.</p> <p>One thing I'd like to note before we finish is that, IMO, the underlying problem isn't technical. If you look at what huge tech companies do (companies like FB, Amazon, MS, Google, etc.), they often handle writes to disk pretty safely. They'll make sure that they have disks where power loss protection actually works, they'll have patches into the OS and/or other instrumentation to make sure that errors get reported correctly, there will be large distributed storage groups to make sure data is replicated safely, etc. We know how to make this stuff pretty reliable. It's hard, and it takes a lot of time and effort, i.e., a lot of money, but it can be done.</p> <p>If you ask someone who works on that kind of thing why they spend mind-boggling sums of money to ensure (or really, increase the probability of) correctness, you'll often get an answer like &quot;we have a zillion machines and if you do the math on the rate of data corruption, if we didn't do all of this, we'd have data corruption every minute of every day. It would be totally untenable&quot;. A huge tech company might have, what, order of ten million machines? The funny thing is, if you do the math for how many consumer machines there are out there and how much consumer software runs on unreliable disks, the math is similar. There are many more consumer machines; they're typically operated at much lighter load, but there are enough of them that, if you own a widely used piece of desktop/laptop/workstation software, the math on data corruption is pretty similar. Without &quot;extreme&quot; protections, we should expect to see data corruption all the time.</p> <p>But if we look at how consumer software works, it's usually quite unsafe with respect to handling data. IMO, the key difference here is that when a huge tech company loses data, whether that's data on who's likely to click on which ads or user emails, the company pays the cost, directly or indirectly, and the cost is large enough that it's obviously correct to spend a lot of effort to avoid data loss. But when consumers have data corruption on their own machines, they're mostly not sophisticated enough to know who's at fault, so the company can avoid taking the brunt of the blame. If we have a global optimization function, the math is the same -- of course we should put more effort into protecting data on consumer machines. But if we're a company that's locally optimizing for our own benefit, the math works out differently and maybe it's not worth it to spend a lot of effort on avoiding data corruption.</p> <p>Yesterday, Ramsey Nasser gave a talk where he made a very compelling case that something was a serious problem, which was followed up by a comment that his proposed solution will have a hard time getting adoption. 
I agree with both parts -- he discussed an important problem, and it's not clear how solving that problem will make anyone a lot of money, so the problem is likely to go unsolved.</p> <p>With GDPR, we've seen that regulation can force tech companies to protect people's privacy in a way they're not naturally inclined to do, but regulation is a very big hammer and the unintended consequences can often negate or more than negate the benefits of regulation. When we look at the history of regulations that are designed to force companies to do the right thing, we can see that it's often many years, sometimes decades, before the full impact of the regulation is understood. Designing good regulations is hard, much harder than any of the technical problems we've discussed today.</p> <h3 id="acknowledgements">Acknowledgements</h3> <p>Thanks to Leah Hanson, Gary Bernhardt, Kamal Marhubi, Rebecca Isaacs, Jesse Luehrs, Tom Crayford, Wesley Aptekar-Cassels, Rose Ames, chozu@fedi.absturztau.be, and Benjamin Gilbert for their help with this talk!</p> <p>Sorry we went so fast. If there's anything you missed, you can catch it in the pseudo-transcript at danluu.com/deconstruct-files.</p> <p><small><em>This &quot;transcript&quot; is pretty rough since I wrote it up very quickly this morning before the talk. I'll try to clean it up within a few weeks, which will include adding material that was missed, inserting links, fixing typos, adding references that were missed, etc.</em></p> <p><em>Thanks to Anatole Shaw, Jernej Simoncic, @junh1024, Yuri Vishnevsky, and Josh Duff for comments/corrections/discussion on this transcript.</em></small></p> <p><link rel="prefetch" href="//danluu.com"> <link rel="prefetch" href="//danluu.com/fsyncgate/"></p> Randomized trial on gender in Overwatch overwatch-gender/ Tue, 19 Feb 2019 00:00:00 +0000 overwatch-gender/ <p>A recurring discussion in Overwatch (as well as other online games) is whether or not women are treated differently from men. If you do a quick search, you can find hundreds of discussions about this, some of which have well over a thousand comments. These discussions tend to go the same way and involve the same debate every time, with the same points being made on both sides. Just for example, take <a href="https://www.reddit.com/r/Overwatch/comments/8hvmih/the_girl_problem_an_open_letter_to_the_overwatch/">these</a> <a href="https://www.reddit.com/r/Overwatch/comments/8i4wvs/a_response_to_the_girl_problem_post_moral/">three</a> <a href="https://www.reddit.com/r/Overwatch/comments/8i7cbx/when_we_call_talking_about_sexism_in_overwatch/">threads</a> on reddit that spun out of a single post, which have a total of 10.4k comments. On one side, you have people saying &quot;sure, women get trash talked, but I'm a dude and I get trash talked, everyone gets trash talked there's no difference&quot;, &quot;I've never seen this, it can't be real&quot;, etc., and on the other side you have people saying things like &quot;when I play with my boyfriend, I get accused of being carried by him all the time but the reverse never happens&quot;, &quot;people regularly tell me I should play mercy[, a character that's a female healer]&quot;, and so on and so forth. 
In less time than has been spent on a single large discussion, we could just run the experiment, so here it is.</p> <p>This is the result of playing 339 games in the two main game modes, quick play (QP) and competitive (comp), where roughly half the games were played with a masculine name (where the username was a generic term for a man) and half were played with a feminine name (where the username was a woman's name). I recorded all of the comments made in each of the games and then classified the comments by type. Classes of comments were &quot;sexual/gendered comments&quot;, &quot;being told how to play&quot;, &quot;insults&quot;, and &quot;compliments&quot;.</p> <p>I decided whether or not to include each game in the experiment before the character selection screen loaded. In games that were included, I used the same character selection algorithm, I wouldn't mute anyone for spamming chat or being a jerk, I didn't speak on voice chat (although I had it enabled), I never sent friend requests, and I was playing outside of a group in order to get matched with 5 random players. When playing normally, I might choose a character I don't know how to use well and I'll mute people who pollute chat with bad comments. There are a lot of games that weren't included in the experiment because I wasn't in a mood to listen to someone rage at their team for fifteen minutes and the procedure I used involved pre-committing to not muting people who do that.</p> <h3 id="sexual-or-sexually-charged-comments">Sexual or sexually charged comments</h3> <p>I thought I'd see more sexual comments when using the feminine name as opposed to the masculine name, but that turned out to not be the case. There was some mention of sex, genitals, etc., in both cases and the rate wasn't obviously different and was actually higher in the masculine condition.</p> <p>Zero games featured comments directed specifically at me in the masculine condition, and two (out of 184) games in the feminine condition featured comments that were directed at me. Most comments were either directed at other players or were just general comments to team or game chat.</p> <p>Examples of typical undirected comments that would occur in either condition include &quot;my girlfriend keeps sexting me how do I get her to stop?&quot;, &quot;going in balls deep&quot;, &quot;what a surprise. *strokes dick* [during the post-game highlight]&quot;, and &quot;support your local boobies&quot;.</p> <p>The two games that featured sexual comments directed at me had the following comments:</p> <ul> <li>&quot;please mam can i have some coochie&quot;, &quot;yes mam please&quot; [from two different people], &quot;:boicootie:&quot;</li> <li>&quot;my dicc hard&quot; [believed to be directed at me from context]</li> </ul> <p>During games not included in the experiment (I generally didn't pay attention to which username I was on when not in the experiment), I also got comments like &quot;send nudes&quot;. 
Anecdotally, there appears to be a difference in the rate of these kinds of comments directed at the player, but the rate observed in the experiment is so low that <a href="https://statmodeling.stat.columbia.edu/2010/12/21/lets_say_uncert/">uncertainty intervals</a> around any estimates of the true rate will be similar in both conditions unless we use a <a href="https://en.wikipedia.org/wiki/Strong_prior">strong prior</a>.</p> <p>The fact that this difference couldn't be observed in 339 games was surprising to me, although it's not inconsistent with <a href="https://aquila.usm.edu/cgi/viewcontent.cgi?article=1429&amp;context=honors_theses">McDaniel's thesis, a survey of women who play video games</a>. 339 games probably sounds like a small number to serious gamers, but the only other randomized experiment I know of on this topic (besides this experiment) is <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0131613">Kasumovic et al.</a>, which notes that &quot;[w]e stopped at 163 [games] as this is a substantial time effort&quot;.</p> <p>All of the analysis uses the number of games in which a type of comment occurred rather than tone, to avoid having to code comments as having a certain tone and possibly injecting bias into the process. Sentiment analysis models, even state-of-the-art ones, often <a href="https://twitter.com/danluu/status/896176897675153409">return nonsensical results</a>, so this basically has to be done by hand, at least today. With much more data, some kind of sentiment analysis, done with liberal spot checking and re-training of the model, could work, but the total number of comments is so small in this case that it would amount to coding each comment by hand.</p> <p>Coding comments manually in an unbiased fashion can also be done with a level of blinding, but doing that would probably require getting more people involved (since I see and hear comments while I'm playing) and relying on unpaid or poorly paid labor.</p> <h3 id="being-told-how-to-play">Being told how to play</h3> <p>The most striking, easy-to-quantify difference was the rate at which I played games in which people told me how I should play. Since it's unclear how much confidence we should have in the difference if we just look at the raw rates, we'll use a simple statistical model to get the <a href="https://statmodeling.stat.columbia.edu/2010/12/21/lets_say_uncert/">uncertainty interval</a> around the estimates. Since I'm not sure what my belief about this should be, this uses <a href="https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors">an uninformative prior</a>, so the estimate is close to the actual rate. Anyway, here are the uncertainty intervals a simple model puts on the percent of games where at least one person told me I was playing wrong, that I should change how I'm playing, or that I should switch characters:</p> <p><style>table {border-collapse: collapse;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <div style="overflow-x:auto;"></p> <table> <thead> <tr> <th>Cond</th> <th>Est</th> <th>P25</th> <th>P75</th> </tr> </thead> <tbody> <tr> <td>F comp</td> <td>19</td> <td>13</td> <td>25</td> </tr> <tr> <td>M comp</td> <td>6</td> <td>2</td> <td>10</td> </tr> <tr> <td>F QP</td> <td>4</td> <td>3</td> <td>6</td> </tr> <tr> <td>M QP</td> <td>1</td> <td>0</td> <td>2</td> </tr> </tbody> </table> <p>The experimental conditions in this table are masculine vs. feminine name (M/F) and competitive mode vs. quick play (comp/QP). The numbers are percents. <code>Est</code> is the estimate, <code>P25</code> is the 25%-ile estimate, and <code>P75</code> is the 75%-ile estimate. Competitive mode and using a feminine name are both correlated with being told how to play. See <a href="http://andrewgelman.com/2016/11/05/why-i-prefer-50-to-95-intervals/">this post by Andrew Gelman for why you might want to look at the 50% interval instead of the 95% interval</a>.</p>
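<p>For the curious, here's roughly how intervals like the ones in the table can be computed under a simple Beta-Binomial model with a uniform (uninformative) prior. This is an illustrative sketch rather than the exact code used for the table, and the counts passed in at the bottom are made up, not the real data:</p> <pre><code>from scipy.stats import beta

def interval(games_with_comment, games_total):
    # Posterior over the per-game rate with a uniform Beta(1, 1) prior
    posterior = beta(1 + games_with_comment,
                     1 + games_total - games_with_comment)
    est, p25, p75 = posterior.ppf([0.5, 0.25, 0.75])
    return 100 * est, 100 * p25, 100 * p75   # as percents of games

print(interval(4, 20))   # e.g., told how to play in 4 of 20 games (hypothetical counts)
</code></pre>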
<p>For people not familiar with Overwatch, in competitive mode, you're explicitly told what your ELO-like rating is and you get a badge that reflects your rating. In quick play, you have a rating that's tracked, but it's never directly surfaced to the user and you don't get a badge.</p> <p>It's generally believed that people are more on edge during competitive play and are more likely to lash out (and, for example, tell you how you should play). The data is consistent with this common belief.</p> <p>Per above, I didn't want to code tone of messages to avoid bias, so this table only indicates the rate at which people told me I was playing incorrectly or asked that I switch to a different character. The qualitative difference in experience is understated by this table. For example, the one time someone asked me to switch characters in the masculine condition, the request was a one-sentence, polite request (&quot;hey, we're dying too quickly, could we switch [from the standard one primary healer / one off healer setup] to double primary healer or switch our tank to [a tank that can block more damage]?&quot;). When using the feminine name, a typical case would involve 1-4 people calling me human garbage for most of the game and consoling themselves with the idea that the entire reason our team is losing is that I won't change characters.</p> <p>The simple model we're using indicates that there's probably a difference both between competitive and QP and between playing with a masculine vs. a feminine name. However, most published results are pretty bogus, so let's look at reasons this result might be bogus and then you can decide for yourself.</p> <h3 id="threats-to-validity">Threats to validity</h3> <p>The biggest issue is that this wasn't a <a href="https://en.wikipedia.org/wiki/Trial_registration">pre-registered trial</a>. I'm obviously not going to go and officially register a trial like this, but I also didn't informally &quot;register&quot; this by having this comparison in mind when I started the experiment. A problem with non-pre-registered trials is that there are a lot of degrees of freedom, both in terms of what we could look at, and in terms of the methodology we used to look at things, so it's unclear if the result is &quot;real&quot; or an artifact of fishing for something that looks interesting. A standard example of this is that, if you look for 100 possible effects, you're likely to find 1 that appears to be statistically significant with p = 0.01.</p> <p>There are standard techniques to correct for this problem (e.g., <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>), but I don't find these convincing because they usually don't capture all of the degrees of freedom that go into a statistical model. An example is that it's common to take a variable and discretize it into a few buckets. There are many ways to do this and you generally won't see papers talk about the impact of this or correct for this in any way, although changing how these buckets are arranged can drastically change the results of a study. 
Another common knob people can use to manipulate results is curve fitting to an inappropriate curve (often a 2nd or 3rd degree polynomial when a scatterplot shows that's clearly incorrect). Another way to handle this would be to use <a href="http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf">a more complex model</a>, but I wanted to keep this as simple as possible.</p> <p>If I wanted to really be convinced on this, I'd want to, at a minimum, re-run this experiment with this exact comparison in mind. As a result, this experiment would need to be replicated to provide more than a preliminary result that is, at best, weak evidence.</p> <p>One other large class of problems with randomized controlled trials (RCTs) is that, despite randomization, the two arms of the experiment might be different in some way that wasn't randomized. Since Overwatch doesn't allow you to keep changing your name, this experiment was done with two different accounts and these accounts had different ratings in competitive mode. On average, the masculine account had a higher rating due to starting with a higher rating, which meant that I was playing against stronger players and having worse games on the masculine account. In the long run, this will even out, but since most games in this experiment were in QP, this didn't have time to even out in comp. As a result, I had a higher win rate as well as just generally much better games with the feminine account in comp.</p> <p>With no other information, we might expect that people who are playing worse get told how to play more frequently and people who are playing better should get told how to play less frequently, which would mean that the table above understates the actual difference.</p> <p>However, <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0131613">Kasumovic et al., in a gender-based randomized trial of Halo 3</a>, found that players who were playing poorly were more negative towards women, especially women who were playing well (there's enough statistical manipulation of the data that a statement this concise can only be roughly correct; see the study for details). If that result holds, it's possible that I would've gotten fewer people telling me that I'm human garbage and need to switch characters if I was average instead of dominating most of my games in the feminine condition.</p> <p>If that result generalizes to OW, that would explain something I thought was odd, which was that a lot of demands to switch and general vitriol came during my best performances with the feminine account. A typical example of this would be a game where we have a 2-2-2 team composition (2 players playing each of the three roles in the game) where my counterpart in the same role ran into the enemy team and died at the beginning of the fight in almost every engagement. I happened to be having a good day and dominated the other team (37-2 in a ten minute comp game, while focusing on protecting our team's healers) while only dying twice, once on purpose as a sacrifice and a second time after a stupid blunder. 
Immediately after I died, someone asked me to switch roles so they could take over for me, but at no point did someone ask the other player in my role to switch despite their total uselessness all game (for OW players this was a Rein who immediately charged into the middle of the enemy team at every opportunity, from a range where our team could not possibly support them; this was Hanamura 2CP, where it's very easy for Rein to set up situations where their team cannot help them). This kind of performance was typical of games where my team jumped on me for playing incorrectly. This isn't to say I didn't have bad games; I had plenty of bad games, but a disproportionate number of the most toxic experiences came when I was having a great game.</p> <p>I tracked how well I did in games, but this sample doesn't have enough ranty games to do a meaningful statistical analysis of my performance vs. probability of getting thrown under the bus.</p> <p>Games at different ratings are probably also generally different environments and get different comments, but it's not clear if there are more negative comments at 2000 than 2500 or vice versa. There are a lot of online debates about this; for any rating level other than the very lowest or the very highest ratings, you can find a lot of people who say that the rating band they're in has the highest volume of toxic comments.</p> <h3 id="other-differences">Other differences</h3> <p>Here are some things that happened while playing with the feminine name that didn't happen with the masculine name during this experiment or in any game outside of this experiment:</p> <ul> <li>unsolicited &quot;friend&quot; requests from people I had no textual or verbal interaction with (happened 7 times total, didn't track which cases were in the experiment and which weren't)</li> <li>someone on the other team deciding that my team wasn't doing a good enough job of protecting me while I was playing healer, berating my team, and then throwing the game so that we won (happened once during the experiment)</li> <li>someone on my team flirting with me and then flipping out when I didn't respond, then spending the rest of the game calling me autistic or toxic (this happened once during the experiment, and once while playing in a game not included in the experiment)</li> </ul> <p>The rate of all these was low enough that I'd have to play many more games to observe something without a huge uncertainty interval.</p> <p>I didn't accept any friend requests from people I had no interaction with. Anecdotally, some people report that people will send sexual comments or berate them after an unsolicited friend request. It's possible that the effect shown in the table would be larger if I accepted these friend requests, and it couldn't be smaller.</p> <p>I didn't attempt to classify comments as flirty or not because, unlike the kinds of comments I did classify, this is often somewhat subtle and you could make a good case that any particular comment is or isn't flirting. Without responding (which I didn't do), many of these kinds of comments are ambiguous.</p> <p>Another difference was in the tone of the compliments. 
The rate of games where I was complimented wasn't too different, but compliments under the masculine condition tended to be short and factual (e.g., someone from the other team saying &quot;no answer for [name of character I was playing]&quot; after a dominant game) and compliments under the feminine condition tended to be more effusive and multiple people would sometimes chime in about how great I was.</p> <h3 id="non-differences">Non differences</h3> <p>The rate of compliments and the rate of insults in games that didn't include explanations of how I'm playing wrong or how I need to switch characters were similar in both conditions.</p> <h3 id="other-factors">Other factors</h3> <p>Some other factors that would be interesting to look at would be time of day, server, playing solo or in a group, specific character choice, being more or less communicative, etc., but it would take a lot more data to be able to get good estimates when adding in more variables. Blizzard should have the data necessary to do analyses like this in aggregate, but they're notoriously private with their data, so someone at Blizzard would have to do the work and then publish it publicly, and they're not really in the habit of doing that kind of thing. If you work at Blizzard and are interested in letting a third party do some analysis on an anonymized data set, let me know and I'd be happy to dig in.</p> <h3 id="experimental-minutiae">Experimental minutiae</h3> <p>Under both conditions, I avoided ever using voice chat and would call things out in text chat when time permitted. Also under both conditions, I mostly filled in with whatever character class the team needed most, although I'd sometimes pick DPS even when it was unnecessary (in general, DPS are heavily oversubscribed, so you'll rarely play DPS if you don't pick one).</p> <p>For quickplay, backfill games weren't counted (backfill games are games where you join after the game started to fill in for a player who left; comp doesn't allow backfills). 6% of QP games were backfills.</p> <p>These games are from before the &quot;endorsements&quot; patch; most games were played around May 2018. All games were played in &quot;solo q&quot; (with 5 random teammates). In order to avoid correlations between games depending on how long playing sessions were, I quit between games and waited for enough time (since you're otherwise likely to end up in a game with some or many of the same players as before).</p> <p>The model used the probability of a comment happening in a game to avoid the problem that Kasumovic et al. ran into, where a person who's ranting can skew the total number of comments. Kasumovic et al. addressed this by removing outliers, but I really don't like manually reaching in and removing data to adjust results. This could also be addressed by using a more sophisticated model, but a more sophisticated model means more knobs, which means more ways for bias to sneak in. Using the number of players who made comments instead would be one way to mitigate this problem, but I think this still isn't ideal because these aren't independent -- when one player starts being negative, this greatly increases the odds that another player in that game will be negative, but just using the number of players makes four games with one negative person the same as one game with four negative people. 
This can also be accounted for with a slightly more sophisticated model, but that also involves adding more knobs to the model.</p> <h3 id="update-98-ile">UPDATE: 98%-ile</h3> <p>One of the more common comments I got after I wrote this post was that it's only valid at &quot;low&quot; ratings, like Plat, which is 50%-ile. If someone is going to concede that a game's community is toxic at 50%-ile and that you have to be significantly better than that to avoid toxic players, that seems to be conceding that the game's community is toxic.</p> <p>However, to check whether that's accurate, I played a bit more, in games as high as 98%-ile, to see if things improved. While there was a minor improvement, it's not fundamentally different at 98%-ile, so people who are saying that things are much better at higher ranks either have very different experiences than I did or are referring to 99%-ile or above. If it's the latter, then I'd say that the previous comment about conceding that the game has a toxic community holds. If it's the former, perhaps I just got unlucky, but based on other people's comments about their experiences with the game, I don't think I got particularly unlucky.</p> <h3 id="appendix-comments-advice-to-overwatch-players">Appendix: comments / advice to Overwatch players</h3> <p>A common complaint, perhaps the most common complaint by people below 2000 SR (roughly <a href="https://us.forums.blizzard.com/en/overwatch/t/competitive-mode-tier-distribution/972">30%-ile</a>) or perhaps 1500 SR (roughly 10%-ile), is that they're in &quot;ELO hell&quot; and are kept down because their teammates are too bad. Based on my experience, I find this to be extremely unlikely.</p> <p>People often split skill up into &quot;mechanics&quot; and &quot;gamesense&quot;. My mechanics are pretty much as bad as it's possible to get. The last game I played seriously was a 90s video game that's basically <a href="https://en.wikipedia.org/wiki/SubSpace_(video_game)">online asteroids</a> and the last game before that I put any time into was the original SNES <a href="https://en.wikipedia.org/wiki/Super_Mario_Kart">super mario kart</a>. As you'd expect from someone who hasn't put significant time into a post-90s video game or any kind of FPS game, my aim and dodging are both atrocious. On top of that, I'm an old dude with slow reflexes, and I was able to get to 2500 SR (<a href="https://us.forums.blizzard.com/en/overwatch/t/competitive-mode-tier-distribution/972">roughly 60%-ile</a> among players who play &quot;competitive&quot;, likely higher among all players) by avoiding a few basic fallacies and blunders despite having approximately zero mechanical skill. If you're also an old dude with basically no FPS experience, you can do the same thing; if you have good reflexes or enough FPS experience to actually aim or dodge, you basically can't be worse mechanically than I am and you can do much better by avoiding a few basic mistakes.</p> <p>The most common fallacy I see repeated is that you have to play DPS to move out of bronze or gold. The evidence people give for this is that, when a GM streamer plays flex, tank, or healer, they sometimes lose in bronze. I guess the idea is that, because the only way to ensure a 99.9% win rate in bronze is to be a GM-level DPS player and play DPS, the best way to maintain a 55% or a 60% win rate is to play DPS, but this doesn't follow.</p> <p>Healers and tanks are both very powerful in low ranks. 
Because low ranks feature both poor coordination and relatively poor aim (players with good coordination or aim tend to move up quickly), time-to-kill is very long compared to higher ranks. As a result, an off healer can tilt the result of a 1v1 (and sometimes even a 2v1) matchup, and a primary healer can often determine the result of a 2v1 matchup. Because coordination is poor, most matchups end up being 2v1 or 1v1. The flip side of the lack of coordination is that you'll almost never get help from teammates. It's common to see an enemy player walk into the middle of my team, attack someone, and then walk out while literally no one else notices. If the person being attacked is you, the other healer typically won't notice and will continue healing someone at full health, and none of the classic &quot;peel&quot; characters will help or even notice what's happening. That means it's on you to pay attention to your surroundings and watch flank routes to avoid getting murdered.</p> <p>If you can avoid getting murdered constantly and actually try to heal (as opposed to many healers at low ranks, who will try to kill people or stick to a single character and continue healing them all the time even if they're at full health), you'll outheal a primary healer half the time when playing an off healer and, as a primary healer, you'll usually be able to get 10k-12k healing per 10 minutes compared to 6k to 8k for most people in Silver (sometimes less if they're playing DPS Moira). The extra 4k or so of healing is roughly half of what a typical healer at that rank provides, so it's like having an extra half a healer on your team, which basically makes the game 6.5v6 instead of 6v6. You can still lose a 6.5v6 game, and you'll lose plenty of games, but if you're consistently healing 50% more than a normal healer at your rank, you'll tend to move up even if you get a lot of major things wrong (heal order, healing when that only feeds the other team, etc.).</p> <p>A corollary to having to watch out for yourself 95% of the time when playing a healer is that, as a character who can peel, you can actually watch out for your teammates and put your team at a significant advantage in 95% of games. As Zarya or Hog, if you just boringly play towards the front of your team, you can basically always save at least one teammate from death in a team fight, and you can often do this 2 or 3 times. Meanwhile, your counterpart on the other team is walking around looking for 1v1 matchups. If they find a good one, they'll probably kill someone, and if they don't (if they run into someone with a mobility skill or a counter like Brig or Reaper), they won't. Even in the case where they kill someone and you don't do a lot, you still provide as much value as they do and, on average, you'll provide more value. A similar thing is true of many DPS characters, although it depends on the character (e.g., McCree is effective as a peeler, at least at the low ranks that I've played in). If you play a non-sniper DPS that isn't suited for peeling, you can find a DPS on your team who's looking for 1v1 fights and turn those fights into 2v1 fights (at low ranks, there's no shortage of these folks on both teams, so there are plenty of 1v1 fights you can control by making them 2v1).</p> <p>All of the things I've mentioned amount to actually trying to help your team instead of going for flashy PotG setups or trying to dominate the entire team by yourself. If you say this in the abstract, it seems obvious, but most people think they're better than their rating. 
It doesn't help that OW is designed to make people think they're doing well when they're not, and that the best way to get &quot;medals&quot; or &quot;play of the game&quot; is to play in a way that severely reduces your odds of actually winning each game.</p> <p>Outside of obvious gameplay mistakes, the other big thing that loses games is when someone tilts and either starts playing terribly or flips out and says something to enrage someone else on the team, who then starts playing terribly. I don't think you can do much about this in others directly, but you can make sure you never do it yourself, which means only 5/6 of your team will do this at some base rate, whereas all 6/6 of the other team will. Like all of the above, this won't cause you to win all of your games, but everything you do that increases your win rate makes a difference.</p> <p>Poker players have the right attitude when they talk about leaks. The goal isn't to win every hand, it's to increase your EV by avoiding bad blunders (at high levels, it's about more than avoiding bad blunders, but we're talking about getting out of below-median ranks here, not becoming GM). You're going to have terrible games where you get 5 people instalocking DPS. Your odds of winning such a game are low, say 10%. If you get mad and pick DPS too, reducing your odds even further (say, to 2%), all that does is create a leak in your win rate during games when your teammates are being silly.</p> <p>If you gain/lose 25 rating per game for a win or a loss, your average rating change from a game is <code>25 * (W_rate - L_rate) = 25 * (2 * W_rate - 1)</code>. Let's say 1/40 games are these silly games where your team decides to go all DPS. The per-game SR difference of trying to win these vs. soft throwing is maybe something like <code>1/40 * 25 * (2 * 0.08) = 0.1</code>. That doesn't sound like much, and these numbers are just guesses, but everyone outside of very high-level games is full of leaks like these, and they add up. And if you look at a 60% win rate, which is pretty good considering that your influence is limited because you're only one person on a 6-person team, that only translates to an average of 5 SR per game, so it doesn't actually take that many small leaks to really move your average SR gain or loss.</p>
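<p>As a concrete (if rough) illustration of that arithmetic, here's a small R sketch using the same guesses as above (25 SR per game, 1-in-40 &quot;silly&quot; games, and 10% vs. 2% win odds); the numbers are illustrative, not measurements:</p> <pre><code>sr_per_game &lt;- function(win_rate, k = 25) k * (2 * win_rate - 1)

sr_per_game(0.6)  # a 60% win rate only averages +5 SR per game

# per-game leak from soft throwing the 1-in-40 silly games
(1 / 40) * (sr_per_game(0.10) - sr_per_game(0.02))  # = 0.1 SR per game
</code></pre>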
<h3 id="appendix-general-comments-on-online-gaming-20-years-ago-vs-today">Appendix: general comments on online gaming, 20 years ago vs today</h3> <p>Since I'm unlikely to write another blog post on gaming any time soon, here are some other random thoughts that won't fit with any other post. My last serious experience with online games was with a game from the 90s. Even though I'd heard that things were a lot worse now, I was still surprised by it. IRL, the only time I encounter the same level and rate of pointless nastiness in a recreational activity is down at the bridge club (casual bridge games tend to be very nice). When I say pointless nastiness, I mean things like getting angry and then making nasty comments to a teammate mid-game. Even if your &quot;criticism&quot; is correct (and, if you review OW games or bridge hands, you'll see that these kinds of angry comments are almost never correct), this has virtually no chance of getting your partner to change their behavior and it has a pretty good chance of tilting them and making them play worse. If you're trying to win, there's no reason to do this and good reason to avoid it.</p> <p>If you look at the online commentary for this, it's common to see <a href="https://www.reddit.com/r/OverwatchUniversity/comments/arx09u/people_throwing_because_of_my_age/egq89fm/">people blaming kids</a>, but this doesn't match my experience at all. For one thing, when I was playing video games in the 90s, a huge fraction of the online gaming population was made up of kids, and online game communities were nicer than they are today. Saying that &quot;kids nowadays&quot; are worse than kids used to be is a pastime that goes back thousands of years, but it's generally not true and there doesn't seem to be any reason to think that it's true here.</p> <p>Additionally, this simply doesn't match what I saw. If I just look at comments over audio chat, there were a couple of times when some kids were nasty, but almost all of the comments were from people who sounded like adults. Moreover, if I look at when I played games that were bad, a disproportionately large number of those games were played late (after 2am eastern time, on the central/east server), when the relative population of adults is larger.</p> <p>And if we look at bridge, the median age of an <a href="https://www.acbl.org/">ACBL</a> member is in the 70s, with an increase in age of a whopping <a href="https://www.lajollabridge.com/Articles/EveningClubBridgeDecline.htm">0.4 years per year</a>.</p> <p>Sure, maybe people tend to get more mature as they age, but in any particular activity, that effect seems to be dominated by other factors. I don't have enough data at hand to make a good guess as to what happened, but I'm entertained by the idea that <a href="http://jadagul.tumblr.com/post/175863849718/xhxhxhx">this</a> might have something to do with it:</p> <blockquote> <p>I’ve said this before, but one of the single biggest culture shocks I’ve ever received was when I was talking to someone about five years younger than I was, and she said “Wait, you play video games? I’m surprised. You seem like way too much of a nerd to play video games. Isn’t that like a fratboy jock thing?”</p> </blockquote> <h3 id="appendix-faq">Appendix: FAQ</h3> <p>Here are some responses to the most common online comments.</p> <blockquote> <p>Plat? You suck at Overwatch</p> </blockquote> <p>Yep. But I sucked roughly equally on both accounts (actually somewhat more on the masculine account because it was rated higher and I was playing a bit out of my depth). Also, that's not a question.</p> <blockquote> <p>This is just a blog post, it's not an academic study, the results are crap.</p> </blockquote> <p>There's nothing magic about academic papers. I have my name on a few publications, including one that won the best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.</p> <p>When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good, and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.</p> <blockquote> <p>Gamers have always been toxic, that's just nostalgia talking.</p> </blockquote> <p>If I pull game logs for subspace, this seems to be false. YMMV depending on what games you played, I suppose. 
FWIW, airmash seems to be the modern version of subspace, and (until the game died) it was much more toxic than subspace even if you just compare on a per-game basis, despite having much smaller games (25 people for a good-sized game in airmash, vs. 95 for subspace).</p> <blockquote> <p>This is totally invalid because you didn't talk on voice chat.</p> </blockquote> <p>At the ranks I played, not talking on voice was the norm. It would be nice to have talking or not talking on voice chat be an independent variable, but that would require playing even more games to get data for another set of conditions, and if I wasn't going to do that, choosing the condition that's most common doesn't make the entire experiment invalid, IMO.</p> <p>Some people report that, post &quot;endorsements&quot; patch, talking on voice chat is much more common. I tested this out by playing 20 (non-comp) games just after the &quot;Paris&quot; patch. Three had comments on voice chat. One was someone playing random music clips, one had someone screaming at someone else for playing incorrectly, and one had useful callouts on voice chat. It's possible I'd see something different with more games or in comp, but I don't think it's obvious that voice chat is common for most people after the &quot;endorsements&quot; patch.</p> <h3 id="appendix-code-and-data">Appendix: code and data</h3> <p>If you want to play with this data and model yourself, experiment with different priors, run a posterior predictive check, etc., here's a snippet of R code that embeds the data:</p> <pre><code>library(brms)
library(modelr)
library(tidybayes)
library(tidyverse)

d &lt;- tribble(
  ~game_type, ~gender, ~xplain, ~games,
  &quot;comp&quot;, &quot;female&quot;, 7, 35,
  &quot;comp&quot;, &quot;male&quot;, 1, 23,
  &quot;qp&quot;, &quot;female&quot;, 6, 149,
  &quot;qp&quot;, &quot;male&quot;, 2, 132
)

d &lt;- d %&gt;% mutate(female = ifelse(gender == &quot;female&quot;, 1, 0),
                  comp = ifelse(game_type == &quot;comp&quot;, 1, 0))

result &lt;- brm(data = d, family = binomial,
              xplain | trials(games) ~ female + comp,
              prior = c(set_prior(&quot;normal(0,10)&quot;, class = &quot;b&quot;)),
              iter = 25000, warmup = 500, cores = 4, chains = 4)
</code></pre> <p>The model here is simple enough that I wouldn't expect the version of software used to significantly affect results, but in case you're curious, this was done with <code>brms 2.7.0</code>, <code>rstan 2.18.2</code>, on <code>R 3.5.1</code>.</p>
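<p>If you fit the model and want a quick look at the results, the standard <code>brms</code> accessors should be enough; this is just a sketch of the posterior predictive check mentioned above, not part of the original analysis:</p> <pre><code>summary(result)   # posterior estimates for the female and comp coefficients
pp_check(result)  # graphical posterior predictive check of the fitted model
</code></pre>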
<p><i>Thanks to Leah Hanson, Sean Talts and Sean's math/stats reading group, Annie Cherkaev, Robert Schuessler, Wesley Aptekar-Cassels, Julia Evans, Paul Gowder, Jonathan Dahan, Bradley Boccuzzi, Akiva Leffert, and one or more anonymous commenters for comments/corrections/discussion.</i></p> <p><link rel="prefetch" href="//danluu.com"> <link rel="prefetch" href="//danluu.com/about/"> <link rel="prefetch" href="//danluu.com/input-lag/"></p> Fsyncgate: errors on fsync are unrecoverable fsyncgate/ Wed, 28 Mar 2018 00:00:00 +0000 fsyncgate/ <p><small><em>This is an archive of the original &quot;fsyncgate&quot; email thread. This is posted here because I wanted to have a link that would fit on a slide for <a href="//danluu.com/deconstruct-files/">a talk on file safety</a> with <a href="//danluu.com/web-bloat/">a mobile-friendly non-bloated format</a>.</em></small></p> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Subject:Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS Date:2018-03-28 02:23:46 </code></pre> <p>Hi all</p> <p>Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing checkpointing to continue past what should've been a fatal error.</p> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn't advance the redo start position in the control file.</p> <p>All good so far.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> <p>The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.</p> <p>The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO unless you have a very new linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard we are not given any guarantees about what happens on fsync() failure at all, so we're probably wrong to assume that retrying fsync() is safe.</p> <p>If the server had been using ext3 or ext4 with errors=remount-ro, the problem wouldn't have occurred because the first I/O error would've remounted the FS and stopped Pg from continuing. But XFS doesn't have that option. There may be other situations where this can occur too, involving LVM and/or multipath, but I haven't comprehensively dug out the details yet.</p> <p>It proved possible to recover the system by faking up a backup label from before the first incorrectly-successful checkpoint, forcing redo to repeat and write the lost blocks. But ... what a mess.</p> <p>I posted about the underlying fsync issue here some time ago:</p> <p><a href="https://stackoverflow.com/q/42434872/398670">https://stackoverflow.com/q/42434872/398670</a></p> <p>but haven't had a chance to follow up about the Pg specifics.</p> <p>I've been looking at the problem on and off and haven't come up with a good answer. I think we should just PANIC and let redo sort it out by repeating the failed write when it repeats work since the last checkpoint.</p> <p>The API offered by async buffered writes and fsync offers us no way to find out which page failed, so we can't just selectively redo that write. I think we do know the relfilenode associated with the fd that failed to fsync, but not much more. 
So the alternative seems to be some sort of potentially complex online-redo scheme where we replay WAL only the relation on which we had the fsync() error, while otherwise servicing queries normally. That's likely to be extremely error-prone and hard to test, and it's trying to solve a case where on other filesystems the whole DB would grind to a halt anyway.</p> <p>I looked into whether we can solve it with use of the AIO API instead, but the mess is even worse there - from what I can tell you can't even reliably guarantee fsync at all on all Linux kernel versions.</p> <p>We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks at least for EIO. This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.</p> <p>There are rather a lot of pg_fsync() callers. While we could handle this case-by-case for each one, I'm tempted to just make pg_fsync() itself intercept EIO and PANIC. Thoughts?</p> <hr> <pre><code>From:Tom Lane &lt;tgl(at)sss(dot)pgh(dot)pa(dot)us&gt; Date:2018-03-28 03:53:08 </code></pre> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> <blockquote> <p>Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> </blockquote> <p>If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.</p> <p>Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-03-29 02:30:59 </code></pre> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 02:48:27 </code></pre> <p>On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>Craig, is the phenomenon you described the same as the second issue &quot;Reporting writeback errors&quot; discussed in this article?</p> <p><a href="https://lwn.net/Articles/724307/">https://lwn.net/Articles/724307/</a></p> <p>&quot;Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen.&quot;</p> <p>That's... 
I'm speechless.</p> <hr> <pre><code>From:Justin Pryzby &lt;pryzby(at)telsasoft(dot)com&gt; Date:2018-03-29 05:00:31 </code></pre> <p>On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>The retries are the source of the problem ; the first fsync() can return EIO, and also <em>clears the error</em> causing a 2nd fsync (of the same data) to return success.</p> <p>(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).</p> <p>On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:</p> <blockquote> <p>Craig, is the phenomenon you described the same as the second issue &quot;Reporting writeback errors&quot; discussed in this article? <a href="https://lwn.net/Articles/724307/">https://lwn.net/Articles/724307/</a></p> </blockquote> <p>Worse, the article acknowledges the behavior without apparently suggesting to change it:</p> <p>&quot;Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION.&quot;</p> <p>I believe I reproduced the problem behavior using dmsetup &quot;error&quot; target, see attached.</p> <p>strace looks like this:</p> <p>kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux</p> <pre><code>1open(&quot;/dev/mapper/eio&quot;, O_RDWR|O_CREAT, 0600) = 3 2write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 3write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 4write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 5write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 6write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 7write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 8write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 2560 9write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = -1 ENOSPC (No space left on device) 10dup(2) = 4 11fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE) 12brk(NULL) = 0x1299000 13brk(0x12ba000) = 0x12ba000 14fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 15write(4, &quot;write(1): No space left on devic&quot;..., 34write(1): No space left on device 16) = 34 17close(4) = 0 18fsync(3) = -1 EIO (Input/output error) 19dup(2) = 4 20fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE) 21fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 22write(4, &quot;fsync(1): Input/output error\n&quot;, 29fsync(1): Input/output error 23) = 29 24close(4) = 0 25close(3) = 0 26open(&quot;/dev/mapper/eio&quot;, O_RDWR|O_CREAT, 0600) = 3 27fsync(3) = 0 
28write(3, &quot;\0&quot;, 1) = 1 29fsync(3) = 0 30exit_group(0) = ? </code></pre> <p>2: EIO isn't seen initially due to writeback page cache;</p> <p>9: ENOSPC due to small device</p> <p>18: original IO error reported by fsync, good</p> <p>25: the original FD is closed</p> <p>26: ..and file reopened</p> <p>27: fsync on file with still-dirty data+EIO returns success BAD</p> <p>10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror should write to a separate FD (?)</p> <p>Also note, close() ALSO returned success..which you might think exonerates the 2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd byte certainly never got written to DM error, and the failure status was lost following fsync().</p> <p>I get the exact same behavior if I break after one write() loop, such as to avoid ENOSPC.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 05:06:22 </code></pre> <p>On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:</p> <blockquote> <p>The retries are the source of the problem ; the first fsync() can return EIO, and also <em>clears the error</em> causing a 2nd fsync (of the same data) to return success.</p> </blockquote> <p>What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, <em>the page is still dirty</em>. So all future calls to fsync() need to try to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:25:51 </code></pre> <p>On 29 March 2018 at 13:06, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:</p> <blockquote> <p>The retries are the source of the problem ; the first fsync() can return EIO, and also <em>clears the error</em> causing a 2nd fsync (of the same data) to return success.</p> </blockquote> <p>What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, <em>the page is still dirty</em>. So all future calls to fsync() need to try to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around). <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a></p> </blockquote> <p>You'd think so. But it doesn't appear to work that way. You can see yourself with the error device-mapper destination mapped over part of a volume.</p> <p>I wrote a test case here.</p> <p><a href="https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c">https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c</a></p> <p>I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that's what I've described is just what's happening, brain-dead or no.</p> <p>Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. 
See the stackoverflow post for details on what I found in kernel code analysis.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:32:43 </code></pre> <p>On 29 March 2018 at 10:48, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>Craig, is the phenomenon you described the same as the second issue &quot;Reporting writeback errors&quot; discussed in this article?</p> <p><a href="https://lwn.net/Articles/724307/">https://lwn.net/Articles/724307/</a></p> </blockquote> <p>A variant of it, by the looks.</p> <p>The problem in our case is that the kernel only tells us about the error once. It then forgets about it. So yes, that seems like a variant of the statement:</p> <blockquote> <p>&quot;Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen.&quot;</p> <p>That's... I'm speechless.</p> </blockquote> <p>Yeah.</p> <p>It's a bit nuts.</p> <p>I was astonished when I saw the behaviour, and that it appears undocumented.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:35:47 </code></pre> <p>On 29 March 2018 at 10:30, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>I covered this in my original post.</p> <p>Yes, we check the return value. But what do we do about it? For fsyncs of heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint later, which will retry the fsync(). <strong>Which will now appear to succeed</strong> because the kernel forgot that it lost our writes after telling us the first time. So we do check the error code, which returns success, and we complete the checkpoint and move on.</p> <p>But we only retried the fsync, not the writes before the fsync.</p> <p>So we lost data. 
Or rather, failed to detect that the kernel did so, so our checkpoint was bad and could not be completed.</p> <p>The problem is that we keep retrying checkpoints <em>without</em> repeating the writes leading up to the checkpoint, and retrying fsync.</p> <p>I don't pretend the kernel behaviour is sane, but we'd better deal with it anyway.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:58:45 </code></pre> <p>On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.</p> <p>It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.</p> <p>I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so no further fsync() or anything else can succeed on the FD, and an app that cares about durability must repeat the whole set of work since the prior succesful fsync(). Just reporting it once and forgetting it is madness.</p> <p>But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.</p> <p>Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there. <a href="https://stackoverflow.com/a/42436054/398670">https://stackoverflow.com/a/42436054/398670</a></p> <blockquote> <blockquote> <p>Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> </blockquote> <p>If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.</p> </blockquote> <p>It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything <em>after</em> that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.</p> <p>Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.</p> <blockquote> <p>Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.</p> </blockquote> <p>I can't find anything that says so to me. Please quote relevant spec.</p> <p>I'm working from <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html">http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html</a> which states that</p> <p>&quot;The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. 
The fsync() function shall not return until the system has completed that action or until an error is detected.&quot;</p> <p>My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:</p> <p>&quot;If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.&quot;</p> <p>but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some IO operations there's no guarantee those operations are ever completed even after a subsequent fsync returns success.</p> <p>I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 12:07:56 </code></pre> <p>On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.</p> </blockquote> <p>I found your discussion with kernel hacker Jeff Layton at <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a> in which he said: &quot;The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.&quot;</p> <p>The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.</p> <p>If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.</p> <p>Yeah, I see why you want to PANIC.</p> <blockquote> <blockquote> <p>Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.</p> </blockquote> <p>I can't find anything that says so to me. Please quote relevant spec.</p> <p>I'm working from <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html">http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html</a> which states that</p> <p>&quot;The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.&quot;</p> <p>My reading is that POSIX does not specify what happens AFTER an error is detected. 
It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:</p> </blockquote> <p>FWIW my reading is the same as Tom's. It says &quot;all data for the open file descriptor&quot; without qualification or special treatment after errors. Not &quot;some&quot;.</p> <blockquote> <p>I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.</p> </blockquote> <p>I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be &quot;a mess&quot; and &quot;a surprise&quot; in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment &quot;Failed write, redirty.&quot;</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 13:15:10 </code></pre> <p>On 29 March 2018 at 20:07, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.</p> </blockquote> <p>I found your discussion with kernel hacker Jeff Layton at <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a> in which he said: &quot;The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.&quot;</p> <p>The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.</p> <p>If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.</p> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>In more ways than one ;)</p> <blockquote> <p>I'm not seeking to defend what the kernel seems to be doing. Rather, saying</p> <blockquote> <p>that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.</p> </blockquote> <p>I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be &quot;a mess&quot; and &quot;a surprise&quot; in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... 
see the code near the comment &quot;Failed write, redirty.&quot;</p> </blockquote> <p>Ok, that's reassuring, but doesn't help us on the platform the great majority of users deploy on :(</p> <p>&quot;If on Linux, PANIC&quot;</p> <p>Hrm.</p> <hr> <pre><code>From:Catalin Iacob &lt;iacobcatalin(at)gmail(dot)com&gt; Date:2018-03-29 16:20:00 </code></pre> <p>On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>I found your discussion with kernel hacker Jeff Layton at <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a> in which he said: &quot;The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.&quot;</p> </blockquote> <p>And a bit below in the same comments, to this question about PG: &quot;So, what are the options at this point? The assumption was that we can repeat the fsync (which as you point out is not the case), or shut down the database and perform recovery from WAL&quot;, the same Jeff Layton seems to agree PANIC is the appropriate response: &quot;Replaying the WAL synchronously sounds like the simplest approach when you get an error on fsync. These are uncommon occurrences for the most part, so having to fall back to slow, synchronous error recovery modes when this occurs is probably what you want to do.&quot;. And right after, he confirms the errseq_t patches are about always detecting this, not more: &quot;The main thing I working on is to better guarantee is that you actually get an error when this occurs rather than silently corrupting your data. The circumstances where that can occur require some corner-cases, but I think we need to make sure that it doesn't occur.&quot;</p> <p>Jeff's comments in the pull request that merged errseq_t are worth reading as well: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750</a></p> <blockquote> <p>The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.</p> <p>If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.</p> </blockquote> <p>Indeed, that's exactly how I read it as well (opinion formed independently before reading your sentence above). The errseq_t patches landed in v4.13 by the way, so very recently.</p> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. 
This is messy.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 21:18:14 </code></pre> <p>On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob <iacobcatalin(at)gmail(dot)com> wrote:</p> <blockquote> <p>Jeff's comments in the pull request that merged errseq_t are worth reading as well: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750</a></p> </blockquote> <p>Wow. It looks like there may be a separate question of when each filesystem adopted this new infrastructure?</p> <blockquote> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.</p> </blockquote> <p>The pre-errseq_t problems are beyond our control. There's nothing we can do about that in userspace (except perhaps abandon OS-buffered IO, a big project). We just need to be aware that this problem exists in certain kernel versions and be grateful to Layton for fixing it.</p> <p>The dropped dirty flag problem is something we can and in my view should do something about, whatever we might think about that design choice. As Andrew Gierth pointed out to me in an off-list chat about this, by the time you've reached this state, both PostgreSQL's buffer and the kernel's buffer are clean and might be reused for another block at any time, so your data might be gone from the known universe -- we don't even have the option to rewrite our buffers in general. Recovery is the only option.</p> <p>Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-03-31 13:24:28 </code></pre> <p>On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:</p> <blockquote> <blockquote> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.</p> </blockquote> </blockquote> <p>There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-31 16:13:09 </code></pre> <p>On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:</p> <blockquote> <blockquote> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. 
Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.</p> </blockquote> </blockquote> <p>There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.</p> </blockquote> <p>That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligble benefit.</p> <p>I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.</p> <p>Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.</p> <p>I'll send a patch this week.</p> <hr> <pre><code>From:Tom Lane &lt;tgl(at)sss(dot)pgh(dot)pa(dot)us&gt; Date:2018-03-31 16:38:12 </code></pre> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-04-01 00:20:38 </code></pre> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>That won't fix anything released already, so as per the information gathered something has to be done anyway. The discussion of this thread is spreading quite a lot actually.</p> <p>Handling things at a low-level looks like a better plan for the backend. Tools like pg_basebackup and pg_dump also issue fsync's on the data created, we should do an equivalent for them, with some exit() calls in file_utils.c. As of now failures are logged to stderr but not considered fatal.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-01 00:58:22 </code></pre> <p>On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 31 March 2018 at 21:24, Anthony Iliopoulos &lt;[1]ailiop(at)altatus(dot)com&gt; wrote:</p> <pre><code> On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote: &gt; &gt;&gt; Yeah, I see why you want to PANIC. &gt; &gt; &gt; &gt; Indeed. 
Even doing that leaves question marks about all the kernel &gt; &gt; versions before v4.13, which at this point is pretty much everything &gt; &gt; out there, not even detecting this reliably. This is messy. </code></pre> <blockquote> <pre><code> There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested. </code></pre> </blockquote> <p>That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligble benefit.</p> </blockquote> <p>Those interfaces have been around since the kernel 2.6 times and are rather stable, but I was merely responding to your original post comment regarding having a way of finding out which page(s) failed. I assume that indeed there would be no benefit, especially since those errors are usually not transient (typically they come from hard medium faults), and although a filesystem could theoretically mask the error by allocating a different logical block, I am not aware of any implementation that currently does that.</p> <blockquote> <p>I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.</p> </blockquote> <p>Fully agree, those cases should be sufficiently rare that a complex and possibly non-maintainable solution is not really warranted.</p> <blockquote> <p>Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.</p> </blockquote> <p>I am not entirely sure what you mean here, does Pg really treat write() errors as fatal? Also, the kind of errors that ext4 detects with this option is at the superblock level and govern metadata rather than actual data writes (recall that those are buffered anyway, no actual device IO has to take place at the time of write()).</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-01 01:14:46 </code></pre> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). 
The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive). Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-01 18:24:51 </code></pre> <p>On Fri, Mar 30, 2018 at 10:18 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>... on Linux only.</p> </blockquote> <p>Apparently I was too optimistic. I had looked only at FreeBSD, which keeps the page around and dirties it so we can retry, but the other BSDs apparently don't (FreeBSD changed that in 1999). From what I can tell from the sources below, we have:</p> <pre><code>Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies FreeBSD, Illumos: retrying fsync() after EIO tells the truth </code></pre> <p>Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?</p> <p><a href="http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html">http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html</a> <a href="https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059">https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059</a> <a href="https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867">https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867</a> <a href="https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631">https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631</a> <a href="https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266">https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266</a> <a href="https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441">https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441</a></p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-02 15:03:42 </code></pre> <p>On 2 April 2018 at 02:24, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?</p> </blockquote> <p>I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.</p> <p>I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. 
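<p>For concreteness, the &quot;big hammer&quot; approach under discussion amounts to roughly the following sketch (illustrative only, not PostgreSQL source; checked_fsync() is a made-up helper):</p> <pre><code>/*
 * Sketch of the "big hammer": never retry a failed fsync() against the same
 * descriptor; treat the failure as fatal so the server restarts and replays
 * WAL, the only copy of the data still known to be good.
 */
#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

static void checked_fsync(int fd, const char *path)
{
    if (fsync(fd) == 0)
        return;

    /*
     * Retrying here is exactly the trap discussed in this thread: on Linux
     * the second call will likely report success even though the dirty
     * pages were already dropped.  Crash instead and rely on WAL replay.
     * Whether a possibly-transient ENOSPC should be handled more gently
     * than EIO is the open question.
     */
    fprintf(stderr, "PANIC: fsync of %s failed: %s\n", path, strerror(errno));
    abort();
}
</code></pre>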
I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.</p> <p>BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involve logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post.</p> <p>If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally.</p> <p>If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-02 18:13:46 </code></pre> <p>Hi,</p> <p>On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).</p> </blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> <blockquote> <p>Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.</p> </blockquote> <p>Why do dirty pages need to be kept around in the case of persistent errors? I don't think the lack of automatic recovery in that case is what anybody is complaining about. It's that the error goes away and there's no reasonable way to separate out such an error from some potential transient errors.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-02 18:53:20 </code></pre> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). 
The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).</p> </blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-02 19:32:45 </code></pre> <p>On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Meh^2.</p> <p>&quot;no reason&quot; - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge &quot;we've thrown away your data for unknown reason&quot; seems entirely reasonable.</p> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). 
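<p>Seen from userspace, the semantics being described look roughly like this minimal sketch (it assumes the underlying device starts failing writes, for example via a dm-error target or a pulled disk; details vary by kernel version):</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0600);
    write(fd, "important", 9);

    /* ... the device fails the writeback of that page ... */

    if (fsync(fd) != 0)
        fprintf(stderr, "first fsync: %s\n", strerror(errno));  /* EIO */

    if (fsync(fd) == 0)
        fprintf(stderr, "second fsync: reports success\n");
    /*
     * On post-4.13 Linux the second call can succeed because the first one
     * consumed the error and the failed pages were marked clean: "important"
     * may never reach the disk even though the last fsync() returned 0.
     */
    return 0;
}
</code></pre>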
Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that.</p> </blockquote> <p>Which isn't what I've suggested.</p> <blockquote> <p>Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>Meh.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-02 20:38:06 </code></pre> <p>On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Meh^2.</p> <p>&quot;no reason&quot; - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge &quot;we've thrown away your data for unknown reason&quot; seems entirely reasonable.</p> </blockquote> <p>As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures.</p> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> <p>Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).</p> <hr> <pre><code>From:Stephen Frost &lt;sfrost(at)snowman(dot)net&gt; Date:2018-04-02 20:58:08 </code></pre> <p>Greetings,</p> <p>Anthony Iliopoulos (ailiop(at)altatus(dot)com) wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. 
Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Meh^2.</p> <p>&quot;no reason&quot; - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge &quot;we've thrown away your data for unknown reason&quot; seems entirely reasonable.</p> </blockquote> <p>As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures.</p> </blockquote> <p>fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to &quot;please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful.&quot;</p> <p>Give us a way to ask &quot;are these specific pages written out to persistant storage?&quot; and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.</p> <blockquote> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> </blockquote> <p>We do deal with that error- by realizing that it failed and later <em>retrying</em> the fsync(), which is when we get back an &quot;all good! everything with this file descriptor you've opened is sync'd!&quot; and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.</p> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. 
Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> <blockquote> <p>Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).</p> </blockquote> <p>Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean &quot;run another fsync() to see if the error has resolved itself.&quot; Requiring that to mean &quot;you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything&quot; is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-02 23:05:44 </code></pre> <p>Hi Stephen,</p> <p>On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:</p> <blockquote> <p>fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to &quot;please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful.&quot;</p> <p>Give us a way to ask &quot;are these specific pages written out to persistant storage?&quot; and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.</p> </blockquote> <p>Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.</p> <p>Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:</p> <p>(a) it still messes with the top-level error reporting so mixing it with callers that use fsync() and do care about errors will produce the same issue (clearing the error status).</p> <p>(b) the error-reporting granularity is coarse (failure reporting applies to the entire requested range so you still don't know which particular pages/file sub-ranges failed writeback)</p> <p>(c) the same &quot;report and forget&quot; semantics apply to repeated invocations of the sync_file_range() call, so again action will need to be taken upon first error encountered for the particular ranges.</p> <blockquote> <blockquote> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). 
Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> </blockquote> <p>We do deal with that error- by realizing that it failed and later <em>retrying</em> the fsync(), which is when we get back an &quot;all good! everything with this file descriptor you've opened is sync'd!&quot; and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.</p> </blockquote> <p>It really turns out that this is not how the fsync() semantics work though, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail. Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.</p> <blockquote> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> </blockquote> <p>I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write(). The pattern should be roughly like: write() -&gt; fsync() -&gt; free(), rather than write() -&gt; free() -&gt; fsync(). For example, if a partition gets full upon fsync(), then the application has a chance to persist the data in a different location, while the kernel cannot possibly make this decision and recover.</p> <blockquote> <blockquote> <p>Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).</p> </blockquote> <p>Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean &quot;run another fsync() to see if the error has resolved itself.&quot; Requiring that to mean &quot;you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything&quot; is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.</p> </blockquote> <p>I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above (although having the FS layer recovering from hard errors transparently is open for discussion from what it seems [1]). Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges). 
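<p>A minimal sketch of that write() -&gt; fsync() -&gt; free() pattern, with made-up helper names write_all() and durable_write(), could look like the following; the only point is that the buffer is not discarded until fsync() has succeeded:</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

static int write_all(int fd, const char *buf, size_t len)
{
    while (len &gt; 0) {
        ssize_t n = write(fd, buf, len);
        if (n &lt; 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Returns 0 and frees buf only once the data is known to be on disk. */
int durable_write(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd &lt; 0)
        return -1;

    if (write_all(fd, buf, len) != 0 || fsync(fd) != 0) {
        close(fd);
        return -1;      /* caller still owns buf and can retry elsewhere */
    }

    close(fd);
    free(buf);          /* safe to discard only after a successful fsync() */
    return 0;
}
</code></pre>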
Another issue is that of reporting semantics (report and clear), which is also a design choice made to avoid having higher-resolution error tracking and the corresponding memory overheads [1].</p> <p>[1] <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a></p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-02 23:23:24 </code></pre> <p>On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:</p> </blockquote> <p>To quote sync_file_range(2):</p> <pre><code> Warning This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an over‐ write. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches. </code></pre> <p>Given the lack of metadata safety that seems entirely a no go. We use sfr(2), but only to force the kernel's hand around writing back earlier without throwing away cache contents.</p> <blockquote> <blockquote> <blockquote> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> </blockquote> <p>We do deal with that error- by realizing that it failed and later <em>retrying</em> the fsync(), which is when we get back an &quot;all good! everything with this file descriptor you've opened is sync'd!&quot; and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.</p> </blockquote> <p>It really turns out that this is not how the fsync() semantics work though</p> </blockquote> <p>Except on freebsd and solaris, and perhaps others.</p> <blockquote> <p>, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail.</p> </blockquote> <p>That's not guaranteed at all, think NFS.</p> <blockquote> <p>Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.</p> </blockquote> <p>It's broken behaviour justified post facto with the only rational that was available, which explains why it's so unconvincing. You could just say &quot;this ship has sailed, and it's to onerous to change because xxx&quot; and this'd be a done deal. 
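<p>For reference, the narrow way sfr(2) gets used as described above is roughly this sketch (flush_hint_then_sync() is a made-up name; Linux-specific, see sync_file_range(2)):</p> <pre><code>/*
 * Sketch: use sync_file_range() only to start writeback early so the later
 * fsync() has less work to do; durability of data and metadata still comes
 * from the fsync().
 */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

int flush_hint_then_sync(int fd, off_t offset, off_t nbytes)
{
    /* Scheduling hint only; the return value is deliberately ignored. */
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

    /* ... more writes happen elsewhere ... */

    return fsync(fd);
}
</code></pre>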
But claiming this is reasonable behaviour is ridiculous.</p> <p>Again, you could just continue to error for this fd and still throw away the data.</p> <blockquote> <blockquote> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> </blockquote> <p>I think what you have in mind are the semantics of sync() rather than fsync()</p> </blockquote> <p>If you open the same file with two fds, and write with one, and fsync with another that's definitely supposed to work. And sync() isn't a realistic replacement in any sort of way because it's obviously systemwide, and thus entirely and completely unsuitable. Nor does it have any sort of better error reporting behaviour, does it?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-02 23:27:35 </code></pre> <p>On 3 April 2018 at 07:05, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Hi Stephen,</p> <p>On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:</p> <blockquote> <p>fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to &quot;please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful.&quot;</p> <p>Give us a way to ask &quot;are these specific pages written out to persistant storage?&quot; and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.</p> </blockquote> <p>Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.</p> </blockquote> <p>They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.</p> <p>So I don't buy this argument.</p> <blockquote> <p>It really turns out that this is not how the fsync() semantics work though, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail.</p> </blockquote> <p><em>might</em> simply fail.</p> <p>It depends on why the error ocurred.</p> <p>I originally identified this behaviour on a multipath system. Multipath defaults to &quot;throw the writes away, nobody really cares anyway&quot; on error. It seems to figure a higher level will retry, or the application will receive the error and retry.</p> <p>(See no_path_retry in multipath config. 
AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue).</p> <blockquote> <p>Instead the kernel opts for marking those pages clean (since there is no other recovery strategy),</p> <p>and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.</p> </blockquote> <p>It could mark the FD.</p> <p>It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync.</p> <blockquote> <blockquote> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> </blockquote> <p>I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write().</p> </blockquote> <p>This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it.</p> <p>We provide applications with <em>clear boundaries</em> so they can know <em>exactly</em> what was, and was not, written. I guess the argument from the kernel is the same is true: whatever was written since the last <em>successful</em> fsync is potentially lost and must be redone.</p> <p>But the fsync behaviour is utterly undocumented and dubiously standard.</p> <blockquote> <p>I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above</p> </blockquote> <p>[citation needed]</p> <p>What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours.</p> <blockquote> <p>Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges).</p> </blockquote> <p>Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write.</p> <p>But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO from what I can tell is inconsistent version to version and generally unhelpful. The kernel should really report such sync failures back to the app on its AIO write mapping, but it seems nothing of the sort happens.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-03 00:03:39 </code></pre> <blockquote> <p>On Apr 2, 2018, at 16:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.</p> </blockquote> <p>Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare &quot;What fools we've been!&quot; and push a change... 
and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-03 00:05:09 </code></pre> <p>On April 2, 2018 5:03:39 PM PDT, Christophe Pettus <xof(at)thebuild(dot)com> wrote:</p> <blockquote> <blockquote> <p>On Apr 2, 2018, at 16:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.</p> </blockquote> <p>Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare &quot;What fools we've been!&quot; and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.</p> </blockquote> <p>Don't we pretty much already have agreement in that? And Craig is the main proponent of it?</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-03 00:07:41 </code></pre> <blockquote> <p>On Apr 2, 2018, at 17:05, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Don't we pretty much already have agreement in that? And Craig is the main proponent of it?</p> </blockquote> <p>For sure on the second sentence; the first was not clear to me.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-03 00:48:00 </code></pre> <p>On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <blockquote> <p>Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare &quot;What fools we've been!&quot; and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.</p> </blockquote> <p>Don't we pretty much already have agreement in that? And Craig is the main proponent of it?</p> </blockquote> <p>I wonder how bad it will be in practice if we PANIC. Craig said &quot;This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us&quot;. It would be nice to get more information on that.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-03 01:29:28 </code></pre> <p>On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.</p> <p>I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.</p> </blockquote> <p>Yeah, it'd be nice to give an administrator the chance to free up some disk space after ENOSPC is reported, and stay up. 
Running out of space really shouldn't take down the database without warning! The question is whether the data remains in cache and marked dirty, so that retrying is a safe option (since it's potentially gone from our own buffers, so if the OS doesn't have it the only place your committed data can definitely still be found is the WAL... recovery time). Who can tell us? Do we need a per-filesystem answer? Delayed allocation is a somewhat filesystem-specific thing, so maybe. Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:</p> <pre><code>POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes </code></pre> <p>I don't know if macOS really means it or not; it just tells you to see the errors for read(2) and write(2). By the way, speaking of macOS, I was curious to see if the common BSD heritage would show here. Yeah, somewhat. It doesn't appear to keep buffers on writeback error, if this is the right code [1] (though it could be handling it somewhere else for all I know).</p> <p>[1] <a href="https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695">https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695</a></p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-03 02:54:26 </code></pre> <p>On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Like other people here, I think this is 100% unreasonable, starting with &quot;the dirty pages which cannot been written out are practically thrown away&quot;. Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to throw away the user's data. If the writes are going to fail, then let them keep on failing every time. <em>That</em> wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.</p> <p>Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> <p>Even leaving that aside, a PANIC means a prolonged outage on a production system - it could easily take tens of minutes or longer to run recovery. So saying &quot;oh, just do that&quot; is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time.
We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is <em>necessary</em> for good performance.</p> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-03 03:45:30 </code></pre> <p>On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> </blockquote> <p>I fear that the conventional wisdom from the Kernel people is now &quot;you should be using O_DIRECT for granular control&quot;. The LWN article Thomas linked (<a href="https://lwn.net/Articles/718734">https://lwn.net/Articles/718734</a>) cites Ted Ts'o:</p> <p>&quot;Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; applications that want more information should use O_DIRECT. 
For most applications, knowledge that an error occurred somewhere in the file is all that is necessary; applications that require better granularity already use O_DIRECT.&quot;</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-03 10:35:39 </code></pre> <p>Hi Robert,</p> <p>On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:</p> <blockquote> <p>On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Like other people here, I think this is 100% unreasonable, starting with &quot;the dirty pages which cannot been written out are practically thrown away&quot;. Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to</p> </blockquote> <p>If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches that of the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.</p> <blockquote> <p>throw away the user's data. If the writes are going to fail, then let them keep on failing every time. <em>That</em> wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.</p> </blockquote> <p>I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through. Another process may need to write into the file and fsync, while being unaware of those newly introduced semantics is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.</p> <blockquote> <p>Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. 
This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> </blockquote> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:</p> <p>process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.</p> <p>process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.</p> <p>This would be a highly user-visible change of semantics from edge- triggered to level-triggered behavior.</p> <blockquote> <p>dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then</p> </blockquote> <p>That's the only way to think about fsync() guarantees unless you are on a kernel that keeps retrying to persist dirty pages. Assuming such a model, after repeated and unrecoverable hard failures the process would have to explicitly inform the kernel to drop the dirty pages. All the process could do at that point is read back to userspace the dirty/failed pages and attempt to rewrite them at a different place (which is current possible too). Most applications would not bother though to inform the kernel and drop the permanently failed pages; and thus someone eventually would hit the case that a large amount of failed writeback pages are running his server out of memory, at which point people will complain that those semantics are completely unreasonable.</p> <blockquote> <p>we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is <em>necessary</em> for good performance.</p> </blockquote> <p>Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.</p> <blockquote> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.</p> </blockquote> <p>Again, conflating two separate issues, that of buffering and retrying failed pages and that of error reporting. 
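<p>A minimal sketch of the errseq_t reporting behaviour described above (assuming a 4.13+ kernel and some way to make writeback fail, such as a dm-error target): every descriptor that was open when the error happened sees EIO once, and only once.</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd1 = open("datafile", O_WRONLY | O_CREAT, 0600);
    int fd2 = open("datafile", O_WRONLY);

    write(fd1, "x", 1);
    /* ... writeback of that page fails in the background ... */

    printf("fsync(fd1) = %d\n", fsync(fd1));  /* -1, errno = EIO               */
    printf("fsync(fd2) = %d\n", fsync(fd2));  /* -1, EIO reported here as well */
    printf("fsync(fd1) = %d\n", fsync(fd1));  /* 0: this fd already saw it     */
    return 0;
}
</code></pre>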
Yes it would be convenient for applications not to have to care at all about recovery of failed write-backs, but at some point they would have to face this issue one way or another (I am assuming we are always talking about hard failures, other kinds of failures are probably already being dealt with transparently at the kernel level).</p> <p>As for the reporting, it is also unreasonable to effectively signal and persist an error on a file-wide granularity while it pertains to subsets of that file and other writes can go through, but I am repeating myself.</p> <p>I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-03 11:26:05 </code></pre> <p>On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Hi Robert,</p> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:</p> <p>process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.</p> <p>process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.</p> </blockquote> <p>Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.</p> <p>This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless as that as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.</p> <p>The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program using fsync for its original intended purpose of guaranteeing that the all writes are synced to disk it no longer has any guarantee at all.</p> <blockquote> <p>This would be a highly user-visible change of semantics from edge- triggered to level-triggered behavior.</p> </blockquote> <p>It was always documented as level-triggered. This edge-triggered concept is a completely surprise to application writers.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-03 13:36:47 </code></pre> <p>On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Hi Robert,</p> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>. 
But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:</p> <p>process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.</p> <p>process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.</p> </blockquote> <p>Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.</p> </blockquote> <p>You can't possibly make this assumption. Process B may be reading and writing to completely disjoint regions from those of process A, and as such not really caring about earlier failures, only wanting to ensure its own writes go all the way through. But even if it did care, the file interfaces make no transactional guarantees. Even without fsync() there is nothing preventing process B from reading dirty pages from process A, and based on their content proceed to to its own business and write/persist new data, while process A further modifies the not-yet-flushed pages in-memory before flushing. In this case you'd need explicit synchronization/locking between the processes anyway, so why would fsync() be an exception?</p> <blockquote> <p>This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless as that as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.</p> </blockquote> <p>The errseq_t fixes deal with that; errors will be reported to any process that has an open fd, irrespective to who is the actual caller of the fsync() that may have induced errors. This is anyway required as the kernel may evict dirty pages on its own by doing writeback and as such there needs to be a way to report errors on all open fds.</p> <blockquote> <p>The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program</p> </blockquote> <p>If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error and proceed or some other process will need to leverage that without coordination, or which writes actually failed for that matter. 
We would be back to the case of requiring explicit synchronization between processes that care about this, in which case the processes may as well synchronize over calling fsync() in the first place.</p> <p>Having an opt-in persisting EIO per-fd would practically be a form of &quot;contract&quot; between &quot;cooperating&quot; processes anyway.</p> <p>But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-03 14:29:10 </code></pre> <p>On 3 April 2018 at 10:54, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>I think it's always unreasonable to throw away the user's data.</p> </blockquote> <p>Well, we do that. If a txn aborts, all writes in the txn are discarded.</p> <p>I think that's perfectly reasonable. Though we also promise an all or nothing effect, we make exceptions even there.</p> <p>The FS doesn't offer transactional semantics, but the fsync behaviour can be interpreted kind of similarly.</p> <p>I don't <em>agree</em> with it, but I don't think it's as wholly unreasonable as all that. I think leaving it undocumented is absolutely gobsmacking, and it's dubious at best, but it's not totally insane.</p> <blockquote> <p>If the writes are going to fail, then let them keep on failing every time.</p> </blockquote> <p>Like we do, where we require an explicit rollback.</p> <p>But POSIX may pose issues there, it doesn't really define any interface for that AFAIK. Unless you expect the app to close() and re-open() the file. Replacing one nonstandard issue with another may not be a win.</p> <blockquote> <p><em>That</em> wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.</p> </blockquote> <p>Yep. That's what I expected to happen on unrecoverable I/O errors. Because, y'know, unrecoverable.</p> <p>I was stunned to learn it's not so. And I'm even more amazed to learn that ext4's errors=remount-ro apparently doesn't concern its self with mere user data, and may exhibit the same behaviour - I need to rerun my test case on it tomorrow.</p> <blockquote> <p>Also, this really does make it impossible to write reliable programs.</p> </blockquote> <p>In the presence of multiple apps interacting on the same file, yes. I think that's a little bit of a stretch though.</p> <p>For a single app, you can recover by remembering and redoing all the writes you did.</p> <p>Sucks if your app wants to have multiple processes working together on a file without some kind of journal or WAL, relying on fsync() alone, mind you. But at least we have WAL.</p> <p>Hrm. I wonder how this interacts with wal_level=minimal.</p> <blockquote> <p>Even leaving that aside, a PANIC means a prolonged outage on a prolonged system - it could easily take tens of minutes or longer to run recovery. So saying &quot;oh, just do that&quot; is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time.</p> </blockquote> <p>It's also spikier. 
Users have more issues with latency with short, frequent checkpoints.</p> <blockquote> <p>We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is <em>necessary</em> for good performance.</p> </blockquote> <p>Our double-caching is already plenty bad enough anyway, as well.</p> <p>(Ideally I want to be able to swap buffers between shared_buffers and the OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write out a block, we <em>transfer</em> ownership to the OS. Yeah, I'm dreaming. But we'd sure need to be able to trust the OS not to just forget the block then!)</p> <blockquote> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.</p> </blockquote> <p>Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They report an ERROR on a statement in a txn then go on with life, merrily forgetting that anything was ever wrong.</p> <p>I agree with PostgreSQL's stance that this is wrong. We require an explicit rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable state. This is good.</p> <p>But we're the odd one out there. Almost everyone else does much like what fsync() does on Linux, report the error and forget it.</p> <p>In any case, we're not going to get anyone to backpatch a fix for this into all kernels, so we're stuck working around it.</p> <p>I'll do some testing with ENOSPC tomorrow, propose a patch, report back.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-03 14:37:30 </code></pre> <p>On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error</p> </blockquote> <p>I still don't understand what &quot;clear the error&quot; means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By &quot;clear the error&quot; you mean throw away the dirty pages and revert part of the file to some old data? 
Why would anyone ever want that?</p> <blockquote> <p>But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.</p> </blockquote> <p>Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error reporting system anyways. It just needs to know when all writes have been synced to disk.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-03 16:52:07 </code></pre> <p>On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error</p> </blockquote> <p>I still don't understand what &quot;clear the error&quot; means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By &quot;clear the error&quot; you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that?</p> </blockquote> <p>It means that the responsibility of recovering the data is passed back to the application. The writes may never be able to be written out. How would a kernel deal with that? Either discard the data (and have the writer acknowledge) or buffer the data until reboot and simply risk going OOM. It's not what someone would want, but rather <em>need</em> to deal with, one way or the other. At least on the application-level there's a fighting chance for restoring to a consistent state. The kernel does not have that opportunity.</p> <blockquote> <blockquote> <p>But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.</p> </blockquote> <p>Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error</p> </blockquote> <p>I don't really follow this argument, Pg is admittedly using non-portable interfaces (e.g the sync_file_range()). While it's nice to avoid platform specific hacks, expecting that the POSIX semantics will be consistent across systems is simply a 90's pipe dream. While it would be lovely to have really consistent interfaces for application writers, this is simply not going to happen any time soon.</p> <p>And since those problematic semantics of fsync() appear to be prevalent in other systems as well that are not likely to be changed, you cannot rely on preconception that once buffers are handed over to kernel you have a guarantee that they will be eventually persisted no matter what. (Why even bother having fsync() in that case? The kernel would eventually evict and writeback dirty pages anyway. The point of reporting the error back to the application is to give it a chance to recover - the kernel could repeat &quot;fsync()&quot; itself internally if this would solve anything).</p> <blockquote> <p>reporting system anyways. 
It just needs to know when all writes have been synced to disk.</p> </blockquote> <p>Well, it does know when <em>some</em> writes have <em>not</em> been synced to disk, exactly because the responsibility is passed back to the application. I do realize this puts more burden back to the application, but what would a viable alternative be? Would you rather have a kernel that risks periodically going OOM due to this design decision?</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-03 21:47:01 </code></pre> <p>On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <blockquote> <p>Like other people here, I think this is 100% unreasonable, starting with &quot;the dirty pages which cannot been written out are practically thrown away&quot;. Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to</p> </blockquote> <p>If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches that of the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.</p> </blockquote> <p>Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD, POSIX.1-2001, which on the first system I tested, it did. Also, the summary should be changed from the current &quot;fsync, fdatasync - synchronize a file's in-core state with storage device&quot; by adding &quot;, possibly by randomly undoing some of the changes you think you made to the file&quot;.</p> <blockquote> <p>I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through.</p> </blockquote> <p>No, that's not questionable at all. fsync() doesn't take any argument saying which part of the file you care about, so the kernel is entirely not entitled to assume it knows to which writes a given fsync() call was intended to apply.</p> <blockquote> <p>Another process may need to write into the file and fsync, while being unaware of those newly introduced semantics is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.</p> </blockquote> <p>I don't deny that it's possible that somebody could have an application which is utterly indifferent to the fact that earlier modifications to a file failed due to I/O errors, but is A-OK with that as long as later modifications can be flushed to disk, but I don't think that's a normal thing to want.</p> <blockquote> <blockquote> <p>Also, this really does make it impossible to write reliable programs. 
Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> </blockquote> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>.</p> </blockquote> <p>Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure. What you have (or someone has) basically done here is made an undocumented assumption about which file descriptors might care about a particular error, but it just so happens that PostgreSQL has never conformed to that assumption. You can keep on saying the problem is with our assumptions, but it doesn't seem like a very good guess to me to suppose that we're the only program that has ever made them. The documentation for fsync() gives zero indication that it's edge-triggered, and so complaining that people wouldn't like it if it became level-triggered seems like an ex post facto justification for a poorly-chosen behavior: they probably think (as we did prior to a week ago) that it already is.</p> <blockquote> <p>Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.</p> </blockquote> <p>Well, the way PostgreSQL works today, we typically run with say 8GB of shared_buffers even if the system memory is, say, 200GB. As pages are evicted from our relatively small cache to the operating system, we track which files need to be fsync()'d at checkpoint time, but we don't hold onto the blocks. Until checkpoint time, the operating system is left to decide whether it's better to keep caching the dirty blocks (thus leaving less memory for other things, but possibly allowing write-combining if the blocks are written again) or whether it should clean them to make room for other things. This means that only a small portion of the operating system memory is directly managed by PostgreSQL, while allowing the effective size of our cache to balloon to some very large number if the system isn't under heavy memory pressure.</p> <p>Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that &quot;real men use O_DIRECT&quot; and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. 
That's a political reason, not a technical reason, but it's a reason all the same.</p> <p>Unfortunately, that is going to add a huge amount of complexity, because if we ran with shared_buffers set to a large percentage of system memory, we couldn't allocate large chunks of memory for sorts and hash tables from the operating system any more. We'd have to allocate it from our own shared_buffers because that's basically all the memory there is and using substantially more might run the system out entirely. So it's a huge, huge architectural change. And even once it's done it is in some ways inferior to what we are doing today -- true, it gives us superior control over writeback timing, but it also makes PostgreSQL play less nicely with other things running on the same machine, because now PostgreSQL has a dedicated chunk of whatever size it has, rather than using some portion of the OS buffer cache that can grow and shrink according to memory needs both of other parts of PostgreSQL and other applications on the system.</p> <blockquote> <p>I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.</p> </blockquote> <p>That would certainly be better than nothing.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-03 23:59:27 </code></pre> <p>On Tue, Apr 3, 2018 at 1:29 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:</p> <p>POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no Illumos/Solaris, Linux, macOS: yes</p> </blockquote> <p>Oops, reading comprehension fail. POSIX yes (since issue 5), via the note that read() and write()'s error conditions can also be returned.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 00:56:37 </code></pre> <p>On Tue, Apr 3, 2018 at 05:47:01PM -0400, Robert Haas wrote:</p> <blockquote> <p>Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure.</p> </blockquote> <p>There has been a lot of focus in this thread on the workflow:</p> <pre><code>write() -&gt; blocks remain in kernel memory -&gt; fsync() -&gt; panic? </code></pre> <p>But what happens in this workflow:</p> <pre><code>write() -&gt; kernel syncs blocks to storage -&gt; fsync() </code></pre> <p>Is fsync() going to see a &quot;kernel syncs blocks to storage&quot; failure?</p> <p>There was already discussion that if the fsync() causes the &quot;syncs blocks to storage&quot;, fsync() will only report the failure once, but will it see any failure in the second workflow? 
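</p>

<p>(Editorial aside: a minimal sketch of Bruce's second workflow, with a hypothetical path and no error injection. The Linux-specific sync_file_range() call is used here only to force writeback to happen before the fsync(), which is the timing being asked about; whether a failure in that earlier writeback is still visible to the later fsync() is exactly the open question.)</p>

<pre><code>/* Sketch of: write() -&gt; kernel writes blocks to storage -&gt; fsync().
 * Assumes a hypothetical path; nothing here injects a failure. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;/tmp/example-datafile&quot;, O_WRONLY | O_CREAT, 0644);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    write(fd, &quot;x&quot;, 1);                     /* 1. dirty a page in the cache */

    /* 2. push the dirty page to storage without calling fsync() yet,
     *    standing in for the kernel's own background writeback.        */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        perror(&quot;sync_file_range&quot;);

    if (fsync(fd) != 0)                    /* 3. is a failure from step 2 */
        perror(&quot;fsync&quot;);                   /*    reported here?           */

    close(fd);
    return 0;
}
</code></pre>

<p>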
There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?</p> <p>You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 01:54:50 </code></pre> <p>On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>There has been a lot of focus in this thread on the workflow:</p> <pre><code> write() -&gt; blocks remain in kernel memory -&gt; fsync() -&gt; panic? </code></pre> <p>But what happens in this workflow:</p> <pre><code> write() -&gt; kernel syncs blocks to storage -&gt; fsync() </code></pre> <p>Is fsync() going to see a &quot;kernel syncs blocks to storage&quot; failure?</p> <p>There was already discussion that if the fsync() causes the &quot;syncs blocks to storage&quot;, fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?</p> <p>You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.</p> </blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that &quot;bug #1&quot;.</p> <p>The second issues is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that &quot;bug #2&quot;, with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 02:05:19 </code></pre> <p>On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>There has been a lot of focus in this thread on the workflow:</p> <pre><code> write() -&gt; blocks remain in kernel memory -&gt; fsync() -&gt; panic? </code></pre> <p>But what happens in this workflow:</p> <pre><code> write() -&gt; kernel syncs blocks to storage -&gt; fsync() </code></pre> <p>Is fsync() going to see a &quot;kernel syncs blocks to storage&quot; failure?</p> <p>There was already discussion that if the fsync() causes the &quot;syncs blocks to storage&quot;, fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?</p> <p>You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.</p> </blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. 
Call that &quot;bug #1&quot;.</p> </blockquote> <p>So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.</p> <blockquote> <p>The second issues is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that &quot;bug #2&quot;, with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.</p> </blockquote> <p>Yes, that one I understood.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 02:14:28 </code></pre> <p>On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:</p> <blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that &quot;bug #1&quot;.</p> </blockquote> <p>So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.</p> </blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 02:40:16 </code></pre> <p>On 4 April 2018 at 05:47, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that &quot;real men use O_DIRECT&quot; and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. That's a political reason, not a technical reason, but it's a reason all the same.</p> </blockquote> <p>I looked into buffered AIO a while ago, by the way, and just ... hell no. Run, run as fast as you can.</p> <p>The trouble with direct I/O is that it pushes a <em>lot</em> of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.</p> <p>We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 02:44:22 </code></pre> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:</p> <blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. 
Call that &quot;bug #1&quot;.</p> </blockquote> <p>So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.</p> </blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p><a href="https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682">https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682</a></p> <blockquote> <p>When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven't been any).</p> </blockquote> <p>But I'm not sure what the lifetime of the passed-in &quot;file&quot; and more importantly &quot;file-&gt;f_wb_err&quot; is. Specifically, what happens to it if no one has the file open at all, between operations? It is reference counted, see fs/file_table.c. I don't know enough about it to comment.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 05:29:28 </code></pre> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <pre><code>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping); </code></pre> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. 
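</p>

<p>(Editorial aside: a minimal sketch of the open/write/close-then-reopen/fsync pattern being described, with a hypothetical path and the actual error injection, e.g. via device-mapper, left out; the comments restate the concern raised here rather than guaranteed behaviour.)</p>

<pre><code>/* One process writes and closes; a separate &quot;checkpointer&quot; opens the
 * file later and fsync()s it to make the earlier writes durable.  Per
 * the open() snippet quoted below, a writeback error that occurs
 * before the second open() is not reported to that fd's fsync(). */
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    if (fork() == 0) {                     /* &quot;backend&quot;: write, no fsync */
        int wfd = open(&quot;/tmp/example-datafile&quot;, O_WRONLY | O_CREAT, 0644);
        write(wfd, &quot;dirty page&quot;, 10);
        close(wfd);
        _exit(0);
    }
    wait(NULL);

    /* ... time passes; the kernel may attempt, and fail, writeback ... */

    int cfd = open(&quot;/tmp/example-datafile&quot;, O_WRONLY); /* &quot;checkpointer&quot; */
    if (fsync(cfd) == 0)
        printf(&quot;fsync says success - but did the earlier write make it?\n&quot;);
    close(cfd);
    return 0;
}
</code></pre>

<p>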
But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> <p>If so I'm not sure how that can possibly be considered to be an implementation of _POSIX_SYNCHRONIZED_IO: &quot;the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state.&quot; Note &quot;the file&quot;, not &quot;this file descriptor + copies&quot;, and without reference to when you opened it.</p> <blockquote> <p>But I'm not sure what the lifetime of the passed-in &quot;file&quot; and more importantly &quot;file-&gt;f_wb_err&quot; is.</p> </blockquote> <p>It's really inode-&gt;i_mapping-&gt;wb_err's lifetime that I should have been asking about there, not file-&gt;f_wb_err, but I see now that that question is irrelevant due to the above.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 06:00:21 </code></pre> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> <p>I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. 
Docker in privileged mode doesn't seem to play nice with device-mapper.</p> <p>Does that mean that the ONLY ways to do reliable I/O are:</p> <ul> <li>single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()</li> <li>direct I/O</li> </ul> <p>?</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 07:32:04 </code></pre> <p>On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p>[...]</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> </blockquote> <p>Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?</p> <blockquote> <p>Does that mean that the ONLY ways to do reliable I/O are:</p> <ul> <li>single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()</li> </ul> </blockquote> <p>I suppose you could some up with some crazy complicated IPC scheme to make sure that the checkpointer always has an fd older than any writes to be flushed, with some fallback strategy for when it can't take any more fds.</p> <p>I haven't got any good ideas right now.</p> <blockquote> <ul> <li>direct I/O</li> </ul> </blockquote> <p>As a bit of an aside, I gather that when you resize files (think truncating/extending relation files) you still need to call fsync() even if you read/write all data with O_DIRECT, to make it flush the filesystem meta-data. I have no idea if that could also be affected by eaten writeback errors.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 07:51:53 </code></pre> <p>On 4 April 2018 at 14:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. 
Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> <p>I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.</p> </blockquote> <p>Done, you can find it in <a href="https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear">https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear</a> now.</p> <p>Warning, this runs a Docker container in privileged mode on your system, and it uses devicemapper. Read it before you run it, and while I've tried to keep it safe, beware that it might eat your system.</p> <p>For now it tests only xfs and EIO. Other FSs should be easy enough.</p> <p>I haven't added coverage for multi-processing yet, but given what you found above, I should. I'll probably just system() a copy of the same proc with instructions to only fsync(). I'll do that next.</p> <p>I haven't worked out a reliable way to trigger ENOSPC on fsync() yet, when mapping without the error hole. It happens sometimes but I don't know why, it almost always happens on write() instead. I know it can happen on nfs, but I'm hoping for a saner example than that to test with. ext4 and xfs do delayed allocation but eager reservation so it shouldn't happen to them.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 13:49:38 </code></pre> <p>On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p>[...]</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> </blockquote> <p>Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?</p> </blockquote> <p>Uh, just to clarify, what is new here is that it is ignoring any <em>errors</em> that happened before the open(). 
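</p>

<p>(Editorial aside: a toy userspace model of the sample-at-open / check-and-advance-at-fsync scheme, written to reproduce both halves of the clarification here; it illustrates the described behaviour and is not the kernel's code.)</p>

<pre><code>/* Toy model: the &quot;inode&quot; keeps a sequence-stamped last writeback error,
 * each &quot;fd&quot; samples that stamp when opened, and &quot;fsync&quot; reports only
 * errors stamped after the fd's sample - and only once. */
#include &lt;stdio.h&gt;

struct inode_state { int err; unsigned seq; }; /* like mapping-&gt;wb_err */
struct fd_state    { unsigned sampled_seq; };  /* like file-&gt;f_wb_err  */

static void writeback_failed(struct inode_state *in, int err)
{
    in-&gt;err = err;
    in-&gt;seq++;                                 /* a new, unseen error  */
}

static struct fd_state fd_open(const struct inode_state *in)
{
    /* &quot;skip any errors that predate opening of the file&quot; */
    struct fd_state f = { in-&gt;seq };
    return f;
}

static int fd_fsync(struct fd_state *f, const struct inode_state *in)
{
    if (in-&gt;seq == f-&gt;sampled_seq)
        return 0;                              /* nothing new: success */
    f-&gt;sampled_seq = in-&gt;seq;                  /* report once, advance */
    return in-&gt;err;
}

int main(void)
{
    struct inode_state in = { 0, 0 };

    writeback_failed(&amp;in, 5);                  /* EIO before any open() */
    struct fd_state late = fd_open(&amp;in);
    printf(&quot;fsync on fd opened after the error: %d\n&quot;, fd_fsync(&amp;late, &amp;in));

    struct fd_state early = fd_open(&amp;in);
    writeback_failed(&amp;in, 5);                  /* EIO while fd is open  */
    printf(&quot;first fsync:  %d\n&quot;, fd_fsync(&amp;early, &amp;in));
    printf(&quot;second fsync: %d\n&quot;, fd_fsync(&amp;early, &amp;in));
    return 0;
}
</code></pre>

<p>The dirty data itself is a separate question: writes that are still sitting dirty in the cache when the new fd is opened will be pushed out by that fd's fsync() as usual, which is the distinction being drawn here.</p>

<p>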
It is not ignoring write()'s that happened but have not been written to storage before the open().</p> <p>FYI, pg_test_fsync has always tested the ability to fsync() writes() from other processes:</p> <pre><code>Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 5360.341 ops/sec 187 usecs/op write, close, fsync 4785.240 ops/sec 209 usecs/op </code></pre> <p>Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.</p> <p>I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.</p> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.</p> <p>Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(</p> <p>The last time I remember being this surprised about storage was in the early Postgres years when we learned that just because the BSD file system uses 8k pages doesn't mean those are atomically written to storage. We knew the operating system wrote the data in 8k chunks to storage but:</p> <ul> <li>the 8k pages are written as separate 512-byte sectors</li> <li>the 8k might be contiguous logically on the drive but not physically</li> <li>even 512-byte sectors are not written atomically</li> </ul> <p>This is why we added pre-page images to WAL, which is what full_page_writes controls.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 13:53:01 </code></pre> <p>On Wed, Apr 4, 2018 at 10:40:16AM +0800, Craig Ringer wrote:</p> <blockquote> <p>The trouble with direct I/O is that it pushes a <em>lot</em> of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.</p> <p>We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare.
Hey, why bother having a kernel at all, except for drivers?</p> </blockquote> <p>I believe this is how Oracle views the kernel, so there is precedent for this approach, though I am not advocating it.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 14:00:15 </code></pre> <p>On 4 April 2018 at 15:51, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 14:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> <p>I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.</p> </blockquote> <p>Done, you can find it in <a href="https://github.com/ringerc/scrapcode/tree/master/">https://github.com/ringerc/scrapcode/tree/master/</a> testcases/fsync-error-clear now.</p> </blockquote> <p>Update. Now supports multiple FSes.</p> <p>I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.</p> <p>Still working on getting ENOSPC on fsync() rather than write(). Kernel code reading suggests this is possible, but all the above FSes reserve space eagerly on write( ) even if they do delayed allocation of the actual storage, so it doesn't seem to happen at least in my simple single-process test.</p> <p>I'm not overly inclined to complain about a fsync() succeeding after a write() error. That seems reasonable enough, the kernel told the app at the time of the failure. What else is it going to do? 
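</p>

<p>(Editorial aside: a minimal sketch of what taking the error &quot;at the time of the failure&quot; means for write() itself: handle short writes, EINTR and an immediate error return; it deliberately says nothing about failures that only surface later, at writeback or fsync() time.)</p>

<pre><code>#include &lt;errno.h&gt;
#include &lt;unistd.h&gt;

/* Returns 0 on success, -1 with errno set if the kernel rejected part
 * of the buffer at write() time (e.g. an up-front ENOSPC or EIO). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len &gt; 0) {
        ssize_t n = write(fd, buf, len);
        if (n &lt; 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry   */
            return -1;             /* reported now, while we still hold buf */
        }
        buf += n;                  /* short write: continue with the rest */
        len -= (size_t) n;
    }
    return 0;
}
</code></pre>

<p>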
I don't personally even object hugely to the current fsync() behaviour if it were, say, DOCUMENTED and conformant to the relevant standards, though not giving us any sane way to find out the affected file ranges makes it drastically harder to recover sensibly.</p> <p>But what's come out since on this thread is that we cannot even rely on fsync() giving us an EIO <em>once</em> when it loses our data, because:</p> <ul> <li>all currently widely deployed kernels can fail to deliver info due to a recently fixed limitation; and</li> <li>the kernel deliberately hides errors from us if they relate to writes that occurred before we opened the FD (?)</li> </ul> <p>... that's really troubling. I thought we could at least fix this by PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we can't even do that and expect reliability. So what the @#$ are we meant to do?</p> <p>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 14:09:09 </code></pre> <p>On 4 April 2018 at 22:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.</p> </blockquote> <p>Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.</p> <p>Nor is ext3/ext4's errors=remount-ro protective.
data_err=abort doesn't help either (so what does it do?).</p> </blockquote> <p>Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 14:42:18 </code></pre> <p>On 4 April 2018 at 22:25, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 4 April 2018 at 22:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <pre><code>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly. </code></pre> <p>Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.</p> <p>Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).</p> </blockquote> <p>Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.</p> </blockquote> <p>Yep, I gathered. I was referring to data_err.</p> <hr> <pre><code>From:Antonis Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-04 15:23:31 </code></pre> <p>On Wed, Apr 4, 2018 at 4:42 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 22:25, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 4 April 2018 at 22:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <p>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.</p> <p>Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.</p> <p>Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).</p> </blockquote> <p>Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.</p> </blockquote> <p>Yep, I gathered. I was referring to data_err.</p> </blockquote> <p>As far as I recall data_err=abort pertains to the jbd2 handling of potential writeback errors. Jbd2 will inetrnally attempt to drain the data upon txn commit (and it's even kind enough to restore the EIO at the address space level, that otherwise would get eaten).</p> <p>When data_err=abort is set, then jbd2 forcibly shuts down the entire journal, with the error being propagated upwards to ext4. 
I am not sure at which point this would be manifested to userspace and how, but in principle any subsequent fs operations would get some filesystem error due to the journal being down (I would assume similar to remounting the fs read-only).</p> <p>Since you are using data=journal, I would indeed expect to see something more than what you saw in dmesg.</p> <p>I can have a look later, I plan to also respond to some of the other interesting issues that you guys raised in the thread.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 15:23:51 </code></pre> <p>On 4 April 2018 at 21:49, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p>[...]</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> </blockquote> <p>Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?</p> </blockquote> <p>Uh, just to clarify, what is new here is that it is ignoring any <em>errors</em> that happened before the open(). It is not ignoring write()'s that happened but have not been written to storage before the open().</p> <p>FYI, pg_test_fsync has always tested the ability to fsync() writes() from other processes:</p> <pre><code>Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 5360.341 ops/sec 187 usecs/op write, close, fsync 4785.240 ops/sec 209 usecs/op </code></pre> <p>Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.</p> <p>I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.</p> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?</p> </blockquote> <p>Ideally until the app tells it not to.</p> <p>But there's no standard API for that.</p> <p>The obvious answer seems to be &quot;until the FD is closed&quot;. But we just discussed how Pg relies on being able to open and close files freely. That may not be as reasonable a thing to do as we thought it was when you consider error reporting. What's the kernel meant to do? How long should it remember &quot;I had an error while doing writeback on this file&quot;? Should it flag the file metadata and remember across reboots? Obviously not, but where does it stop? Tell the next program that does an fsync() and forget? How could it associate a dirty buffer on a file with no open FDs with any particular program at all?
And what if the app did a write then closed the file and went away, never to bother to check the file again, like most apps do?</p> <p>Some I/O errors are transient (network issue, etc). Some are recoverable with some sort of action, like disk space issues, but may take a long time before an admin steps in. Some are entirely unrecoverable (disk 1 in striped array is on fire) and there's no possible recovery. Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.</p> <p>That does leave us in a pickle when it comes to the checkpointer and opening/closing FDs. I don't know what the &quot;right&quot; thing for the kernel to do from our perspective even is here, but the best I can come up with is actually pretty close to what it does now. Report the fsync() error to the first process that does an fsync() since the writeback error if one has occurred, then forget about it. Ideally I'd have liked it to mark all FDs pointing to the file with a flag to report EIO on next fsync too, but it turns out that won't even help us due to our opening and closing behaviour, so we're going to have to take responsibility for handling and communicating that ourselves, preventing checkpoint completion if any backend gets an fsync error. Probably by PANICing. Some extra work may be needed to ensure reliable ordering and stop checkpoints completing if their fsync() succeeds due to a recent failed fsync() on a normal backend that hasn't PANICed or where the postmaster hasn't noticed yet.</p> <blockquote> <p>Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(</p> </blockquote> <p>Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show &quot;Buffer I/O error on device dm-0, logical block 59393&quot; or the like.</p> <hr> <pre><code>From:Gasper Zejn &lt;zejn(at)owca(dot)info&gt; Date:2018-04-04 17:23:58 </code></pre> <p>On 04. 04. 2018 15:49, Bruce Momjian wrote:</p> <blockquote> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.</p> </blockquote> <p>Ideally the kernel would keep its data for as little time as possible. With fsync, it doesn't really know which process is interested in knowing about a write error, it just assumes the caller will know how to deal with it. The most unfortunate issue is that there's no way to get information about a write error.</p> <p>Thinking aloud - couldn't/shouldn't a write error also be a file system event reported by inotify?
Admittedly that's only a thing on Linux, but still.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 17:51:03 </code></pre> <p>On Wed, Apr 4, 2018 at 11:23:51PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 4 April 2018 at 21:49, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?</p> </blockquote> <p>Ideally until the app tells it not to.</p> <p>But there's no standard API for that.</p> </blockquote> <p>You would almost need an API that registers <em>before</em> the failure that you care about sync failures, and that you plan to call fsync() to gather such information. I am not sure how you would allow more than the first fsync() to see the failure unless you added <em>another</em> API to clear the fsync failure, but I don't see the point since the first fsync() might call that clear function. How many applications are going to know there is <em>another</em> application that cares about the failure? Not many.</p> <blockquote> <p>Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.</p> </blockquote> <p>Agreed.</p> <blockquote> <blockquote> <p>Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(</p> </blockquote> <p>Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show &quot;Buffer I/O error on device dm-0, logical block 59393&quot; or the like.</p> </blockquote> <p>I assume that is the kernel logs. I am thinking the kernel logs have to be monitored, but how many administrators do that? The other issue I think you are pointing out is how is the administrator going to know this is a Postgres file? I guess any sync error to a device that contains Postgres has to assume Postgres is corrupted. :-(</p> <hr> <p>see explicit treatment of retrying, though I'm not entirely sure if the retry flag is set just for async write-back), and apparently unlike every other kernel I've tried to grok so far (things descended from ancestral BSD but not descended from FreeBSD, with macOS/Darwin apparently in the first category for this purpose).</p> <p>Here's a new ticket in the NetBSD bug database for this stuff:</p> <p><a href="http://gnats.netbsd.org/53152">http://gnats.netbsd.org/53152</a></p> <p>As mentioned in that ticket and by Andres earlier in this thread, keeping the page dirty isn't the only strategy that would work and may be problematic in different ways (it tells the truth but floods your cache with unflushable stuff until eventually you force unmount it and your buffers are eventually invalidated after ENXIO errors? I don't know.). I have no qualified opinion on that. I just know that we need a way for fsync() to tell the truth about all preceding writes or our checkpoints are busted.</p> <p>*We mmap() + msync() in pg_flush_data() if you don't have sync_file_range(), and I see now that that is probably not a great idea on ZFS because you'll finish up double-buffering (or is that triple-buffering?), flooding your page cache with transient data. Oops. 
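</p>

<p>(Editorial aside: a minimal sketch of the mmap()+msync() flush hint referred to above, assuming fd is open read-write, offset is page-aligned and the range was already written with write(); it only encourages early writeback and says nothing about durability or error reporting.)</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/types.h&gt;

/* Hint the kernel to start writing back an already-written range. */
static void flush_hint(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED) { perror(&quot;mmap&quot;); return; }

    if (msync(p, len, MS_ASYNC) != 0)  /* schedule writeback, don't wait */
        perror(&quot;msync&quot;);

    munmap(p, len);
}
</code></pre>

<p>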
That is off-topic and not relevant for the checkpoint correctness topic of this thread though, since pg_flush_data() is advisory only.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 22:14:24 </code></pre> <p>On Thu, Apr 5, 2018 at 9:28 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.</p> </blockquote> <p>While contemplating what exactly it would do (not sure),</p> </blockquote> <p>See manual for failmode=wait | continue | panic. Even &quot;continue&quot; returns EIO to all new write requests, so they apparently didn't bother to supply an 'eat-my-data-but-tell-me-everything-is-fine' mode. Figures.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-05 07:09:57 </code></pre> <p>Summary to date:</p> <p>It's worse than I thought originally, because:</p> <ul> <li>Most widely deployed kernels have cases where they don't tell you about losing your writes at all; and</li> <li>Information about loss of writes can be masked by closing and re-opening a file</li> </ul> <p>So the checkpointer cannot trust that a successful fsync() means ... a successful fsync().</p> <p>Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.</p> <p>There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.</p> <p>I previously thought that errors=remount-ro was a sufficient safeguard. It isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or xfs.</p> <p>It's not clear to me yet why data_err=abort isn't sufficient in data=ordered or data=writeback mode on ext3 or ext4, needs more digging. (In my test tools that's: make FSTYPE=ext4 MKFSOPTS=&quot;&quot; MOUNTOPTS=&quot;errors=remount-ro, data_err=abort,data=journal&quot; as of the current version d7fe802ec). AFAICS that's because data_error=abort only affects data=ordered, not data=journal. If you use data=ordered, you at least get retries of the same write failing. This post <a href="https://lkml.org/lkml/2008/10/10/80">https://lkml.org/lkml/2008/10/10/80</a> added the option and has some explanation, but doesn't explain why it doesn't affect data=journal.</p> <p>zfs is probably not affected by the issues, per Thomas Munro. I haven't run my test scripts on it yet because my kernel doesn't have zfs support and I'm prioritising the multi-process / open-and-close issues.</p> <p>So far none of the FSes and options I've tried exhibit the behaviour I actually want, which is to make the fs readonly or inaccessible on I/O error.</p> <p>ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems.
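<p>(A minimal sketch, not from the thread, of what explicit space reservation looks like from the application side: posix_fallocate() asks the filesystem to allocate the blocks up front and reports ENOSPC immediately and synchronously, instead of leaving it to surface later during writeback. How far the reservation is honoured is filesystem-dependent, and NFS in particular may not guarantee it end-to-end, so treat this as a mitigation sketch rather than a fix. The file name is illustrative.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;segment&quot;, O_RDWR | O_CREAT, 0600);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    /* Ask for a real on-disk reservation for the first 16 MB.  Unlike a
     * buffered write(), this fails up front if the space is not there. */
    int rc = posix_fallocate(fd, 0, 16 * 1024 * 1024);
    if (rc != 0)  /* returns an error number rather than -1/errno */
        fprintf(stderr, &quot;preallocation failed: %s\n&quot;, strerror(rc));

    close(fd);
    return 0;
}
</code></pre>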
This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().</p> <p>I think what we really need is a block-layer fix, where an I/O error flips the block device into read-only mode, as if blockdev --setro had been used. Though I'd settle for a kernel panic, frankly. I don't think anybody really wants this, but I'd rather either of those to silent data loss.</p> <p>I'm currently tweaking my test to do some close and reopen the file between each write() and fsync(), and to support running with nfs.</p> <p>I've also just found the device-mapper &quot;flakey&quot; driver, which looks fantastic for simulating unreliable I/O with intermittent faults. I've been using the &quot;error&quot; target in a mapping, which lets me remap some of the device to always error, but &quot;flakey&quot; looks very handy for actual PostgreSQL testing.</p> <p>For the sake of Google, these are errors known to be associated with the problem:</p> <p>ext4, and ext3 mounted with ext4 driver:</p> <pre><code>[42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393) [42084.327352] Buffer I/O error on device dm-0, logical block 59393 </code></pre> <p>xfs:</p> <pre><code>[42193.771367] XFS (dm-0): writeback error on sector 118784 [42193.784477] XFS (dm-0): writeback error on sector 118784 </code></pre> <p>jfs: (nil, silence in the kernel logs)</p> <p>You should also beware of &quot;lost page write&quot; or &quot;lost write&quot; errors.</p> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-05 08:46:08 </code></pre> <p>On 5 April 2018 at 15:09, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.</p> </blockquote> <p>I just confirmed this with a tweak to the test that</p> <pre><code>records the file position close()s the fd sync()s open(s) the file lseek()s back to the recorded position </code></pre> <p>This causes the test to completely ignore the I/O error, which is not reported to it at any time.</p> <p>Fair enough, really, when you look at it from the kernel's point of view. What else can it do? Nobody has the file open. It'd have to mark the file its self as bad somehow. But that's pretty bad for our robustness AFAICS.</p> <blockquote> <p>There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.</p> </blockquote> <p>If dirty writeback fails between our close() and re-open() I see the same behaviour as with sync(). To test that I set dirty_writeback_centisecs and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between close() and open(). (It's still plenty slow). So sync() is a convenient way to simulate something other than our own fsync() writing out the dirty buffer.</p> <p>If I omit the sync() then we get the error reported by fsync() once when we re open() the file and fsync() it, because the buffers weren't written out yet, so the error wasn't generated until we re-open()ed the file. 
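<p>(For readers following along, here is a stripped-down sketch of the failing sequence described above. It is not the author's actual test program; it assumes the file sits on a device that has been set up to fail writeback, for example with the dm &quot;error&quot; target, and all error handling is omitted.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    char page[8192];
    memset(page, 'x', sizeof page);

    /* Backend-style write: dirty the page cache, then close without fsync. */
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT, 0600);
    write(fd, page, sizeof page);
    close(fd);

    /* Someone else (kernel writeback, another process, the sync command)
     * flushes the dirty page while nobody has the file open.  The write
     * fails, and the error is reported to no one. */
    sync();

    /* Checkpointer-style fsync through a freshly opened descriptor.  On the
     * kernels discussed in this thread it returns success even though the
     * data never reached storage. */
    fd = open(&quot;datafile&quot;, O_WRONLY);
    if (fsync(fd) == 0)
        printf(&quot;fsync reports success; the lost write is invisible\n&quot;);
    close(fd);
    return 0;
}
</code></pre>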
But I doubt that'll happen much in practice because dirty writeback will get to it first so the error will be seen and discarded before we reopen the file in the checkpointer.</p> <p>In other words, it looks like <em>even with a new kernel with the error reporting bug fixes</em>, if I understand how the backends and checkpointer interact when it comes to file descriptors, we're unlikely to notice I/O errors and fail a checkpoint. We may notice I/O errors if a backend does its own eager writeback for large I/O operations, or if the checkpointer fsync()s a file before the kernel's dirty writeback gets around to trying to flush the pages that will fail.</p> <p>I haven't tested anything with multiple processes / multiple FDs yet, where we keep one fd open while writing on another.</p> <p>But at this point I don't see any way to make Pg reliably detect I/O errors and fail a checkpoint then redo and retry. To even fix this by PANICing like I proposed originally, we need to know we have to PANIC.</p> <p>AFAICS it's completely unsafe to write(), close(), open() and fsync() and expect that the fsync() makes any promises about the write(). Which if I read Pg's low level storage code right, makes it completely unable to reliably detect I/O errors.</p> <p>When put it that way, it sounds fair enough too. How long is the kernel meant to remember that there was a write error on the file triggered by a write initiated by some seemingly unrelated process, some unbounded time ago, on a since-closed file?</p> <p>But it seems to put Pg on the fast track to O_DIRECT.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-05 19:33:14 </code></pre> <p>On Thu, Apr 5, 2018 at 03:09:57PM +0800, Craig Ringer wrote:</p> <blockquote> <p>ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().</p> </blockquote> <p>This does explain why NFS has a reputation for unreliability for Postgres.</p> <hr> <pre><code>From:Andrew Gierth &lt;andrew(at)tao11(dot)riddles(dot)org(dot)uk&gt; Date:2018-04-05 23:37:42 </code></pre> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-06 01:27:05 </code></pre> <p>On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:</p> <blockquote> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> </blockquote> <p>Yikes. 
For other readers, the related thread for this is</p> <p>Meanwhile, I've extended my test to run postgres on a deliberately faulty volume and confirmed my results there.</p> <pre><code>2018-04-06 01:11:40.555 UTC [58] LOG: checkpoint starting: immediate force wait 2018-04-06 01:11:40.567 UTC [58] ERROR: could not fsync file &quot;base/12992/16386&quot;: Input/output error 2018-04-06 01:11:40.655 UTC [66] ERROR: checkpoint request failed 2018-04-06 01:11:40.655 UTC [66] HINT: Consult recent messages in the server log for details. 2018-04-06 01:11:40.655 UTC [66] STATEMENT: CHECKPOINT Checkpoint failed with checkpoint request failed HINT: Consult recent messages in the server log for details. Retrying 2018-04-06 01:11:41.568 UTC [58] LOG: checkpoint starting: immediate force wait 2018-04-06 01:11:41.614 UTC [58] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.046 s; sync files=3, longest=0.000 s, average=0.000 s; distance=2727 kB, estimate=2779 kB </code></pre> <p>Given your report, now I have to wonder if we even reissued the fsync() at all this time. 'perf' time. OK, with</p> <pre><code>sudo perf record -e syscalls:sys_enter_fsync,syscalls:sys_exit_fsync -a sudo perf script </code></pre> <p>I see the failed fync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.</p> <pre><code> postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb ... postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0 </code></pre> <p>... and Pg continues merrily on its way without realising it lost data:</p> <pre><code>[72379.834872] XFS (dm-0): writeback error on sector 118752 [72380.324707] XFS (dm-0): writeback error on sector 118688 </code></pre> <p>In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-06 02:53:56 </code></pre> <p>On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:</p> <blockquote> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> </blockquote> <p>Yikes. For other readers, the related thread for this is</p> </blockquote> <p>Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.</p> <blockquote> <p>I see the failed fync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.</p> <pre><code> postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: </code></pre> <p>0x00000005 postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb ... 
postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0</p> <p>... and Pg continues merrily on its way without realising it lost data:</p> <p>[72379.834872] XFS (dm-0): writeback error on sector 118752 [72380.324707] XFS (dm-0): writeback error on sector 118688</p> <p>In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.</p> </blockquote> <p>I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched. One thing you could try to confirm our understand of the Linux 4.13+ policy would be to hack PostgreSQL so that it reopens the file descriptor every time in mdsync(). See attached.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-06 03:20:22 </code></pre> <p>On 6 April 2018 at 10:53, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:</p> <blockquote> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> </blockquote> <p>Yikes. For other readers, the related thread for this is news-spur.riddles.org.uk</p> </blockquote> <p>Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.</p> <blockquote> <p>I see the failed fync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.</p> <pre><code> postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: </code></pre> <p>0x00000005 postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb ... postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0</p> <p>... and Pg continues merrily on its way without realising it lost data:</p> <p>[72379.834872] XFS (dm-0): writeback error on sector 118752 [72380.324707] XFS (dm-0): writeback error on sector 118688</p> <p>In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. 
But if we don't see the error at all we're in trouble.</p> </blockquote> <p>I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched.</p> </blockquote> <p>In this case I think it's because the kernel didn't get around to doing the writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't even queue it for writeback from our own shared_buffers until just before we fsync()'d it. After all, it's a contrived test case that tries to reproduce the issue rapidly with big writes and frequent checkpoints.</p> <p>So the checkpointer had the relation open to fsync() it, and it was the checkpointer's fsync() that did writeback on the dirty page and noticed the error.</p> <p>If we the kernel had done the writeback before the checkpointer opened the relation to fsync() it, we might not have seen the error at all - though as you note this depends on the file descriptor cache. You can see the silent-error behaviour in my standalone test case where I confirmed the post-4.13 behaviour. (I'm on 4.14 here).</p> <p>I can try to reproduce it with postgres too, but it not only requires closing and reopening the FDs, it also requires forcing writeback before opening the fd. To make it occur in a practical timeframe I have to make my kernel writeback settings insanely aggressive and/or call sync() before re-open()ing. I don't really think it's worth it, since I've confirmed the behaviour already with the simpler test in standalone/ in the rest repo. To try it yourself, clone</p> <p><a href="https://github.com/ringerc/scrapcode">https://github.com/ringerc/scrapcode</a></p> <p>and in the master branch</p> <pre><code>cd testcases/fsync-error-clear less README make REOPEN=reopen standalone-run </code></pre> <p>See <a href="https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118">https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118</a> .</p> <p>I've pushed the postgres test to that repo too; &quot;make postgres-run&quot;.</p> <p>You'll need docker, and be warned, it's using privileged docker containers and messing with dmsetup.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-08 02:16:07 </code></pre> <p>So, what can we actually do about this new Linux behaviour?</p> <p>Idea 1:</p> <ul> <li>whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)</li> <li>if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to</li> </ul> <p>Maybe it could be made to work, but sheesh, that seems horrible. 
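<p>(As background for the &quot;send it a copy of the file descriptor via IPC&quot; part of idea 1, here is a minimal, illustrative sketch of descriptor passing over a Unix-domain socket with SCM_RIGHTS; it is not PostgreSQL code. The descriptor the peer receives is a kernel-level duplicate, so it refers to the same open file description and therefore sees the same writeback-error state.)</p> <pre><code>#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;sys/uio.h&gt;

/* Send an already-open file descriptor to a peer over a connected
 * Unix-domain socket (e.g. one end of a socketpair()).  The kernel
 * installs a duplicate of the descriptor in the receiving process,
 * which extracts it from the ancillary data with recvmsg(). */
int send_fd(int sock, int fd)
{
    char byte = 0;
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    iov.iov_base = &amp;byte;          /* the message must carry at least one byte */
    iov.iov_len = 1;

    memset(&amp;msg, 0, sizeof msg);
    memset(cbuf, 0, sizeof cbuf);
    msg.msg_iov = &amp;iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;        /* ancillary data buffer carrying the fd */
    msg.msg_controllen = sizeof cbuf;

    cmsg = CMSG_FIRSTHDR(&amp;msg);
    cmsg-&gt;cmsg_level = SOL_SOCKET;
    cmsg-&gt;cmsg_type = SCM_RIGHTS;
    cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &amp;fd, sizeof(int));

    return sendmsg(sock, &amp;msg, 0) == 1 ? 0 : -1;
}
</code></pre>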
Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?</p> <p>Idea 2:</p> <p>Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).</p> <p>Idea 3:</p> <p>Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.</p> <p>Any other ideas?</p> <p>For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea long these lines that could work?</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-08 02:33:37 </code></pre> <p>On Sun, Apr 8, 2018 at 02:16:07PM +1200, Thomas Munro wrote:</p> <blockquote> <p>So, what can we actually do about this new Linux behaviour?</p> <p>Idea 1:</p> <ul> <li><p>whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)</p></li> <li><p>if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to</p></li> </ul> <p>Maybe it could be made to work, but sheesh, that seems horrible. 
Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?</p> <p>Idea 2:</p> <p>Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).</p> <p>Idea 3:</p> <p>Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.</p> </blockquote> <p>Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 02:37:47 </code></pre> <blockquote> <p>On Apr 7, 2018, at 19:33, Bruce Momjian <bruce(at)momjian(dot)us> wrote: Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.</p> </blockquote> <p>Pragmatically, that's where we are right now. The best answer in this bad situation is (a) fix the error, then (b) replay from a checkpoint before the error occurred, but it appears we can't even guarantee that a PostgreSQL process will be the one to see the error.</p> <p>-- -- Christophe Pettus xof(at)thebuild(dot)com</p> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-08 03:27:45 </code></pre> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>So, what can we actually do about this new Linux behaviour?</p> </blockquote> <p>Yeah, I've been cooking over that myself.</p> <p>More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.</p> <p>We have a storage abstraction that makes this way, way less painful than it should be.</p> <p>We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.</p> <p>Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.</p> <p>That way we can use something like your #1 (which is what I was also thinking about then rejecting previously), but reduce the pain by reducing the FD count drastically so exhausting FDs stops being a problem.</p> <p>Previously I was leaning toward what you've described here:</p> <blockquote> <ul> <li><p>whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)</p></li> <li><p>if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to</p></li> </ul> <p>Maybe it could be made to work, but sheesh, that seems horrible. 
Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?</p> </blockquote> <p>... and got stuck on &quot;yuck, that's awful&quot;.</p> <p>I was assuming we'd force early checkpoints if the checkpointer hit its fd limit, but that's even worse.</p> <p>We'd need to urgently do away with segmented relations, and partitions would start to become a hinderance.</p> <p>Even then it's going to be an unworkable nightmare with heavily partitioned systems, systems that use schema-sharding, etc. And it'll mean we need to play with process limits and, often, system wide limits on FDs. I imagine the performance implications won't be pretty.</p> <p>Idea 2:</p> <blockquote> <p>Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).</p> </blockquote> <p>This appears to be what SQLite does AFAICS.</p> <p><a href="https://www.sqlite.org/atomiccommit.html">https://www.sqlite.org/atomiccommit.html</a></p> <p>though it has the huge luxury of a single writer, so it's probably only subject to the original issue not the multiprocess / checkpointer issues we face.</p> <blockquote> <p>Idea 3:</p> <p>Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.</p> </blockquote> <p>That seems to be what the kernel folks will expect. But that's going to KILL performance. We'll need writer threads to have any hope of it not <em>totally</em> sucking, because otherwise simple things like updating a heap tuple and two related indexes will incur enormous disk latencies.</p> <p>But I suspect it's the path forward.</p> <p>Goody.</p> <blockquote> <p>Any other ideas?</p> <p>For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea long these lines that could work?</p> </blockquote> <p>I think our underlying data syncing concept is fundamentally broken, and it's not really the kernel's fault.</p> <p>We assume that we can safely:</p> <pre><code>procA: open() procA: write() procA: close() </code></pre> <p>... 
some long time later, unbounded as far as the kernel is concerned ...</p> <pre><code>procB: open() procB: fsync() procB: close() </code></pre> <p>If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?</p> <p>Should it just remember &quot;this file had an error&quot; forever, and tell every caller? In that case how could we recover? We'd need some new API to say &quot;yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now&quot;. Otherwise it'd keep reporting an error after we did redo to recover, too.</p> <p>I never really clicked to the fact that we closed relations with pending buffered writes, left them closed, then reopened them to fsync. That's .... well, the kernel isn't the only thing doing crazy things here.</p> <p>Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.</p> <p>Fun times.</p> <p>This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST make sure you don't run out of disk. Because the usual safeguard of space reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested this with nfsv3 in sync,hard,nointr mode yet, <em>maybe</em> that's safe, but I doubt it). The same applies to thin-provisioned storage. Just. Don't.</p> <p>This helps explain various reports of corruption in Docker and various other tools that use various sorts of thin provisioning. If you hit ENOSPC in fsync(), bye bye data.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-08 03:37:06 </code></pre> <p>On Sat, Apr 7, 2018 at 8:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.</p> <p>We have a storage abstraction that makes this way, way less painful than it should be.</p> <p>We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.</p> <p>Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.</p> </blockquote> <p>I'm not sure that we can do that now, since it would break the new &quot;Optimize btree insertions for common case of increasing values&quot; optimization. (I did mention this before it went in.)</p> <p>I've asked Pavan to at least add a note to the nbtree README that explains the high level theory behind the optimization, as part of post-commit clean-up. I'll ask him to say something about how it might affect extent-based storage, too.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 03:46:17 </code></pre> <blockquote> <p>On Apr 7, 2018, at 20:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). 
If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.</p> </blockquote> <p>Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.</p> <hr> <pre><code>From:Andreas Karlsson &lt;andreas(at)proxel(dot)se&gt; Date:2018-04-08 09:41:06 </code></pre> <p>On 04/08/2018 05:27 AM, Craig Ringer wrote:&gt;</p> <blockquote> <p>More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.</p> </blockquote> <p>FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache so they change the default some releases ago.</p> <p><a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html">https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html</a></p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-08 10:30:31 </code></pre> <p>On 8 April 2018 at 11:46, Christophe Pettus <xof(at)thebuild(dot)com> wrote:</p> <blockquote> <p>On Apr 7, 2018, at 20:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.</p> <p>Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.</p> </blockquote> <p>I don't entirely agree - because it affects ENOSPC, I/O errors on thin provisioned storage, I/O errors on multipath storage, etc. (I identified the original issue on a thin provisioned system that ran out of backing space, mangling PostgreSQL in a way that made no sense at the time).</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-08 10:31:24 </code></pre> <p>On 8 April 2018 at 17:41, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:</p> <blockquote> <p>On 04/08/2018 05:27 AM, Craig Ringer wrote:&gt; More below, but here's an idea #5: decide InnoDB has the right idea, and</p> <blockquote> <p>go to using a single massive blob file, or a few giant blobs.</p> </blockquote> <p>FYI: MySQL has by default one file per table these days. 
The old approach with one massive file was a maintenance headache so they change the default some releases ago.</p> <p><a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html">https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html</a></p> </blockquote> <p>Huh, thanks for the update.</p> <p>We should see how they handle reliable flushing and see if they've looked into it. If they haven't, we should give them a heads-up and if they have, lets learn from them.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 16:38:03 </code></pre> <blockquote> <p>On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> </blockquote> <p>This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-08 21:23:21 </code></pre> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <p>If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?</p> <p>Should it just remember &quot;this file had an error&quot; forever, and tell every caller? In that case how could we recover? We'd need some new API to say &quot;yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now&quot;. Otherwise it'd keep reporting an error after we did redo to recover, too.</p> </blockquote> <p>There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.</p> <p>The problem is not that errors aren't been tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> <p>As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.</p> <p>Going to DIRECTIO is basically recognizing this. That the kernel filesystem buffer provides no reliable interface so we need to reimplement it ourselves in user space.</p> <p>It's rather disheartening. Aside from having to do all that work we have the added barrier that we don't have as much information about the hardware as the kernel has. 
We don't know where raid stripes begin and end, how big the memory controller buffers are or how to tell when they're full or empty or how to flush them. etc etc. We also don't know what else is going on on the machine.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 21:28:43 </code></pre> <blockquote> <p>On Apr 8, 2018, at 14:23, Greg Stark <stark(at)mit(dot)edu> wrote:</p> <p>They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> </blockquote> <p>That's not an irrational position. File system buffers are <em>not</em> dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-08 21:47:04 </code></pre> <p>On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <p>If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?</p> <p>Should it just remember &quot;this file had an error&quot; forever, and tell every caller? In that case how could we recover? We'd need some new API to say &quot;yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now&quot;. Otherwise it'd keep reporting an error after we did redo to recover, too.</p> </blockquote> <p>There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync</p> </blockquote> <p>Instead, fsync() reports when some of the buffers have not been written out, due to reasons outlined before. As such it may make some sense to maintain some tracking regarding errors even after marking failed dirty pages as clean (in fact it has been proposed, but this introduces memory overhead).</p> <blockquote> <p>again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.</p> </blockquote> <p>The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.</p> <blockquote> <p>The problem is not that errors aren't been tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. 
They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> <p>As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.</p> </blockquote> <p>This does not necessarily follow. Whether the kernel discards writes or not would not really help (see above). It is more a matter of proper &quot;reporting contract&quot; between userspace and kernel, and tracking would be a way for facilitating this vs. having a more complex userspace scheme (as described by others in this thread) where synchronization for fsync() is required in a multi-process application.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-08 22:29:16 </code></pre> <p>On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> </blockquote> <p>This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.</p> </blockquote> <p>I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem of fail-over to a standby.</p> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 23:10:24 </code></pre> <blockquote> <p>On Apr 8, 2018, at 15:29, Bruce Momjian <bruce(at)momjian(dot)us> wrote: I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.</p> </blockquote> <p>Yeah, it's bad. In the short term, the best advice to installations is to monitor their kernel logs for errors (which very few do right now), and make sure they have a backup strategy which can encompass restoring from an error like this. Even Craig's smart fix of patching the backup label to recover from a previous checkpoint doesn't do much good if we don't have WAL records back that far (or one of the required WAL records also took a hit).</p> <p>In the longer term... O_DIRECT seems like the most plausible way out of this, but that might be popular with people running on file systems or OSes that don't have this issue. 
(Setting aside the daunting prospect of implementing that.)</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-08 23:16:25 </code></pre> <p>On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> </blockquote> <p>This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.</p> </blockquote> <p>I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem of fail-over to a standby.</p> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>I think the danger presented here is far smaller than some of the statements in this thread might make one think. In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed. Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> <p>There's a lot of not so great things here, but I don't think there's any need to panic.</p> <p>We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.</p> <p>I think there's pretty good reasons to go to direct IO where supported, but error handling doesn't strike me as a particularly good reason for the move.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 23:27:57 </code></pre> <blockquote> <p>On Apr 8, 2018, at 16:16, Andres Freund <andres(at)anarazel(dot)de> wrote: We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> </blockquote> <p>There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.</p> <p>That being said...</p> <blockquote> <p>There's a lot of not so great things here, but I don't think there's any need to panic.</p> </blockquote> <p>No reason to panic, yes. We can assume that if this was a very big persistent problem, it would be much more widely reported. 
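<p>(To make &quot;reported errors are treated with crash recovery&quot; concrete, here is a hedged sketch of that policy, not actual PostgreSQL code: a failed fsync() is treated as fatal rather than retried, because by the time it returns the kernel may already have marked the pages clean, and a later fsync() can then report success without the data ever having reached storage.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/* Illustrative policy: never downgrade an fsync() failure to a retryable
 * ERROR; crash and run redo from the last good checkpoint instead. */
void durable_fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, &quot;fsync of %s failed: %s\n&quot;, path, strerror(errno));
        abort();   /* stand-in for PostgreSQL's PANIC */
    }
}
</code></pre>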
It would, however, be good to find a way to get the error surfaced back up to the client in a way that is not just monitoring the kernel logs.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 01:31:56 </code></pre> <p>On 9 April 2018 at 05:28, Christophe Pettus <xof(at)thebuild(dot)com> wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 14:23, Greg Stark <stark(at)mit(dot)edu> wrote:</p> <p>They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> </blockquote> <p>That's not an irrational position. File system buffers are <em>not</em> dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.</p> </blockquote> <p>Yep.</p> <p>Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hopes it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?</p> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.</p> <p>You can coarsen from dirty buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment?</p> <p>You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball.</p> <p>It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for. It doesn't know when all interested clients have been informed and it can treat the fault as cleared/repaired, either, so it'd have to <em>keep on reporting EIO for PostgreSQL's own writes and fsyncs() indefinitely</em>, even once we do recovery.</p> <p>The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.</p> <p>Blaming the kernel for this sure is the easy way out.</p> <p>But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without actually telling it anything about our activities or even that we still exist and are still interested in the files/writes. 
We've closed the files and gone away.</p> <p>Whatever we do, it's likely going to have to involve not doing that anymore.</p> <p>Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and only on Linux.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 01:35:06 </code></pre> <p>On 9 April 2018 at 06:29, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.</p> </blockquote> <p>Right.</p> <p>Specifically, we need a way to ask the kernel at checkpoint time &quot;was everything written to [this set of files] flushed successfully since the last time I asked, no matter who did the writing and no matter how the writes were flushed?&quot;</p> <p>If the result is &quot;no&quot; we PANIC and redo. If the hardware/volume is screwed, the user can fail over to a standby, do PITR, etc.</p> <p>But we don't have any way to ask that reliably at present.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 01:55:10 </code></pre> <p>Hi,</p> <p>On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 16:16, Andres Freund <andres(at)anarazel(dot)de> wrote: We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> </blockquote> <p>There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.</p> </blockquote> <p>Only if you're &quot;lucky&quot; enough that your clients actually read that data, and then you're somehow able to figure out across the whole stack that these 0.001% of transactions that fail are due to IO errors. Or you also need to do log analysis.</p> <p>If you want to solve things like that you need regular reads of all your data, including verifications etc.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 02:00:41 </code></pre> <p>On 9 April 2018 at 07:16, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>I think the danger presented here is far smaller than some of the statements in this thread might make one think.</p> </blockquote> <p>Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.</p> <p>I suspect we've written off a fair few issues in the past as &quot;it'd bad hardware&quot; when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. 
But even so, if it were happening tons, we'd hear more noise.</p> <p>I've already been very surprised there when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL. We merrily re-create it with uninitalized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say &quot;well, don't do that then&quot; and define away file losses from FS corruption etc as not our problem, the lower levels we expect to take care of this have failed.</p> <p>We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as &quot;not our problem, the lower level failed, we're not obliged to detect this and fail gracefully.&quot;</p> <p>We can choose to say that checkpoints are required to guarantee crash/power loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.</p> <blockquote> <p>In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed.</p> </blockquote> <p>Your database is going to be down or have interrupted service. It's possible you may have some unreadable data. This could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog/ contents.</p> <p>But in general your DB should be repairable/recoverable even in those cases.</p> <p>And in many failure modes there's no reason to expect any data loss at all, like:</p> <ul> <li>Local disk fills up (seems to be safe already due to space reservation at write() time)</li> <li>Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up</li> <li>NFS volume fills up</li> <li>Multipath I/O error</li> <li>Interruption of connectivity to network block device</li> <li>Disk develops localized bad sector where we haven't previously written data</li> </ul> <p>Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...</p> <p>Part of the trouble is that the kernel <em>won't</em> retry forever in all these cases, and doesn't seem to have a way to ask it to in all cases.</p> <p>And if the user hasn't configured it for the right behaviour in terms of I/O error resilience, we don't find out about it.</p> <p>So it's not the end of the world, but it'd sure be nice to fix.</p> <blockquote> <p>Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> </blockquote> <p>That's because reads don't make promises about what's committed and synced. 
I think that's quite different.</p> <blockquote> <p>We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.</p> </blockquote> <p>Certainly for current versions.</p> <p>I think we need to think about a more robust path in future. But it's certainly not &quot;stop the world&quot; territory.</p> <p>The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If it does, that's not our problem anymore.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 02:06:12 </code></pre> <p>On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:</p> <blockquote> <p>I suspect we've written off a fair few issues in the past as &quot;it'd bad hardware&quot; when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.</p> </blockquote> <p>Agreed on that, but I think that's FAR more likely to be things like multixacts, index structure corruption due to logic bugs etc.</p> <blockquote> <p>I've already been very surprised there when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL. We merrily re-create it with uninitalized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say &quot;well, don't do that then&quot; and define away file losses from FS corruption etc as not our problem, the lower levels we expect to take care of this have failed.</p> </blockquote> <p>And it'd be a realy bad idea to behave differently.</p> <blockquote> <p>And in many failure modes there's no reason to expect any data loss at all, like:</p> <ul> <li>Local disk fills up (seems to be safe already due to space reservation at write() time)</li> </ul> </blockquote> <p>That definitely should be treated separately.</p> <blockquote> <ul> <li>Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up</li> <li>NFS volume fills up</li> </ul> </blockquote> <p>Those should be the same as the above.</p> <blockquote> <p>I think we need to think about a more robust path in future. But it's certainly not &quot;stop the world&quot; territory.</p> </blockquote> <p>I think you're underestimating the complexity of doing that by at least two orders of magnitude.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 03:15:01 </code></pre> <p>On 9 April 2018 at 10:06, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <blockquote> <p>And in many failure modes there's no reason to expect any data loss at all, like:</p> <ul> <li>Local disk fills up (seems to be safe already due to space reservation at write() time)</li> </ul> </blockquote> <p>That definitely should be treated separately.</p> </blockquote> <p>It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. 
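</p> <p>(Illustrative aside, not part of the original email: a minimal C sketch of the write()-time error path described here. On a local filesystem that reserves space at write() time, a full disk surfaces as ENOSPC from write() itself, where the caller can handle it immediately; the file name is arbitrary.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/* Write the whole buffer, retrying short writes; 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len &gt; 0)
    {
        ssize_t n = write(fd, buf, len);
        if (n &lt; 0)
        {
            if (errno == EINTR)
                continue;
            return -1;          /* e.g. ENOSPC, reported synchronously here */
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd &lt; 0 || write_all(fd, &quot;hello\n&quot;, 6) != 0)
    {
        perror(&quot;open/write&quot;);   /* a full local filesystem shows up here... */
        return 1;
    }
    return 0;                   /* ...not at some later fsync() or writeback */
}
</code></pre> <p>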
Otherwise we'd be hearing a LOT more noise about this.</p> <blockquote> <blockquote> <ul> <li>Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up</li> <li>NFS volume fills up</li> </ul> </blockquote> <p>Those should be the same as the above.</p> </blockquote> <p>Unfortunately, they aren't.</p> <p>AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same paths as EIO: we fsync(), ERROR, retry, appear to succeed, and carry on with life losing the data. Or we never hear about the error in the first place.</p> <p>(There's a proposed extension that'd allow this, see <a href="https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5">https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5</a>, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages at a quick read.)</p> <p>Thin provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:</p> <pre><code>xfs -&gt; lvm2 -&gt; multipath -&gt; ??? -&gt; SAN </code></pre> <p>(the HBA/iSCSI/whatever was not recorded by the looks, but IIRC it was iSCSI. I'm checking.)</p> <p>The SAN ran out of space. Due to use of thin provisioning, Linux <em>thought</em> there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc. The space exhaustion manifested as I/O errors on flushes of writeback buffers.</p> <p>The logs were like this:</p> <pre><code>kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787
</code></pre> <p>The immediate cause was that Linux's multipath driver didn't seem to recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away, and tell any interested writers if they feel like it, per discussion upthread.</p> <p>In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to cause the checkpoint to ERROR, it &quot;completed&quot; a &quot;successful&quot; checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. 
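</p> <p>(Aside, not from the original message: the ERROR-retry-&quot;succeed&quot; sequence described above, reduced to a bare syscall pattern as a minimal C sketch. On a healthy disk both fsync() calls return 0; on the kernels discussed in this thread, the first call can return EIO for a lost writeback and the retry can then return 0 even though the data never reached stable storage.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_RDWR | O_CREAT, 0600);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    if (write(fd, &quot;x&quot;, 1) != 1)     /* buffered write: no error yet       */
        perror(&quot;write&quot;);

    if (fsync(fd) != 0)             /* checkpoint attempt #1              */
        perror(&quot;fsync&quot;);            /* may report EIO exactly once ...    */

    if (fsync(fd) == 0)             /* checkpoint attempt #2 (the retry)  */
        puts(&quot;fsync reports success, but the failed pages may be gone&quot;);

    close(fd);
    return 0;
}
</code></pre> <p>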
As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.</p> <p>WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue.</p> <p>In this case if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.</p> <p>If, prior to seeing this, you'd asked me &quot;will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space&quot; I'd have said &quot;Surely not. PostgreSQL won't be corrupted by running out of disk space, it orders writes carefully and forces flushes so that it will recover gracefully from write failures.&quot;</p> <p>Except not. I was very surprised.</p> <p>BTW, it also turns out that the <em>default</em> for multipath is to give up on errors anyway; see the queue_if_no_path option and no_path_retries options. (Hint: run PostgreSQL with no_path_retries=queue). That's a sane default if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.</p> <p>I regularly see rather a lot of multipath systems, iSCSI systems, SAN backed systems, etc. I think we need to be pretty clear that we expect them to retry indefinitely, and if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it). Also warn against running on NFS even with &quot;hard,sync,nointr&quot;.</p> <p>It'd be interesting to have a tool that tested error handling, allowing people to do iSCSI plug-pull tests, that sort of thing. But as far as I can tell nobody ever tests their storage stack anyway, so I don't plan on writing something that'll never get used.</p> <blockquote> <blockquote> <p>I think we need to think about a more robust path in future. But it's certainly not &quot;stop the world&quot; territory.</p> </blockquote> <p>I think you're underestimating the complexity of doing that by at least two orders of magnitude.</p> </blockquote> <p>Oh, it's just a minor total rewrite of half Pg, no big deal ;)</p> <p>I'm sure that no matter how big I think it is, I'm still underestimating it.</p> <p>The most workable option IMO would be some sort of fnotify/dnotify/whatever that reports all I/O errors on a volume. Some kind of error reporting handle we can keep open on a volume level that we can check for each volume/tablespace after we fsync() everything to see if it all really worked. 
If we PANIC if that gives us a bad answer, and PANIC on fsync errors, we guard against the great majority of these sorts of should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data errors.</p> <p>Even then, good luck getting those events from an NFS volume in which the backing volume experiences an issue.</p> <p>And it's kind of moot because AFAICS no such interface exists.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-09 08:45:40 </code></pre> <p>On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com></p> </blockquote> </blockquote> <p>The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.</p> </blockquote> <p>Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future. But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?</p> <p>But just to underline the point. &quot;pointless to keep them dirty&quot; is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the <em>really</em> last thing user space expects is for this to happen and return no error.)</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 10:50:41 </code></pre> <p>On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com></p> </blockquote> </blockquote> <p>The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.</p> </blockquote> <p>Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. 
And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future.</p> </blockquote> <p>To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack. Mind you, I am not claiming that this contract is either documented or necessarily respected (in fact there have been studies on the error propagation and handling of the block layer, see [1]). Let us assume that this is the design contract though (which appears to be the case across a number of open-source kernels), and if not - it's a bug. In this case, indeed the specific write()s will never be able to succeed in the future, at least not as long as the BIOs are allocated to the specific failing LBAs.</p> <blockquote> <p>But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?</p> </blockquote> <p>I think there are degrees of failure. There are certainly cases where one may encounter localized unrecoverable medium errors (specific to certain LBAs) that are non-maskable from the block layer and below. That does not mean that the device is dropped at all, so it does make sense to continue all other operations to all other regions of the device that are functional. In cases of total device failure, then the filesystem will prevent you from proceeding anyway.</p> <blockquote> <p>But just to underline the point. &quot;pointless to keep them dirty&quot; is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the <em>really</em> last thing user space expects is for this to happen and return no error.)</p> </blockquote> <p>Right. This implies though that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, that there's further an interface for obtaining the exact failed pages so that you can read them back. This in turn means that there needs to be an association between the fsync() caller and the specific dirtied pages that the caller intents to drain (for which we'd need an fsync_range(), among other things). BTW, currently the failed writebacks are not dropped from memory, but rather marked clean. They could be lost though due to memory pressure or due to explicit request (e.g. proc drop_caches), unless mlocked.</p> <p>There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s where ever even buffered in the kernel page cache in the first place.</p> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. 
a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> <p>[1] <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf">https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf</a></p> <hr> <pre><code>From:Geoff Winkless &lt;pgsqladmin(at)geoff(dot)dj&gt; Date:2018-04-09 12:03:28 </code></pre> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That seems like a perfectly reasonable position to take, frankly.</p> <p>The whole <em>point</em> of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a &quot;difficult&quot; problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its<em>job</em>.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 12:16:38 </code></pre> <p>On 9 April 2018 at 18:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s where ever even buffered in the kernel page cache in the first place.</p> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That's what Pg appears to assume now, yes.</p> <p>Whether that's reasonable is a whole different topic.</p> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. I'd even say that's ideal.</p> <p>In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. 
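</p> <p>(Editorial aside, not in the original message: a hypothetical sketch of the close-time behaviour proposed here. The helper name is made up and this is not PostgreSQL's actual fd.c code; the point is only that the flush happens before the descriptor is evicted, and that a flush failure is treated as fatal rather than being lost along with the fd.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

/* Hypothetical helper: flush before an fd ages out of the descriptor cache. */
void evict_cached_fd(int fd)
{
    if (fsync(fd) != 0)
    {
        perror(&quot;fsync before close&quot;);
        abort();                /* i.e. PANIC and rely on WAL crash recovery */
    }
    close(fd);
}

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_RDWR | O_CREAT, 0600);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }
    if (write(fd, &quot;x&quot;, 1) != 1)
        perror(&quot;write&quot;);
    evict_cached_fd(fd);        /* instead of a bare close(fd) */
    return 0;
}
</code></pre> <p>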
If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.</p> <p>Some keen person who wants to later could optimise it by adding a fsync worker thread pool in backends, so we don't block the main thread. Frankly that might be a nice thing to have in the checkpointer anyway. But it's out of scope for fixing this in durability terms.</p> <p>I'm partway through a patch that makes fsync panic on errors now. Once that's done, the next step will be to force fsync on close() in md and see how we go with that.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 12:31:27 </code></pre> <p>On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:</p> <blockquote> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That seems like a perfectly reasonable position to take, frankly.</p> </blockquote> <p>Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them. Good luck convincing any OS kernel upstream to go with this design.</p> <blockquote> <p>The whole <em>point</em> of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a &quot;difficult&quot; problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its<em>job</em>.</p> </blockquote> <p>No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.</p> <p>It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. 
But I would assume it is reasonable to figure out something between those two extremes that can work in a &quot;portable&quot; fashion.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 12:54:16 </code></pre> <p>On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:</p> <blockquote> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. I'd even say that's ideal.</p> </blockquote> <p>I see what you are saying. So basically you'd always maintain the notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to). The kernel wouldn't even have to maintain per-page bits to trace the errors, since they will be consumed by the process that reads the events (or discarded, when the notification fd is closed).</p> <p>Assuming this would be possible, wouldn't Pg still need to deal with synchronizing writers and related issues (since this would be merely a notification mechanism - not prevent any process from continuing), which I understand would be rather intrusive for the current Pg multi-process design.</p> <p>But other than that, similarly this interface could in principle be similarly implemented in the BSDs via kqueue(), I suppose, to provide what you need.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 13:33:18 </code></pre> <p>On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:</p> <blockquote> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That seems like a perfectly reasonable position to take, frankly.</p> </blockquote> <p>Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them.</p> </blockquote> <p>Sure, but the question is whether the system can reasonably operate after some of the writes failed and the data got lost. Because if it can't, then recovering the memory is rather useless. It might be better to stop the system in that case, forcing the system administrator to resolve the issue somehow (fail-over to a replica, perform recovery from the last checkpoint, ...).</p> <p>We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. 
I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.</p> <blockquote> <p>Good luck convincing any OS kernel upstream to go with this design.</p> </blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.</p> <p>The question is whether the current design makes it any easier for user-space developers to build reliable systems. We have tried using it, and unfortunately the answers seems to be &quot;no&quot; and &quot;Use direct I/O and manage everything on your own!&quot;</p> <blockquote> <blockquote> <p>The whole <em>point</em> of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a &quot;difficult&quot; problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its<em>job</em>.</p> </blockquote> <p>No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.</p> </blockquote> <p>We can (and do) take care of the atomicity and isolation. Implementation of those parts is obviously very application-specific, and we have WAL and locks for that purpose. I/O on the other hand seems to be a generic service provided by the OS - at least that's how we saw it until now.</p> <blockquote> <p>It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a &quot;portable&quot; fashion.</p> </blockquote> <p>Users ask us about this quite often, actually. The question is usually about &quot;RAW devices&quot; and performance, but ultimately it boils down to buffered vs. direct I/O. So far our answer was we rely on kernel to do this reliably, because they know how to do that correctly and we simply don't have the manpower to implement it (portable, reliable, handling different types of storage, ...).</p> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 13:42:35 </code></pre> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. 
It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> <hr> <pre><code>From:Abhijit Menon-Sen &lt;ams(at)2ndQuadrant(dot)com&gt; Date:2018-04-09 13:47:03 </code></pre> <p>At 2018-04-09 15:42:35 +0200, tomas(dot)vondra(at)2ndquadrant(dot)com wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way.</p> </blockquote> <p>Not least because Craig's tests showed that you can't rely on <em>always</em> getting an error message in the logs.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 13:54:19 </code></pre> <p>On 04/09/2018 04:00 AM, Craig Ringer wrote:</p> <blockquote> <p>On 9 April 2018 at 07:16, Andres Freund &lt;andres(at)anarazel(dot)de</p> <blockquote> <p>I think the danger presented here is far smaller than some of the statements in this thread might make one think.</p> </blockquote> <p>Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.</p> </blockquote> <p>Yeah, it clearly isn't the case that everything we do suddenly got pointless. It's fairly annoying, though.</p> <blockquote> <p>I suspect we've written off a fair few issues in the past as &quot;it'd bad hardware&quot; when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.</p> </blockquote> <p>Right. Write errors are fairly rare, and we've probably ignored a fair number of cases demonstrating this issue. It kinda reminds me the wisdom that not seeing planes with bullet holes in the engine does not mean engines don't need armor [1].</p> <p>[1] <a href="https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d">https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d</a></p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 14:22:06 </code></pre> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.</p> </blockquote> <p>Sure, there could be knobs for limiting how much memory such &quot;zombie&quot; pages may occupy. 
Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).</p> <blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.</p> </blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> <p>Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.</p> <blockquote> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> </blockquote> <p>I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.</p> <p>[1] <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf</a></p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-09 15:29:36 </code></pre> <p>On 9 April 2018 at 15:22, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <p>Sure, there could be knobs for limiting how much memory such &quot;zombie&quot; pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner.</p> </blockquote> <p>Surely this is exactly what the kernel is there to manage. 
It has to control how much memory is allowed to be full of dirty buffers in the first place to ensure that the system won't get memory starved if it can't clean them fast enough. That isn't even about persistent hardware errors. Even when the hardware is working perfectly it can only flush buffers so fast. The whole point of the kernel is to abstract away shared resources. It's not like user space has any better view of the situation here. If Postgres implemented all this in DIRECT_IO it would have exactly the same problem only with less visibility into what the rest of the system is doing. If every application implemented its own buffer cache we would be back in the same boat only with a fragmented memory allocation.</p> <blockquote> <p>This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error.</p> </blockquote> <p>I still think we're speaking two different languages. There's no application anywhere that's going to &quot;clear the error&quot;. The application has done the writes and if it's calling fsync it wants to wait until the filesystem can arrange for the write to be persisted. If the application could manage without the persistence then it wouldn't have called fsync.</p> <p>The only way to &quot;clear out&quot; the error would be by having the writes succeed. There's no reason to think that wouldn't be possible sometime. The filesystem could remap blocks or an administrator could replace degraded raid device components. The only thing Postgres could do to recover would be create a new file and move the data (reading from the dirty buffer in memory!) to a new file anyways so we would &quot;clear the error&quot; by just no longer calling fsync on the old file.</p> <p>We always read fsync as a simple write barrier. That's what the documentation promised and it's what Postgres always expected. It sounds like the kernel implementors looked at it as some kind of communication channel to communicate status report for specific writes back to user-space. That's a much more complex problem and would have entirely different interface. I think this is why we're having so much difficulty communicating.</p> <blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> </blockquote> <p>Well if they're writing to the same file that had a previous error I doubt there are many applications that would be happy to consider their writes &quot;persisted&quot; when the file was corrupt. 
Ironically the earlier discussion quoted talked about how applications that wanted more granular communication would be using O_DIRECT -- but what we have is fsync trying to be <em>too</em> granular such that it's impossible to get any strong guarantees about anything with it.</p> <blockquote> <blockquote> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> </blockquote> <p>I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO).</p> </blockquote> <p>Honestly I don't think there's <em>any</em> way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can be having silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-09 16:45:00 </code></pre> <p>On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.</p> </blockquote> <p>Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptor flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.</p> <p>Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.</p> <p>I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? 
I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-09 17:26:24 </code></pre> <p>On 04/09/2018 09:45 AM, Robert Haas wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.</p> </blockquote> <p>I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.</p> </blockquote> <p>I don't have a better option but whatever we do, it should be an optional (GUC) change. We have plenty of YEARS of people not noticing this issue and Robert's correct, if we go back to an era of things like stalls it is going to look bad on us no matter how we describe the problem.</p> <hr> <pre><code>From:Gasper Zejn &lt;zejn(at)owca(dot)info&gt; Date:2018-04-09 18:02:21 </code></pre> <p>On 09. 04. 2018 15:42, Tomas Vondra wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> <p>regards</p> </blockquote> <p>For a bit less (or more) crazy idea, I'd imagine creating a Linux kernel module with kprobe/kretprobe capturing the file passed to fsync or even byte range within file and corresponding return value shouldn't be that hard. 
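</p> <p>(Illustrative aside, not part of the original email: a hypothetical outline of the kretprobe idea in C, which only logs fsync() return values rather than the file or byte range. The probed symbol name differs across kernel versions and architectures, so treat this as a sketch rather than a drop-in module.)</p> <pre><code>#include &lt;linux/module.h&gt;
#include &lt;linux/kprobes.h&gt;
#include &lt;linux/ptrace.h&gt;
#include &lt;linux/sched.h&gt;

/* Log any non-zero fsync() return value, with the calling pid. */
static int fsync_ret(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    long rc = regs_return_value(regs);
    if (rc != 0)
        pr_warn(&quot;fsync returned %ld for pid %d\n&quot;, rc, current-&gt;pid);
    return 0;
}

static struct kretprobe krp = {
    .kp.symbol_name = &quot;ksys_fsync&quot;,   /* name varies by kernel version */
    .handler        = fsync_ret,
};

static int __init fsync_watch_init(void) { return register_kretprobe(&amp;krp); }
static void __exit fsync_watch_exit(void) { unregister_kretprobe(&amp;krp); }

module_init(fsync_watch_init);
module_exit(fsync_watch_exit);
MODULE_LICENSE(&quot;GPL&quot;);
</code></pre> <p>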
Kprobe has been a part of Linux kernel for a really long time, and from first glance it seems like it could be backported to 2.6 too.</p> <p>Then you could have stable log messages or implement some kind of &quot;fsync error log notification&quot; via whatever is the most sane way to get this out of kernel.</p> <p>If the kernel is new enough and has eBPF support (seems like &gt;=4.4), using bcc-tools[1] should enable you to write a quick script to get exactly that info via perf events[2].</p> <p>Obviously, that's a stopgap solution ...</p> <p>[1] <a href="https://github.com/iovisor/bcc">https://github.com/iovisor/bcc</a> [2] <a href="https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/">https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/</a></p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 18:29:42 </code></pre> <blockquote> <p>On Apr 9, 2018, at 10:26 AM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:</p> <p>We have plenty of YEARS of people not noticing this issue</p> </blockquote> <p>I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-09 19:02:11 </code></pre> <p>On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptor flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.</p> <p>Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.</p> <p>I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? 
I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.</p> </blockquote> <p>What about the bug we fixed in <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e">https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e</a> ? Say somebody does something along the lines of:</p> <pre><code>ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9 </code></pre> <p>...and then restarts postgres. Craig's proposal wouldn't cover this case, because there was no opportunity to run fsync() after the first crash, and there's now no way to go back and fsync() any stuff we didn't fsync() before, because the kernel may have already thrown away the error state, or may lie to us and tell us everything is fine (because our new fd wasn't opened early enough). I can't find the original discussion that led to that commit right now, so I'm not exactly sure what scenarios we were thinking about. But I think it would at least be a problem if full_page_writes=off or if you had previously started the server with fsync=off and now wish to switch to fsync=on after completing a bulk load or similar. Recovery can read a page, see that it looks OK, and continue, and then a later fsync() failure can revert that page to an earlier state and now your database is corrupted -- and there's absolutely no way to detect this because write() gives you the new page contents later, fsync() doesn't feel obliged to tell you about the error because your fd wasn't opened early enough, and eventually the write can be discarded and you'll revert back to the old page version with no errors ever being reported anywhere.</p> <p>Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.</p> <p>What's being presented to us as the API contract that we should expect from buffered I/O is that if you open a file and read() from it, call fsync(), and get no error, the kernel may nevertheless decide that some previous write that it never managed to flush can't be flushed, and then revert the page to the contents it had at some point in the past. That's more or less equivalent to letting a malicious adversary randomly overwrite database pages with plausible-looking but incorrect contents without notice and hoping you can still build a reliable system. You can avoid the problem if you can always open an fd for every file you want to modify before it's written and hold on to it until after it's fsync'd, but that's pretty hard to guarantee in the face of kill -9.</p> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. 
Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:13:14 </code></pre> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 19:22:58 </code></pre> <p>On 04/09/2018 08:29 PM, Mark Dilger wrote:</p> <blockquote> <blockquote> <p>On Apr 9, 2018, at 10:26 AM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote: We have plenty of YEARS of people not noticing this issue</p> </blockquote> <p>I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.</p> </blockquote> <p>Yeah. There's a big difference between not noticing an issue because it does not happen very often vs. attributing it to something else. 
If we had the ability to revisit past data corruption cases, we would probably discover a fair number of cases caused by this.</p> <p>The other thing we probably need to acknowledge is that the environment changes significantly - things like thin provisioning are likely to get even more common, increasing the incidence of these issues.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-09 19:25:33 </code></pre> <p>On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>+1</p> <blockquote> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.</p> <p>That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres databases corruption that gets attributed to running out of disk space. Looks like I was dead wrong.</p> <p>[1] <a href="file-consistency/">file-consistency/</a> -- &quot;Filesystem correctness&quot;</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 19:26:21 </code></pre> <p>On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:</p> <blockquote> <p>Honestly I don't think there's <em>any</em> way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can be having silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).</p> </blockquote> <p>That is indeed true (sync would induce fsync on open inodes and clear the error), and that's a nasty bug that apparently went unnoticed for a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at least this issue, but similar fixes need to be adopted by many other kernels (all those that mark failed pages as clean).</p> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> <p>What about having buffered IO with implied fsync() atomicity via O_SYNC? 
This would probably necessitate some helper threads that mask the latency and present an async interface to the rest of PG, but sounds less intrusive than going for DIO.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:29:16 </code></pre> <p>On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>What about having buffered IO with implied fsync() atomicity via O_SYNC?</p> </blockquote> <p>You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:37:03 </code></pre> <p>On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> </blockquote> <p>Why is that required? You could very well just keep per inode information about fatal failures that occurred around. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.</p> <hr> <pre><code>From:Justin Pryzby &lt;pryzby(at)telsasoft(dot)com&gt; Date:2018-04-09 19:41:19 </code></pre> <p>On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:</p> <blockquote> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?</p> </blockquote> <p>I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.</p> <p>I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually not fine for nondefault tablespaces..</p> <p>On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).</p> </blockquote> <p>For postgres that'd require backend processes to open() an file such that, following its close(), any writeback errors are &quot;signalled&quot; to the checkpointer process...</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 19:44:31 </code></pre> <p>On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>What about having buffered IO with implied fsync() atomicity via O_SYNC?</p> </blockquote> <p>You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.</p> </blockquote> <p>I am assuming you can apply the same principle of selectively using O_SYNC at times and places that you'd currently actually call fsync().</p> <p>Also assuming that you'd want to have a backwards-compatible solution for all those kernels that don't keep the pages around, irrespective of future fixes. 
Short of loading a kernel module and dealing with the problem directly, the only other available options seem to be either O_SYNC, O_DIRECT or ignoring the issue.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 19:47:44 </code></pre> <p>On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.</p> </blockquote> <p>Sure, there could be knobs for limiting how much memory such &quot;zombie&quot; pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).</p> </blockquote> <p>I don't quite see how this is any different from other possible issues when running multiple applications on the same system. One application can generate a lot of dirty data, reaching dirty_bytes and forcing the other applications on the same host to do synchronous writes.</p> <p>Of course, you might argue that is a temporary condition - it will resolve itself once the dirty pages get written to storage. In case of an I/O issue, it is a permanent impact - it will not resolve itself unless the I/O problem gets fixed.</p> <p>Not sure what interfaces would need to be written? Possibly something that says &quot;drop dirty pages for these files&quot; after the application gets killed or something. That makes sense, of course.</p> <blockquote> <blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.</p> </blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> </blockquote> <p>In my experience when you have a persistent I/O error on a device, it likely affects all applications using that device. 
So unmounting the fs to clear the dirty pages seems like an acceptable solution to me.</p> <p>I don't see what else the application should do? In a way I'm suggesting applications don't really want to be responsible for recovering (cleanup or dirty pages etc.). We're more than happy to hand that over to kernel, e.g. because each kernel will do that differently. What we however do want is reliable information about fsync outcome, which we need to properly manage WAL, checkpoints etc.</p> <blockquote> <p>Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.</p> <blockquote> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> </blockquote> <p>I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.</p> </blockquote> <p>Right. What I was getting to is that perhaps the current fsync() behavior is not very practical for building actual applications.</p> <blockquote> <p>Best regards, Anthony</p> <p>[1] <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf</a></p> </blockquote> <p>Thanks. The paper looks interesting.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 19:51:12 </code></pre> <p>On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:</p> <blockquote> <p>On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> </blockquote> <p>Why is that required? You could very well just keep per inode information about fatal failures that occurred around. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. 
But it's orders of magnitude less, and results in usable semantics.</p> </blockquote> <p>As discussed before, I think this could be acceptable, especially if you pair it with an opt-in mechanism (only applications that care to deal with this will have to), and would give it a shot.</p> <p>Still need a way to deal with all other systems and prior kernel releases that are eating fsync() writeback errors even over sync().</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 19:54:05 </code></pre> <p>On 04/09/2018 09:37 PM, Andres Freund wrote:</p> <blockquote> <p>On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> </blockquote> <p>Why is that required? You could very well just keep per inode information about fatal failures that occurred around. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.</p> </blockquote> <p>Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?</p> <p>Of course, it's also possible to do what you suggested, and simply mark the inode as failed. In which case the next fsync can't possibly retry the writes (e.g. after freeing some space on thin-provisioned system), but we'd get reliable failure mode.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:59:34 </code></pre> <p>On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:</p> <blockquote> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?</p> </blockquote> <p>I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.</p> <p>On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).</p> </blockquote> <p>For postgres that'd require backend processes to open() an file such that, following its close(), any writeback errors are &quot;signalled&quot; to the checkpointer process...</p> </blockquote> <p>I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close() ing to make room if necessary). The biggest complication would presumably be to deduplicate the received filedescriptors for the same file, without loosing track of any errors.</p> <p>Even better, we could do so via a dedicated worker. 
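</p> <p>(Mechanically, handing an open file descriptor from one process to another is typically done on Linux with a UNIX-domain socketpair and an SCM_RIGHTS control message rather than an ordinary pipe. The sketch below illustrates the sending side of the hand-off idea; it is not PostgreSQL code, and the function name is made up.)</p> <pre><code>#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;sys/uio.h&gt;

/*
 * Sketch of the fd hand-off idea described above (not actual PostgreSQL
 * code): a backend passes a file descriptor it has dirtied to the
 * checkpointer over a UNIX-domain socketpair using SCM_RIGHTS.  The
 * checkpointer recvmsg()s it, fsync()s it, and close()s it.
 */
int send_dirty_fd(int sock, int fd_to_pass)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    char payload = 'F';                     /* one dummy data byte is required */
    char cmsgbuf[CMSG_SPACE(sizeof(int))];

    memset(&amp;msg, 0, sizeof(msg));
    memset(cmsgbuf, 0, sizeof(cmsgbuf));

    iov.iov_base = &amp;payload;
    iov.iov_len = 1;
    msg.msg_iov = &amp;iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    cmsg = CMSG_FIRSTHDR(&amp;msg);
    cmsg-&gt;cmsg_level = SOL_SOCKET;
    cmsg-&gt;cmsg_type = SCM_RIGHTS;
    cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &amp;fd_to_pass, sizeof(int));

    return sendmsg(sock, &amp;msg, 0);          /* -1 on error */
}
</code></pre> <p>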
That'd quite possibly end up as a performance benefit.</p> <blockquote> <p>I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually not fine for nondefault tablespaces..</p> </blockquote> <p>I think it'd be fair to open PG_VERSION of all created tablespaces. Would require some hangups to signal checkpointer (or whichever process) to do so when creating one, but it shouldn't be too hard. Some people would complain because they can't do some nasty hacks anymore, but it'd also save peoples butts by preventing them from accidentally unmounting.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 20:04:20 </code></pre> <p>Hi,</p> <p>On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?</p> </blockquote> <p>Some people expect that, I personally don't think it's a useful expectation.</p> <p>We should just deal with this by crash-recovery. The big problem I see is that you always need to keep an file descriptor open for pretty much any file written to inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).</p> <p>What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.</p> <p>I also wonder if we couldn't just somewhere read each relevant mounted filesystem's errseq value. Whenever checkpointer notices before finishing a checkpoint that it has changed, do a crash restart.</p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 20:25:54 </code></pre> <blockquote> <p>On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. 
But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.</p> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> <p>Can anybody clarify this for non-core-hacker folks following along at home?</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 20:30:00 </code></pre> <p>On 04/09/2018 10:04 PM, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?</p> </blockquote> <p>Some people expect that, I personally don't think it's a useful expectation.</p> </blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> <p>And most importantly, it's rather delusional to think the kernel developers are going to be enthusiastic about that approach ...</p> <blockquote> <p>We should just deal with this by crash-recovery. The big problem I see is that you always need to keep an file descriptor open for pretty much any file written to inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).</p> </blockquote> <p>Sure. And it's likely way less invasive from kernel perspective.</p> <blockquote> <p>What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.</p> </blockquote> <p>IMHO it'd be enough if a remount clears it.</p> <blockquote> <p>I also wonder if we couldn't just somewhere read each relevant mounted filesystem's errseq value. Whenever checkpointer notices before finishing a checkpoint that it has changed, do a crash restart.</p> </blockquote> <p>Hmmmm, that's an interesting idea, and it's about the only thing that would help us on older kernels. There's a wb_err in adress_space, but that's at inode level. Not sure if there's something at fs level.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 20:34:15 </code></pre> <p>Hi,</p> <p>On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:</p> <blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.</p> </blockquote> <p>I don't see that as a real problem here. 
For one the problematic scenarios shouldn't readily apply, for another WAL is checksummed.</p> <p>There's the problem that a new basebackup would potentially become corrupted however. And similarly pg_rewind.</p> <p>Note that I'm not saying that we and/or linux shouldn't change anything. Just that the apocalypse isn't here.</p> <blockquote> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> </blockquote> <p>I think that's basically right. There's cases where corruption could get propagated, but they're not straightforward.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 20:37:31 </code></pre> <p>Hi,</p> <p>On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> </blockquote> <p>Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 20:43:03 </code></pre> <p>On 04/09/2018 10:25 PM, Mark Dilger wrote:</p> <blockquote> <blockquote> <p>On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. 
That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.</p> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> <p>Can anybody clarify this for non-core-hacker folks following along at home?</p> </blockquote> <p>That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:</p> <p>(0) checkpoint happens on the primary</p> <p>(1) a page gets modified, a full-page gets written to WAL</p> <p>(2) the page is written out to page cache</p> <p>(3) writeback of that page fails (and gets discarded)</p> <p>(4) we attempt to modify the page again, but we read the stale version</p> <p>(5) we modify the stale version, writing the change to WAL</p> <p>The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...</p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 20:55:29 </code></pre> <blockquote> <p>On Apr 9, 2018, at 1:43 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:</p> <p>On 04/09/2018 10:25 PM, Mark Dilger wrote:</p> <blockquote> <blockquote> <p>On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. 
But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.</p> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> <p>Can anybody clarify this for non-core-hacker folks following along at home?</p> </blockquote> <p>That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:</p> <p>(0) checkpoint happens on the primary</p> <p>(1) a page gets modified, a full-page gets written to WAL</p> <p>(2) the page is written out to page cache</p> <p>(3) writeback of that page fails (and gets discarded)</p> <p>(4) we attempt to modify the page again, but we read the stale version</p> <p>(5) we modify the stale version, writing the change to WAL</p> <p>The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...</p> </blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption. When choosing to have one standby, or two standbys, or ten standbys, one needs to be able to assume a certain amount of statistical independence between failures on one server and failures on another. If they are tightly correlated dependent variables, then the conclusion that the probability of all nodes failing simultaneously is vanishingly small becomes invalid.</p> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 21:08:29 </code></pre> <p>Hi,</p> <p>On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:</p> <blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.</p> </blockquote> <p>I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. 
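</p> <p>(A sketch of what pre-allocating space could look like: reserve the blocks with posix_fallocate() when a file is extended, so that ENOSPC is reported synchronously at extension time instead of surfacing later as a writeback failure. The path and size are illustrative.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/*
 * Sketch of the pre-allocation approach for the ENOSPC case: reserve
 * the blocks when the file is extended, so a full volume is reported
 * here, synchronously, rather than as a later writeback failure.
 * Path and size are illustrative.
 */
int main(void)
{
    int fd = open("relation.segment", O_RDWR | O_CREAT, 0600);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Returns an errno value directly; glibc falls back to writing
     * zeroes on filesystems without native fallocate support. */
    int rc = posix_fallocate(fd, 0, 1024L * 1024 * 1024);
    if (rc != 0) {
        errno = rc;
        perror("posix_fallocate");   /* e.g. ENOSPC on a full volume */
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}
</code></pre> <p>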
The problem is different, the solution is different.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 21:25:52 </code></pre> <p>On 04/09/2018 11:08 PM, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:</p> <blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.</p> </blockquote> <p>I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.</p> </blockquote> <p>In any case, that certainly does not count as data corruption spreading from the master to standby.</p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 21:33:29 </code></pre> <blockquote> <p>On Apr 9, 2018, at 2:25 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:</p> <p>On 04/09/2018 11:08 PM, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:</p> <blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.</p> </blockquote> <p>I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.</p> </blockquote> </blockquote> <p>I'm happy to take your word for that.</p> <blockquote> <p>In any case, that certainly does not count as data corruption spreading from the master to standby.</p> </blockquote> <p>Maybe not from the point of view of somebody looking at the code. But a user might see it differently. If the data being loaded into the master and getting replicated to the standby &quot;causes&quot; both to get corrupt, then it seems like corruption spreading. I put &quot;causes&quot; in quotes because there is some argument to be made about &quot;correlation does not prove cause&quot; and so forth, but it still feels like causation from an arms length perspective. If there is a pattern of standby servers tending to fail more often right around the time that the master fails, you'll have a hard time comforting users, &quot;hey, it's not technically causation.&quot; If loading data into the master causes the master to hit ENOSPC, and replicating that data to the standby causes the standby to hit ENOSPC, and if the bug abound ENOSPC has not been fixed, then this looks like corruption spreading.</p> <p>I'm certainly planning on taking a hard look at the disk allocation on my standby servers right soon now.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-09 22:33:16 </code></pre> <p>On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. 
So it's not an entirely insane design, apparently.</p> </blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> </blockquote> <p>Right. For anyone interested, here is the change you mentioned, and an interesting one that came a bit earlier last year:</p> <ul> <li><a href="https://reviews.freebsd.org/rS316941">https://reviews.freebsd.org/rS316941</a> -- drop buffers after device goes away</li> <li><a href="https://reviews.freebsd.org/rS326029">https://reviews.freebsd.org/rS326029</a> -- update comment about EIO contract</li> </ul> <p>Retrying may well be futile, but at least future fsync() calls won't report success bogusly. There may of course be more space-efficient ways to represent that state as the comment implies, while never lying to the user -- perhaps involving filesystem level or (pinned) inode level errors that stop all writes until unmounted. Something tells me they won't resort to flakey fsync() error reporting.</p> <p>I wonder if anyone can tell us what Windows, AIX and HPUX do here.</p> <blockquote> <p>[1] <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf</a></p> </blockquote> <p>Very interesting, thanks.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-10 00:32:20 </code></pre> <p>On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>I wonder if anyone can tell us what Windows, AIX and HPUX do here.</p> </blockquote> <p>I created a wiki page to track what we know (or think we know) about fsync() on various operating systems:</p> <p><a href="https://wiki.postgresql.org/wiki/Fsync_Errors">https://wiki.postgresql.org/wiki/Fsync_Errors</a></p> <p>If anyone has more information or sees mistakes, please go ahead and edit it.</p> <hr> <pre><code>From:Andreas Karlsson &lt;andreas(at)proxel(dot)se&gt; Date:2018-04-10 00:41:10 </code></pre> <p>On 04/09/2018 02:16 PM, Craig Ringer wrote:</p> <blockquote> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. 
I'd even say that's ideal.</p> </blockquote> <p>Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification of that something went wrong?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 01:44:59 </code></pre> <p>On 10 April 2018 at 03:59, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:</p> <blockquote> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?</p> </blockquote> <p>I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.</p> <p>On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).</p> </blockquote> <p>For postgres that'd require backend processes to open() an file such that, following its close(), any writeback errors are &quot;signalled&quot; to the checkpointer process...</p> </blockquote> <p>I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close() ing to make room if necessary). The biggest complication would presumably be to deduplicate the received filedescriptors for the same file, without loosing track of any errors.</p> </blockquote> <p>Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.</p> <p>Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.</p> <p>We'd need a way to dup() the fd and pass it back to a backend when it needed to reopen it sometimes, or just make sure to keep the oldest copy of the fd when a backend reopens multiple times, but that's no biggie.</p> <p>We'd still have to fsync() out early in the checkpointer if we ran out of space in our FD list, and initscripts would need to change our ulimit or we'd have to do it ourselves in the checkpointer. But neither seems insurmountable.</p> <p>FWIW, I agree that this is a corner case, but it's getting to be a pretty big corner with the spread of overcommitted, dedupliating SANs, cloud storage, etc. Not all I/O errors indicate permanent hardware faults, disk failures, etc, as I outlined earlier. I'm very curious to know what AWS EBS's error semantics are, and other cloud network block stores. 
(I posted on Amazon forums <a href="https://forums.aws.amazon.com/thread.jspa?threadID=279274&amp;tstart=0">https://forums.aws.amazon.com/thread.jspa?threadID=279274&amp;tstart=0</a> but nothing so far).</p> <p>I'm also not particularly inclined to trust that all file systems will always reliably reserve space without having some cases where they'll fail writeback on space exhaustion.</p> <p>So we don't need to panic and freak out, but it's worth looking at the direction the storage world is moving in, and whether this will become a bigger issue over time.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-10 01:52:21 </code></pre> <p>On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 10 April 2018 at 03:59, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close() ing to make room if necessary). The biggest complication would presumably be to deduplicate the received filedescriptors for the same file, without loosing track of any errors.</p> </blockquote> <p>Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.</p> <p>Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.</p> </blockquote> <p>How would that interlock with concurrent checkpoints?</p> <p>I can see how to make that work if the share-fd-or-fsync-now logic happens in smgrwrite() when called by FlushBuffer() while you hold io_in_progress, but not if you defer it to some random time later.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 01:54:30 </code></pre> <p>On 10 April 2018 at 04:25, Mark Dilger <hornschnorter(at)gmail(dot)com> wrote:</p> <blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.</p> </blockquote> <p>Yes, it can, but not directly through the first error.</p> <p>What can happen is that we think a block got written when it didn't.</p> <p>If our in memory state diverges from our on disk state, we can make subsequent WAL writes based on that wrong information. But that's actually OK, since the standby will have replayed the original WAL correctly.</p> <p>I think the only time we'd run into trouble is if we evict the good (but not written out) data from s_b and the fs buffer cache, then later read in the old version of a block we failed to overwrite. Data checksums (if enabled) might catch it unless the write left the whole block stale. In that case we might generate a full page write with the stale block and propagate that over WAL to the standby.</p> <p>So I'd say standbys are relatively safe - very safe if the issue is caught promptly, and less so over time. 
But AFAICS WAL-based replication (physical or logical) is not a perfect defense for this.</p> <p>However, remember, if your storage system is free of any sort of overprovisioning, is on a non-network file system, and doesn't use multipath (or sets it up right) this issue <em>is exceptionally unlikely to affect you</em>.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 01:59:03 </code></pre> <p>On 10 April 2018 at 04:37, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> </blockquote> <p>Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.</p> </blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> <p>EXT4 and XFS don't allocate until later, it by performing actual writes to FS metadata, initializing disk blocks, etc. So we won't notice errors that are only detectable at actual time of allocation, like thin provisioning problems, until after write() returns and we face the same writeback issues.</p> <p>So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-10 02:00:59 </code></pre> <p>On April 9, 2018 6:59:03 PM PDT, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 10 April 2018 at 04:37, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> </blockquote> <p>Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.</p> </blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> <p>EXT4 and XFS don't allocate until later, it by performing actual writes to FS metadata, initializing disk blocks, etc. 
So we won't notice errors that are only detectable at actual time of allocation, like thin provisioning problems, until after write() returns and we face the same writeback issues.</p> <p>So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.</p> </blockquote> <p>How's that not solved by pre zeroing and/or fallocate as I suggested above?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 02:02:48 </code></pre> <p>On 10 April 2018 at 08:41, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:</p> <blockquote> <p>On 04/09/2018 02:16 PM, Craig Ringer wrote:</p> <blockquote> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. I'd even say that's ideal.</p> </blockquote> <p>Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification of that something went wrong?</p> </blockquote> <p>We'd examine the notification queue only once all our checkpoint fsync()s had succeeded, and before we updated the control file to advance the redo position.</p> <p>I'm intrigued by the suggestion upthread of using a kprobe or similar to achieve this. It's a horrifying unportable hack that'd make kernel people cry, and I don't know if we have any way to flush buffered probe data to be sure we really get the news in time, but it's a cool idea too.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-04-10 05:04:13 </code></pre> <p>On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:</p> <blockquote> <p>Another consequence of this behavior that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.</p> </blockquote> <p>And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 05:37:19 </code></pre> <p>On 10 April 2018 at 13:04, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:</p> <blockquote> <p>Another consequence of this behavior that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.</p> </blockquote> <p>And pg_basebackup. And pg_dump. And pg_dumpall. 
Anything using initdb -S or fsync_pgdata would enter in those waters.</p> </blockquote> <p>... but <em>only if they hit an I/O error</em> or they're on a FS that doesn't reserve space and hit ENOSPC.</p> <p>It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly throw up our hands and give up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-04-10 06:10:21 </code></pre> <p>On Tue, Apr 10, 2018 at 01:37:19PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 10 April 2018 at 13:04, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.</p> </blockquote> <p>... but <em>only if they hit an I/O error</em> or they're on a FS that doesn't reserve space and hit ENOSPC.</p> </blockquote> <p>Sure.</p> <blockquote> <p>It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly throw up our hands and give up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.</p> </blockquote> <p>Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 12:15:15 </code></pre> <p>On 10 April 2018 at 14:10, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.</p> </blockquote> <p>Yup.</p> <p>In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.</p> <p>I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.</p> <p>For example, when we fsync() a temp file before we move it into place, there's no point panicing on failure, because we'll discard the temp file on ERROR and retry the whole thing.</p> <p>I've verified that it works as expected with some modifications to the test tool I've been using (pushed).</p> <p>The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.</p> <p>Patch attached.</p> <p>To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. 
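</p> <p>(Schematically, the change is from &quot;ERROR and retry the fsync() at the next checkpoint&quot; to &quot;treat a failed fsync() as fatal and go through crash recovery&quot;, since a retried fsync() can falsely report success once the kernel has dropped the dirty pages. The sketch below only illustrates the idea and is not the patch itself.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/*
 * Sketch of the policy change, not the actual patch: if fsync() fails,
 * don't ERROR and retry at the next checkpoint, because a retried
 * fsync() may report success even though the kernel already dropped
 * the dirty pages.  Treat the failure as fatal and rely on WAL replay
 * after the restart instead.
 */
static void fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        /* roughly corresponds to ereport(PANIC, ...) in PostgreSQL */
        fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
}
</code></pre> <p>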
This does NOT address any of the issues where we won't find out about writeback errors at all.</p> <p>AttachmentContent-TypeSize v1-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patchtext/x-patch10.3 KB</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-10 15:15:46 </code></pre> <p>On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't find reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).</p> <blockquote> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I think that reliable error reporting is more than &quot;nice&quot; -- I think it's essential. The only argument for the current Linux behavior that has been so far advanced on this thread, at least as far as I can see, is that if it kept retrying the buffers forever, it would be pointless and might run the machine out of memory, so we might as well discard them. But previous comments have already illustrated that the kernel is not really up against a wall there -- it could put individual inodes into a permanent failure state when it discards their dirty data, as you suggested, or it could do what others have suggested, and what I think is better, which is to put the whole filesystem into a permanent failure state that can be cleared by remounting the FS. That could be done on an as-needed basis -- if the number of dirty buffers you're holding onto for some filesystem becomes too large, put the filesystem into infinite-fail mode and discard them all. That behavior would be pretty easy for administrators to understand and would resolve the entire problem here provided that no PostgreSQL processes survived the eventual remount.</p> <p>I also don't really know what we mean by an &quot;unresolvable&quot; error. If the drive is beyond all hope, then it doesn't really make sense to talk about whether the database stored on it is corrupt. In general we can't be sure that we'll even get an error - e.g. the system could be idle and the drive could be on fire. Maybe this is the case you meant by &quot;it'd be nice if we could report it reliably&quot;. But at least in my experience, that's typically not what's going on. You get some I/O errors and so you remount the filesystem, or reboot, or rebuild the array, or ... something. And then the errors go away and, at that point, you want to run recovery and continue using your database. 
In this scenario, it matters <em>quite a bit</em> what the error reporting was like during the period when failures were occurring. In particular, if the database was allowed to think that it had successfully checkpointed when it didn't, you're going to start recovery from the wrong place.</p> <p>I'm going to shut up now because I'm telling you things that you obviously already know, but this doesn't sound like a &quot;near irresolvable corner case&quot;. When the storage goes bonkers, either PostgreSQL and the kernel can interact in such a way that a checkpoint can succeed without all of the relevant data getting persisted, or they don't. It sounds like right now they do, and I'm not really clear that we have a reasonable idea how to fix that. It does not sound like a PANIC is sufficient.</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-10 15:28:07 </code></pre> <p>On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>... but <em>only if they hit an I/O error</em> or they're on a FS that doesn't reserve space and hit ENOSPC.</p> <p>It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly throw up our hands and give up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.</p> </blockquote> <p>I think we'd need every child postgres process started by initdb to do that individually, which I suspect would slow down initdb quite a lot. Now admittedly for anybody other than a PostgreSQL developer that's only a minor issue, and our regression tests mostly run with fsync=off anyway. But I have a strong suspicion that our assumptions about how fsync() reports errors are baked into an awful lot of parts of the system, and by the time we get unbaking them I think it's going to be really surprising if we haven't done real harm to overall system performance.</p> <p>BTW, I took a look at the MariaDB source code to see whether they've got this problem too and it sure looks like they do. os_file_fsync_posix() retries the fsync in a loop with an 0.2 second sleep after each retry. It warns after 100 failures and fails an assertion after 1000 failures. It is hard to understand why they would have written the code this way unless they expect errors reported by fsync() to continue being reported until the underlying condition is corrected. But, it looks like they wouldn't have the problem that we do with trying to reopen files to fsync() them later -- I spot checked a few places where this code is invoked and in all of those it looks like the file is already expected to be open.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-10 15:40:05 </code></pre> <p>Hi Robert,</p> <p>On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. 
If you can't find reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).</p> </blockquote> <p>I think it would be interesting to get in touch with some of the respective linux kernel maintainers and open up this topic for more detailed discussions. LSF/MM'18 is upcoming and it would have been the perfect opportunity but it's past the CFP deadline. It may still worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-10 16:38:27 </code></pre> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> </blockquote> <p>To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack.</p> </blockquote> <p>Well Postgres is using the filesystem. The interface between the block layer and the filesystem may indeed need to be more complex, I wouldn't know.</p> <p>But I don't think &quot;all possibilities&quot; is a very useful concept. Neither layer here is going to be perfect. They can only promise that all possibilities that have actually been implemented have been exhausted. And even among those only to the degree they can be done automatically within the engineering tradeoffs and constraints. There will always be cases like thin provisioned devices that an operator can expand, or degraded raid arrays that can be repaired after a long operation and so on. A network device can't be sure whether a remote server may eventually come back or not and have to be reconfigured by a human or system automation tool to point to the new server or new network configuration.</p> <blockquote> <p>Right. This implies though that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, that there's further an interface for obtaining the exact failed pages so that you can read them back.</p> </blockquote> <p>No, the interface we have is fsync which gives us that information with the granularity of a single file. The database could in theory recognize that fsync is not completing on a file and read that file back and write it to a new file. More likely we would implement a feature Oracle has of writing key files to multiple devices. But currently in practice that's not what would happen, what would happen would be a human would recognize that the database has stopped being able to commit and there are hardware errors in the log and would stop the database, take a backup, and restore onto a new working device. 
The current interface is that there's one error and then Postgres would pretty much have to say, &quot;sorry, your database is corrupt and the data is gone, restore from your backups&quot;. Which is pretty dismal.</p> <blockquote> <p>There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s where ever even buffered in the kernel page cache in the first place.</p> </blockquote> <p>Postgres cannot just store the entire database in RAM. It writes things to the filesystem all the time. It calls fsync only when it needs a write barrier to ensure consistency. That's only frequent on the transaction log to ensure it's flushed before data modifications and then periodically to checkpoint the data files. The amount of data written between checkpoints can be arbitrarily large and Postgres has no idea how much memory is available as filesystem buffers or how much i/o bandwidth is available or other memory pressure there is. What you're suggesting is that the application should have to babysit the filesystem buffer cache and reimplement all of it in user-space because the filesystem is free to throw away any data any time it chooses?</p> <p>The current interface to throw away filesystem buffer cache is unmount. It sounds like the kernel would like a more granular way to discard just part of a device which makes a lot of sense in the age of large network block devices. But I don't think just saying that the filesystem buffer cache is now something every application needs to re-implement in user-space really helps with that, they're going to have the same problems to solve.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-10 16:54:40 </code></pre> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it'spreallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-10 18:58:37 </code></pre> <p>-hackers,</p> <p>I reached out to the Linux ext4 devs, here is tytso(at)mit(dot)edu response:</p> <p>&quot;&quot;&quot; Hi Joshua,</p> <p>This isn't actually an ext4 issue, but a long-standing VFS/MM issue.</p> <p>There are going to be multiple opinions about what the right thing to do. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.</p> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. 
The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the time since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> <p>Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.</p> <p>Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> <p>Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.</p> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> <p>I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to commission this work, it's almost certainly soluble. Or if a company which has a kernel team on staff is willing to direct an engineer to work on it, it certainly could be solved. 
But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.</p> <p>In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.</p> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...</p> <pre><code>- Ted </code></pre> <p>&quot;&quot;&quot;</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-10 19:51:01 </code></pre> <p>-hackers,</p> <p>The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?</p> <p>Thanks,</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-10 20:57:34 </code></pre> <p>On 04/10/2018 12:51 PM, Joshua D. Drake wrote:</p> <blockquote> <p>-hackers,</p> <p>The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?</p> </blockquote> <p>Thanks to Anthony for this link:</p> <p><a href="http://lists.openwall.net/linux-ext4/2018/04/10/33">http://lists.openwall.net/linux-ext4/2018/04/10/33</a></p> <p>It isn't quite real time but it keeps things close enough.</p> <hr> <pre><code>From:Jonathan Corbet &lt;corbet(at)lwn(dot)net&gt; Date:2018-04-11 12:05:27 </code></pre> <p>On Tue, 10 Apr 2018 17:40:05 +0200 Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>LSF/MM'18 is upcoming and it would have been the perfect opportunity but it's past the CFP deadline. It may still worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.</p> </blockquote> <p>FWIW, it is my current intention to be sure that the development community is at least aware of the issue by the time LSFMM starts.</p> <p>The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-11 12:23:49 </code></pre> <p>On 10 April 2018 at 19:58, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:</p> <blockquote> <p>You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off.</p> </blockquote> <p>I always wondered why Linux didn't implement umount -f. It's been in BSD since forever and it's a major annoyance that it's missing in Linux. 
Even without leaking memory it still leaks other resources, causes confusion and awkward workarounds in UI and automation software.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-11 14:29:09 </code></pre> <p>Hi,</p> <p>On 2018-04-11 06:05:27 -0600, Jonathan Corbet wrote:</p> <blockquote> <p>The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.</p> </blockquote> <p>If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.</p> <p>Thanks for chiming in,</p> <hr> <pre><code>From:Jonathan Corbet &lt;corbet(at)lwn(dot)net&gt; Date:2018-04-11 14:40:31 </code></pre> <p>On Wed, 11 Apr 2018 07:29:09 -0700 Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.</p> </blockquote> <p>OK, I've dropped the PC a note; hopefully you'll be hearing from them.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:19:53 </code></pre> <p>On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> </blockquote> <p>I don't think the write is <em>sent</em> to the NFS at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:29:17 </code></pre> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> <p>My more-considered crazy idea is to have a postgresql.conf setting like archive_command that allows the administrator to specify a command that will be run <em>after</em> fsync but before the checkpoint is marked as complete. 
While we can have write flush errors before fsync and never see the errors during fsync, we will not have write flush errors <em>after</em> fsync that are associated with previous writes.</p> <p>The script should check for I/O or space-exhaustion errors and return false in that case, in which case we can stop and maybe stop and crash recover. We could have an exit of 1 do the former, and an exit of 2 do the later.</p> <p>Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:</p> <pre><code> #wal_sync_method = fsync # the default is the first option # supported by the operating system: # open_datasync --&gt; # fdatasync (default on Linux) --&gt; # fsync --&gt; # fsync_writethrough # open_sync </code></pre> <p>I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:32:45 </code></pre> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> <p>Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it. Does O_DIRECT work in such container cases?</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-17 21:34:53 </code></pre> <p>On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:</p> <blockquote> <p>Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:</p> <pre><code> &gt; #wal_sync_method = fsync # the default is the first option &gt; # supported by the operating system: &gt; # open_datasync &gt; --&gt; # fdatasync (default on Linux) &gt; --&gt; # fsync &gt; --&gt; # fsync_writethrough &gt; # open_sync </code></pre> <p>I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.</p> </blockquote> <p>Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.</p> <p>O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-17 21:41:42 </code></pre> <p>On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). 
Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> </blockquote> <p>You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.</p> <p>There's not <em>that</em> many different type of error messages and they don't change that often. If we'd just detect error for the most common FSs we'd probably be good. Detecting a few general storage layer message wouldn't be that hard either, most things have been unified over the last ~8-10 years.</p> <blockquote> <p>Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.</p> </blockquote> <p>Not sure what you mean?</p> <p>Space exhaustiion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.</p> <blockquote> <p>Does O_DIRECT work in such container cases?</p> </blockquote> <p>Yes.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:49:42 </code></pre> <p>On Mon, Apr 9, 2018 at 12:25:33PM -0700, Peter Geoghegan wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>+1</p> <blockquote> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.</p> <p>That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres databases corruption that gets attributed to running out of disk space. Looks like I was dead wrong.</p> </blockquote> <p>Yes, I think we need to look at user expectations here.</p> <p>If the device has a hardware write error, it is true that it is good to detect it, and it might be permanent or temporary, e.g. NAS/NFS. The longer the error persists, the more likely the user will expect corruption. However, right now, any length outage could cause corruption, and it will not be reported in all cases.</p> <p>Running out of disk space is also something you don't expect to corrupt your database --- you expect it to only prevent future writes. 
It seems NAS/NFS and any thin provisioned storage will have this problem, and again, not always reported.</p> <p>So, our initial action might just be to educate users that write errors can cause silent corruption, and out-of-space errors on NAS/NFS and any thin provisioned storage can cause corruption.</p> <p>Kernel logs (not just Postgres logs) should be monitored for these issues and fail-over/recovering might be necessary.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-18 09:52:22 </code></pre> <p>On Tue, Apr 17, 2018 at 02:34:53PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:</p> <blockquote> <p>Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:</p> <pre><code>&gt; &gt; #wal_sync_method = fsync # the default is the first option &gt; &gt; # supported by the operating system: &gt; &gt; # open_datasync &gt; &gt; --&gt; # fdatasync (default on Linux) &gt; &gt; --&gt; # fsync &gt; &gt; --&gt; # fsync_writethrough &gt; &gt; # open_sync </code></pre> <p>I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.</p> </blockquote> <p>Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.</p> </blockquote> <p>Well, we have been focusing on the delay between backend or checkpoint writes and checkpoint fsyncs. My point is that we have the same problem in doing a write, <em>then</em> fsync for the WAL. Yes, the delay is much shorter, but the issue still exists. I realize that newer Linux kernels will not have the problem since the file descriptor remains open, but the problem exists with older/common linux kernels.</p> <blockquote> <p>O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.</p> </blockquote> <p>Uh, don't 'open_datasync' and 'open_sync' fsync as part of the write, meaning we can't lose the error report like we can with the others?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-18 10:04:30 </code></pre> <p>On 18 April 2018 at 05:19, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? 
I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> </blockquote> <p>I don't think the write is <em>sent</em> to the NFS at the time of the write, so while the NFS side would reserve the space, it might get the write request until after we return write success to the process.</p> </blockquote> <p>It should be sent if you're using sync mode.</p> <p>From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.</p> <p>I need to look into this more re NFS and expand the tests I have to cover that properly.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-18 10:19:28 </code></pre> <p>On 10 April 2018 at 20:15, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 10 April 2018 at 14:10, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.</p> </blockquote> <p>Yup.</p> <p>In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.</p> <p>I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.</p> <p>For example, when we fsync() a temp file before we move it into place, there's no point panicing on failure, because we'll discard the temp file on ERROR and retry the whole thing.</p> <p>I've verified that it works as expected with some modifications to the test tool I've been using (pushed).</p> <p>The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.</p> <p>Patch attached.</p> <p>To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. This does NOT address any of the issues where we won't find out about writeback errors at all.</p> </blockquote> <p>Thinking about this some more, it'll definitely need a GUC to force it to continue despite a potential hazard. Otherwise we go backwards from the status quo if we're in a position where uptime is vital and correctness problems can be tolerated or repaired later. Kind of like zero_damaged_pages, we'll need some sort of continue_after_fsync_errors .</p> <p>Without that, we'll panic once, enter redo, and if the problem persists we'll panic in redo and exit the startup process. That's not going to help users.</p> <p>I'll amend the patch accordingly as time permits.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-18 11:46:15 </code></pre> <p>On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 18 April 2018 at 05:19, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). 
NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it'spreallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> </blockquote> <p>I don't think the write is <em>sent</em> to the NFS at the time of the write, so while the NFS side would reserve the space, it might get the write request until after we return write success to the process.</p> </blockquote> <p>It should be sent if you're using sync mode.</p> <blockquote> <p>From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.</p> </blockquote> <p>I need to look into this more re NFS and expand the tests I have to cover that properly.</p> </blockquote> <p>So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.</p> <p>So, what about thin provisioning? I can understand sharing <em>free</em> space among file systems, but once a write arrives I assume it reserves the space. Is the problem that many thin provisioning systems don't have a sync mode, so you can't force the write to appear on the device before an fsync?</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-18 11:56:57 </code></pre> <p>On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> </blockquote> <p>You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.</p> </blockquote> <p>I was thinking 'dmesg', but the result is similar.</p> <blockquote> <p>There's not <em>that</em> many different type of error messages and they don't change that often. If we'd just detect error for the most common FSs we'd probably be good. Detecting a few general storage layer message wouldn't be that hard either, most things have been unified over the last ~8-10 years.</p> </blockquote> <p>It is hard to know exactly what the message format should be for each operating system because it is hard to generate them on demand, and we would need to filter based on Postgres devices.</p> <p>The other issue is that once you see a message during a checkpoint and exit, you don't want to see that message again after the problem has been fixed and the server restarted. The simplest solution is to save the output of the last check and look for only new entries. 
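<p>[Aside, not part of the original thread: a rough sketch of the kind of kernel-log check being discussed. The matched strings are only examples, since real messages vary by kernel version, driver, and filesystem, and a real check would also need to filter by device and remember which entries it had already seen, as described above.]</p>
<pre><code>/* Sketch of a kernel-log check: scan dmesg output for storage-layer errors
 * and exit non-zero if any are found. Illustration only; the error strings
 * and the policy around them are assumptions, not a tested recipe. */
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(void)
{
    FILE *log = popen(&quot;dmesg&quot;, &quot;r&quot;);   /* or read /dev/kmsg directly */
    if (!log) { perror(&quot;popen&quot;); return 2; }

    char line[4096];
    int errors = 0;
    while (fgets(line, sizeof(line), log))
        if (strstr(line, &quot;I/O error&quot;) || strstr(line, &quot;EXT4-fs error&quot;))
            errors++;   /* a real check would only count entries newer than the last run */

    pclose(log);
    return errors ? 1 : 0;   /* non-zero exit: treat the checkpoint as failed */
}
</code></pre>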
I am attaching a script I run every 15 minutes from cron that emails me any unexpected kernel messages.</p> <p>I am thinking we would need a contrib module with sample scripts for various operating systems.</p> <blockquote> <blockquote> <p>Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.</p> </blockquote> <p>Not sure what you mean?</p> <p>Space exhaustiion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.</p> </blockquote> <p>I was asking about cases where permissions prevent viewing of kernel messages. I think you can view them in containers, but in virtual machines you might not have access to the host operating system's kernel messages, and that might be where they are.</p> <pre><code> AttachmentContent-TypeSize dmesg_checktext/plain574 bytes </code></pre> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-18 12:45:53 </code></pre> <p>wrOn 18 April 2018 at 19:46, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.</p> </blockquote> <p>Yeah. I need to verify in a concrete test case.</p> <p>The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to <em>allocate</em> the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.</p> <p>The Linux man pages (<a href="http://man7.org/linux/man-pages/man2/write.2.html">http://man7.org/linux/man-pages/man2/write.2.html</a>) say:</p> <blockquote> <p>A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.</p> </blockquote> <p>... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.</p> <blockquote> <p>So, what about thin provisioning? I can understand sharing <em>free</em> space among file systems</p> </blockquote> <p>Most thin provisioning is done at the block level, not file system level. So the FS is usually unaware it's on a thin-provisioned volume. Usually the whole kernel is unaware, because the thin provisioning is done on the SAN end or by a hypervisor. But the same sort of thing may be done via LVM - see lvmthin. For example, you may make 100 different 1TB ext4 FSes, each on 1TB iSCSI volumes backed by SAN with a total of 50TB of concrete physical capacity. The SAN is doing block mapping and only allocating storage chunks to a given volume when the FS has written blocks to every previous free block in the previous storage chunk. 
It may also do things like block de-duplication, compression of storage chunks that aren't written to for a while, etc.</p> <p>The idea is that when the SAN's actual physically allocate storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage <em>even if you don't know it</em>.</p> <p>Think of it as a bit like VM overcommit, for storage. You can malloc() as much memory as you like and everything's fine until you try to actually use it. Then you go to dirty a page, no free pages are available, and <em>boom</em>.</p> <p>The thing is, the SAN (or LVM) doesn't have any idea about the FS's internal in-memory free space counter and its space reservations. Nor does it understand any FS metadata. All it cares about is &quot;has this LBA ever been written to by the FS?&quot;. If so, it must make sure backing storage for it exists. If not, it won't bother.</p> <p>Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.</p> <p>I don't know if posix_fallocate is a sufficient safeguard either. You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say</p> <blockquote> <p>After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space.</p> </blockquote> <p>... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.</p> <p>It's reasonable enough to throw up our hands in this case and say &quot;your setup is crazy, you're breaking the rules, don't do that&quot;. The truth is they AREN'T breaking the rules, but we can disclaim support for such configurations anyway.</p> <p>After all, we tell people not to use Linux's VM overcommit too. How's that working for you? I see it enabled on the great majority of systems I work with, and some people are very reluctant to turn it off because they don't want to have to add swap.</p> <p>If someone has a 50TB SAN and wants to allow for unpredictable space use expansion between various volumes, and we say &quot;you can't do that, go buy a 100TB SAN instead&quot; ... that's not going to go down too well either. Often we can actually say &quot;make sure the 5TB volume PostgreSQL is using is eagerly provisioned, and expand it at need using online resize if required. We don't care about the rest of the SAN.&quot;.</p> <p>I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. 
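<p>[Aside, not part of the original thread: a sketch of the distinction Craig is drawing. posix_fallocate() satisfies the filesystem's own space accounting, but on a thin-provisioned volume only dirtying the blocks and flushing them forces the backing store to materialize, and even that depends on the storage layer. The chunk size and the zero-filling below are arbitrary illustrative choices, not a tested recipe.]</p>
<pre><code>/* Two levels of &quot;reserving&quot; space for a file. Illustration only. */
#include &lt;fcntl.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;

#define CHUNK (16 * 1024 * 1024)

int reserve_space(int fd, off_t len)
{
    /* 1. Filesystem-level reservation: protects against ENOSPC from the FS itself. */
    if (posix_fallocate(fd, 0, len) != 0)
        return -1;

    /* 2. To know the blocks exist on the backing device, actually write and flush
     *    them. Expensive, and still only as good as the storage layer underneath. */
    char *zeros = calloc(1, CHUNK);
    if (zeros == NULL)
        return -1;
    for (off_t off = 0; off &lt; len; off += CHUNK) {
        size_t n = (len - off &lt; CHUNK) ? (size_t)(len - off) : CHUNK;
        if (pwrite(fd, zeros, n, off) != (ssize_t)n) {
            free(zeros);
            return -1;
        }
    }
    free(zeros);
    return fsync(fd);
}
</code></pre>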
AWS are probably pretty good about not running out of backing store, though.</p> <p>There <em>are</em> file systems optimised for thin provisioning, etc, too. But that's more commonly done by having them do things like zero deallocated space so the thin provisioning system knows it can return it to the free pool, and now things like DISCARD provide much of that signalling in a standard way.</p> <hr> <pre><code>From:Mark Kirkwood &lt;mark(dot)kirkwood(at)catalyst(dot)net(dot)nz&gt; Date:2018-04-18 23:31:50 </code></pre> <p>On 19/04/18 00:45, Craig Ringer wrote:</p> <blockquote> <p>I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.</p> </blockquote> <p>Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins' thereof).</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-19 00:44:33 </code></pre> <p>On 19 April 2018 at 07:31, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:</p> <blockquote> <p>On 19/04/18 00:45, Craig Ringer wrote:</p> <blockquote> <p>I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.</p> </blockquote> <p>Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins' thereof).</p> </blockquote> <p>This still makes a lot of sense on AWS EBS, particularly when using a volume created from a non-empty snapshot. Performance of S3-snapshot based EBS volumes is spectacularly awful, since they're copy-on-read. Reading the whole volume helps a lot.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-20 20:49:08 </code></pre> <p>On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:</p> <blockquote> <p>wrOn 18 April 2018 at 19:46, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.</p> </blockquote> <p>Yeah. I need to verify in a concrete test case.</p> </blockquote> <p>Thanks.</p> <blockquote> <p>The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to <em>allocate</em> the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.</p> <p>The Linux man pages (<a href="http://man7.org/linux/man-pages/man2/write.2.html">http://man7.org/linux/man-pages/man2/write.2.html</a>) say:</p> <p>&quot; A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. 
In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. &quot;</p> <p>... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.</p> </blockquote> <p>Uh, even calling fsync after write isn't 100% safe since the kernel could have flushed the dirty pages to storage, and failed, and the fsync would later succeed. I realize newer kernels have that fixed for files open during that operation, but that is the minority of installs.</p> <blockquote> <p>The idea is that when the SAN's actual physically allocate storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage <em>even if you don't know it</em>.</p> <p>Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.</p> </blockquote> <p>I see what you are saying --- that the kernel is reserving the write space from its free space, but the free space doesn't all exist. I am not sure how we can tell people to make sure the file system free space is real.</p> <blockquote> <p>You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say</p> <p>&quot; After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. &quot;</p> <p>... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.</p> </blockquote> <p>Frankly, in what cases will a write fail <em>for</em> lack of free space? It could be a new WAL file (not recycled), or a pages added to the end of the table.</p> <p>Is that it? It doesn't sound too terrible. 
If we can eliminate the corruption due to free space exxhaustion, it would be a big step forward.</p> <p>The next most common failure would be temporary storage failure or storage communication failure.</p> <p>Permanent storage failure is &quot;game over&quot; so we don't need to worry about that.</p> <hr> <pre><code>From:Gasper Zejn &lt;zejn(at)owca(dot)info&gt; Date:2018-04-21 19:21:39 </code></pre> <p>Just for the record, I tried the test case with ZFS on Ubuntu 17.10 host with ZFS on Linux 0.6.5.11.</p> <p>ZFS does not swallow the fsync error, but the system does not handle the error nicely: the test case program hangs on fsync, the load jumps up and there's a bunch of z_wr_iss and z_null_int kernel threads belonging to zfs, eating up the CPU.</p> <p>Even then I managed to reboot the system, so it's not a complete and utter mess.</p> <p>The test case adjustments are here: <a href="https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c">https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c</a></p> <p>Kind regards,</p> <hr> <p>On 29. 03. 2018 07:25, Craig Ringer wrote:</p> <blockquote> <p>On 29 March 2018 at 13:06, Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com</p> <pre><code>On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby &gt; The retries are the source of the problem ; the first fsync() can return EIO, &gt; and also *clears the error* causing a 2nd fsync (of the same data) to return &gt; success. &gt; What I'm failing to grok here is how that error flag even matters, &gt; whether it's a single bit or a counter as described in that patch. If &gt; write back failed, *the page is still dirty*. So all future calls to &gt; fsync() need to try to try to flush it again, and (presumably) fail &gt; again (unless it happens to succeed this time around). </code></pre> <p>You'd think so. But it doesn't appear to work that way. You can see yourself with the error device-mapper destination mapped over part of a volume.</p> <p>I wrote a test case here.</p> <p><a href="https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c">https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c</a></p> <p>I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that's what I've described is just what's happening, brain-dead or no.</p> <p>Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.</p> </blockquote> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-23 20:14:48 </code></pre> <p>Hi,</p> <p>On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> </blockquote> <p>Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). 
We check close() return code, just log it... So close() counts as an fsync for such filesystems.</p> <p>I'm at LSF/MM to discuss future behaviour of linux here, but that's how it is right now.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-24 00:09:23 </code></pre> <p>On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> </blockquote> <p>Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.</p> </blockquote> <p>Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-26 02:16:52 </code></pre> <p>On 24 April 2018 at 04:14, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>I'm at LSF/MM to discuss future behaviour of linux here, but that's how it is right now.</p> </blockquote> <p>Interim LWN.net coverage of that can be found here: <a href="https://lwn.net/Articles/752613/">https://lwn.net/Articles/752613/</a></p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-27 01:18:55 </code></pre> <p>On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> </blockquote> <p>Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.</p> </blockquote> <p>Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.</p> </blockquote> <p>It seems like some implementations flush on close and therefore discover an ENOSPC problem at that point, unless they have NFSv4 (RFC 3050) &quot;write delegation&quot; with a promise from the server that a certain amount of space is available. 
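<p>[Aside, not part of the original thread: the practical upshot for a userspace program, sketched below. On filesystems that flush dirty data at close(), an ENOSPC or EIO can surface at write(), at fsync(), or only at close(), so all three return values have to be treated as possible write failures. This is an illustration of the point being made, not PostgreSQL's actual code path.]</p>
<pre><code>/* Illustration: treat a close() failure as a write failure. On NFS and
 * similar filesystems, close() may be where a deferred ENOSPC/EIO shows up. */
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

int write_file(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd &lt; 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t) len ||  /* may only dirty the page cache  */
        fsync(fd) != 0) {                        /* may report the error here ...  */
        int save_errno = errno;
        close(fd);
        errno = save_errno;
        return -1;
    }

    if (close(fd) != 0)      /* ... or it may only be reported here (flush-on-close) */
        return -1;

    return 0;
}
</code></pre>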
It seems like you can't count on that in any way though, because it's the server that decides when to delegate and how much space to promise is preallocated, not the client. So in userspace you always need to be able to handle errors including ENOSPC returned by close(), and if you ignore that and you're using an operating system that immediately incinerates all evidence after telling you that (so that later fsync() doesn't fail), you're in trouble.</p> <p>Some relevant code:</p> <ul> <li><a href="https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849">https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849</a></li> <li><a href="https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618">https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618</a></li> </ul> <p>It looks like the bleeding edge of the NFS spec includes a new ALLOCATE operation that should be able to support posix_fallocate() (if we were to start using that for extending files):</p> <p><a href="https://tools.ietf.org/html/rfc7862#page-64">https://tools.ietf.org/html/rfc7862#page-64</a></p> <p>I'm not sure how reliable [posix_]fallocate is on NFS in general though, and it seems that there are fall-back implementations of posix_fallocate() that write zeros (or even just feign success?) which probably won't do anything useful here if not also flushed (that fallback strategy might only work on eager reservation filesystems that don't have direct fallocate support?) so there are several layers (libc, kernel, nfs client, nfs server) that'd need to be aligned for that to work, and it's not clear how a humble userspace program is supposed to know if they are.</p> <p>I guess if you could find a way to amortise the cost of extending (like Oracle et al do by extending big container datafiles 10MB at a time or whatever), then simply writing zeros and flushing when doing that might work out OK, so you wouldn't need such a thing? (Unless of course it's a COW filesystem, but that's a different can of worms.)</p> <hr> <p><em>This thread continues on the <code>ext4</code> mailing list:</em> <hr></p> <pre><code>From: &quot;Joshua D. Drake&quot; &lt;jd@...mandprompt.com&gt; Subject: fsync() errors is unsafe and risks data loss Date: Tue, 10 Apr 2018 09:28:15 -0700 </code></pre> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> <hr> <pre><code>From: &quot;Darrick J. Wong&quot; &lt;darrick.wong@...cle.com&gt; Date: Tue, 10 Apr 2018 09:54:43 -0700 </code></pre> <p>On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:</p> <blockquote> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). 
You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> </blockquote> <p>You might try the XFS list (linux-xfs@...r.kernel.org) seeing as the initial complaint is against xfs behaviors...</p> <hr> <pre><code>From: &quot;Joshua D. Drake&quot; &lt;jd@...mandprompt.com&gt; Date: Tue, 10 Apr 2018 09:58:21 -0700 </code></pre> <p>On 04/10/2018 09:54 AM, Darrick J. Wong wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:</p> <blockquote> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> </blockquote> <p>You might try the XFS list (linux-xfs@...r.kernel.org) seeing as the initial complaint is against xfs behaviors...</p> </blockquote> <p>Later in the thread it becomes apparent that it applies to ext4 (NFS too) as well. I picked ext4 because I assumed it is the most populated of the lists since its the default filesystem for most distributions.</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Tue, 10 Apr 2018 14:43:56 -0400 </code></pre> <p>Hi Joshua,</p> <p>This isn't actually an ext4 issue, but a long-standing VFS/MM issue.</p> <p>There are going to be multiple opinions about what the right thing to do. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.</p> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. 
No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> <p>Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.</p> <p>Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> <p>Next, the reason why fsync() has the behaviour that it does is one ofhe the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.</p> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> <p>I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to comission this work, it's almost certainly soluble. Or if a company which has a kernel on staff is willing to direct an engineer to work on it, it certainly could be solved. But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) 
use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.</p> <p>In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.</p> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...</p> <hr> <pre><code>From: Andreas Dilger &lt;adilger@...ger.ca&gt; Date: Tue, 10 Apr 2018 13:44:48 -0600 </code></pre> <p>On Apr 10, 2018, at 10:50 AM, Joshua D. Drake <a href="mailto:jd@...mandprompt.com">jd@...mandprompt.com</a> wrote:</p> <blockquote> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> </blockquote> <p>Yes, this is a very long thread. The summary is Postgres is unhappy that fsync() on Linux (and also other OSes) returns an error once if there was a prior write() failure, instead of keeping dirty pages in memory forever and trying to rewrite them.</p> <p>This behaviour has existed on Linux forever, and (for better or worse) is the only reasonable behaviour that the kernel can take. I've argued for the opposite behaviour at times, and some subsystems already do limited retries before finally giving up on a failed write, though there are also times when retrying at lower levels is pointless if a higher level of code can handle the failure (e.g. mirrored block devices, filesystem data mirroring, userspace data mirroring, or cross-node replication).</p> <p>The confusion is whether fsync() is a &quot;level&quot; state (return error forever if there were pages that could not be written), or an &quot;edge&quot; state (return error only for any write failures since the previous fsync() call).</p> <p>I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on &quot;I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!&quot; without looking at the larger picture of what is practical to change and where the issue should best be fixed.</p> <p>Regardless of why this is the case, the net is that PG needs to deal with all of the systems that currently exist that have this behaviour, even if some day in the future it may change (though that is unlikely). It seems ironic that &quot;keep dirty pages in userspace until fsync() returns success&quot; is totally unacceptable, but &quot;keep dirty pages in the kernel&quot; is fine. 
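<p><em>[Editorial aside, not part of the archived thread: the &quot;edge&quot; semantics Andreas describes, and the per-fd errseq_t reporting added in Linux 4.13, can be illustrated with a small sketch. On a healthy disk everything below succeeds; the comments describe what the thread says happens when write-back of the file fails in the background:]</em></p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd1 = open(&quot;data&quot;, O_WRONLY | O_CREAT, 0644);
    int fd2 = open(&quot;data&quot;, O_RDWR);   /* second descriptor, open before any failure */
    if (fd1 &lt; 0 || fd2 &lt; 0) { perror(&quot;open&quot;); return 1; }

    if (write(fd1, &quot;x&quot;, 1) != 1)
        perror(&quot;write&quot;);

    fsync(fd1);   /* edge-triggered: reports a write-back error once (if one occurred)... */
    fsync(fd1);   /* ...and then reports success, even though the data was dropped        */

    fsync(fd2);   /* with per-fd errseq_t reporting, this fd also sees the error once,
                   * because it was open before the failure was first reported            */

    int fd3 = open(&quot;data&quot;, O_RDWR);
    if (fd3 &gt;= 0)
        fsync(fd3);   /* opened only after the failure: whether it ever sees the
                       * error at all depends on the kernel version */

    close(fd1); close(fd2); close(fd3);
    return 0;
}
</code></pre>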
My (limited) understanding of databases was that they preferred to cache everything in userspace and use O_DIRECT to write to disk (which returns an error immediately if the write fails and does not double buffer data).</p> <hr> <p>From: Martin Steigerwald <a href="mailto:martin@...htvoll.de">martin@...htvoll.de</a> Date: Tue, 10 Apr 2018 21:47:21 +0200</p> <p>Hi Theodore, Darrick, Joshua.</p> <p>CC´d fsdevel as it does not appear to be Ext4 specific to me (and to you as well, Theodore).</p> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>This isn't actually an ext4 issue, but a long-standing VFS/MM issue. […] First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> <p>Guh. I was not aware of this. I knew consumer-grade SSDs often do not have power loss protection, but still thought they´d handle garble collection in an atomic way. Sometimes I am tempted to sing an &quot;all hardware is crap&quot; song (starting with Meltdown/Spectre, then probably heading over to storage devices and so on… including firmware crap like Intel ME).</p> <blockquote> <p>Next, the reason why fsync() has the behaviour that it does is one ofhe the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.</p> <p>From the original PostgreSQL mailing list thread I did not get on how exactly FreeBSD differs in behavior, compared to Linux. I am aware of one operating system that from a user point of view handles this in almost the right way IMHO: AmigaOS.</p> </blockquote> <p>When you removed a floppy disk from the drive while the OS was writing to it it showed a &quot;You MUST insert volume somename into drive somedrive:&quot; and if you did, it just continued writing. (The part that did not work well was that with the original filesystem if you did not insert it back, the whole disk was corrupted, usually to the point beyond repair, so the &quot;MUST&quot; was no joke.)</p> <p>In my opinion from a user´s point of view this is the only sane way to handle the premature removal of removable media. 
I have read of a GSoC project to implement something like this for NetBSD but I did not check on the outcome of it. But in MS-DOS I think there has been something similar, however MS-DOS is not an multitasking operating system as AmigaOS is.</p> <p>Implementing something like this for Linux would be quite a feat, I think, cause in addition to the implementation in the kernel, the desktop environment or whatever other userspace you use would need to handle it as well, so you´d have to adapt udev / udisks / probably Systemd. And probably this behavior needs to be restricted to anything that is really removable and even then in order to prevent memory exhaustion in case processes continue to write to an removed and not yet re-inserted USB harddisk the kernel would need to halt I/O processes which dirty I/O to this device. (I believe this is what AmigaOS did. It just blocked all subsequent I/O to the device still it was re-inserted. But then the I/O handling in that OS at that time is quite different from what Linux does.)</p> <blockquote> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> </blockquote> <p>I was not aware that flash based media may be as crappy as you hint at.</p> <blockquote> <p>From my tests with AmigaOS 4.something or AmigaOS 3.9 + 3rd Party Poseidon USB stack the above mechanism worked even with USB sticks. I however did not test this often and I did not check for data corruption after a test.</p> </blockquote> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Tue, 10 Apr 2018 15:07:26 -0700 </code></pre> <p>(Sorry if I screwed up the thread structure - I'd to reconstruct the reply-to and CC list from web archive as I've not found a way to properly download an mbox or such of old content. Was subscribed to fsdevel but not ext4 lists)</p> <p>Hi,</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.</p> </blockquote> <p>Same ;)</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> </blockquote> <p>I don't think these necessarily are as contradictory goals as you paint them. 
At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficent for that.</p> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> <p>Both in postgres, <em>and</em> a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file writtten. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.</p> <p>You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says &quot;I died&quot; and set that instead of keeping per inode/whatever information.</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.</p> </blockquote> <p>I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.</p> <p>2018-04-10 19:44:48 Andreas wrote:</p> <blockquote> <p>The confusion is whether fsync() is a &quot;level&quot; state (return error forever if there were pages that could not be written), or an &quot;edge&quot; state (return error only for any write failures since the previous fsync() call).</p> </blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on &quot;I can't believe this is happening??? 
This is totally unacceptable and every kernel needs to change to match my expectations!!!&quot; without looking at the larger picture of what is practical to change and where the issue should best be fixed.</p> </blockquote> <p>Everone can participate in discussions...</p> <hr> <pre><code>From: Andreas Dilger &lt;adilger@...ger.ca&gt; Date: Wed, 11 Apr 2018 15:52:44 -0600 </code></pre> <p>On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote:</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> </blockquote> <p>Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error. If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.</p> <p>There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> </blockquote> <p>I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficent for that.</p> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. 
That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> <p>Consider if there was a per-inode &quot;there was once an error writing this inode&quot; flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.</p> <p>IMHO, the only alternative would be to keep the dirty pages in memory until they are written to disk. If that was not possible, what then? It would need a reboot to clear the dirty pages, or truncate the file (discarding all data)?</p> <blockquote> <p>Both in postgres, <em>and</em> a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.</p> </blockquote> <p>... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).</p> <p>It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is &quot;re-opened&quot;, and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.</p> <p>That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.</p> <blockquote> <p>You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says &quot;I died&quot; and set that instead of keeping per inode/whatever information.</p> </blockquote> <p>The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:</p> <p>int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { if (unlikely(ext4_forced_shutdown(EXT4_SB(inode-&gt;i_sb)))) return -EIO;</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.</p> </blockquote> <p>I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.</p> </blockquote> <p>Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others. Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. 
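<p><em>[Editorial aside, not part of the archived thread: the shared fd pool Andreas suggests above might look roughly like the sketch below. The structure and names are made up for illustration; the point is only that a descriptor is never closed without an fsync() whose result is checked, so the kernel's per-fd error reporting is never bypassed:]</em></p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

#define POOL_SIZE 64

/* Hypothetical shared fd pool: an fd is only closed after fsync(),
 * and both fsync() and close() errors are surfaced to the caller. */
struct pooled_fd { char path[256]; int fd; };
static struct pooled_fd pool[POOL_SIZE];
static int pool_used;

/* Return a descriptor for path, reusing one if the file was already
 * opened through the pool. */
int pool_get(const char *path)
{
    for (int i = 0; i &lt; pool_used; i++)
        if (strcmp(pool[i].path, path) == 0)
            return pool[i].fd;
    if (pool_used == POOL_SIZE)
        return -1;                    /* a real pool would evict an entry via pool_put() */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd &lt; 0)
        return -1;
    snprintf(pool[pool_used].path, sizeof pool[pool_used].path, &quot;%s&quot;, path);
    pool[pool_used].fd = fd;
    pool_used++;
    return fd;
}

/* Flush and close entry idx; a non-zero return means the data cannot
 * be assumed durable and the caller has to recover (e.g. WAL replay). */
int pool_put(int idx)
{
    int rc = 0;
    if (fsync(pool[idx].fd) != 0) { perror(&quot;fsync&quot;); rc = -1; }
    if (close(pool[idx].fd) != 0) { perror(&quot;close&quot;); rc = -1; }
    pool[idx] = pool[--pool_used];    /* compact the table */
    return rc;
}
</code></pre> <p><em>[Andres's reply further down notes why this is harder than it sounds for a multi-process server like Postgres.]</em></p>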
If PG developers want to add a tunable &quot;keep dirty pages in RAM on IO failure&quot;, I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.</p> <p>That said, <em>even</em> if a fix was available for Linux tomorrow, it would be <em>years</em> before a majority of users would have it available on their system, that includes even the errseq mechanism that was landed a few months ago. That implies to me that you'd want something that fixes PG <em>now</em> so that it works around whatever (perceived) breakage exists in the Linux fsync() implementation. Since the thread indicates that non-Linux kernels have the same fsync() behaviour, it makes sense to do that even if the Linux fix was available.</p> <blockquote> <p>2018-04-10 19:44:48 Andreas wrote:</p> <blockquote> <p>The confusion is whether fsync() is a &quot;level&quot; state (return error forever if there were pages that could not be written), or an &quot;edge&quot; state (return error only for any write failures since the previous fsync() call).</p> </blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> </blockquote> <p>I can't say how common or uncommon such a workload is, though PG is the only application that I've heard of doing it, and I've been working on filesystems for 20 years. I'm a bit surprised that anyone expects fsync() on a newly-opened fd to have any state from write() calls that predate the open. I can understand fsync() returning an error for any IO that happens within the context of that fsync(), but how far should it go back for reporting errors on that file? Forever? The only way to clear the error would be to reboot the system, since I'm not aware of any existing POSIX code to clear such an error</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 12 Apr 2018 10:09:16 +1000 </code></pre> <p>On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote: &gt; On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote: &gt; &gt; 2018-04-10 18:43:56 Ted wrote: &gt; &gt;&gt; So for better or for worse, there has not been as much investment in &gt; &gt;&gt; buffered I/O and data robustness in the face of exception handling of &gt; &gt;&gt; storage devices. &gt; &gt; &gt; &gt; That's a bit of a cop out. It's not just databases that care. Even more &gt; &gt; basic tools like SCM, package managers and editors care whether they can &gt; &gt; proper responses back from fsync that imply things actually were synced. 
&gt; &gt; Sure, but it is mostly PG that is doing (IMHO) crazy things like writing &gt; to thousands(?) of files, closing the file descriptors, then expecting &gt; fsync() on a newly-opened fd to return a historical error.</p> <p>Yeah, this seems like a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation.</p> <p>And speaking of &quot;behaving differently to expectations&quot;, nobody has mentioned that close() can also return write errors. Hence if you do write - close - open - fsync the the write error might get reported on close, not fsync. IOWs, the assumption that &quot;async writeback errors will persist across close to open&quot; is fundamentally broken to begin with. It's even documented as a slient data loss vector in the close(2) man page:</p> <pre><code>$ man 2 close ..... Dealing with error returns from close() A careful programmer will check the return value of close(), since it is quite possible that errors on a previous write(2) operation are reported only on the final close() that releases the open file description. Failing to check the return value when closing a file may lead to silent loss of data. This can especially be observed with NFS and with disk quota. </code></pre> <p>Yeah, ensuring data integrity in the face of IO errors is a really hard problem. :/</p> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do [complex things PG needs]&quot; discussion.</p> <p>In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.</p> <p>Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing it's own threading infrastructure to prevent blocking.</p> <p>This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Wed, 11 Apr 2018 19:17:52 -0700 </code></pre> <p>On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:</p> <blockquote> <p>On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote:</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. 
Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> </blockquote> <p>Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.</p> </blockquote> <p>It's not just postgres. dpkg (underlying apt, on debian derived distros) to take an example I just randomly guessed, does too:</p> <pre><code> /* We want to guarantee the extracted files are on the disk, so that the * subsequent renames to the info database do not end up with old or zero * length files in case of a system crash. As neither dpkg-deb nor tar do * explicit fsync()s, we have to do them here. * XXX: This could be avoided by switching to an internal tar extractor. */ dir_sync_contents(cidir); </code></pre> <p>(a bunch of other places too)</p> <p>Especially on ext3 but also on newer filesystems it's performancewise entirely infeasible to fsync() every single file individually - the performance becomes entirely attrocious if you do that.</p> <p>I think there's some legitimate arguments that a database should use direct IO (more on that as a reply to David), but claiming that all sorts of random utilities need to use DIO with buffering etc is just insane.</p> <blockquote> <p>If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.</p> </blockquote> <p>Except that they won't notice that they got a failure, at least in the dpkg case. And happily continue installing corrupted data</p> <blockquote> <p>There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.</p> </blockquote> <p>Yea, I agree that'd not be sane. As far as I understand the dpkg code (all of 10min reading it), that'd also be unnecessary. It can abort the installation, but only if it detects the error. Which isn't happening.</p> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> <p>Or even more extreme, you untar/zip/git clone a directory. Then do a sync. 
And you don't know whether anything actually succeeded.</p> <blockquote> <p>Consider if there was a per-inode &quot;there was once an error writing this inode&quot; flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.</p> </blockquote> <p>The data in the file also is corrupt. Having to unmount or delete the file to reset the fact that it can't safely be assumed to be on disk isn't insane.</p> <blockquote> <blockquote> <p>Both in postgres, <em>and</em> a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.</p> </blockquote> <p>... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).</p> <p>It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is &quot;re-opened&quot;, and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.</p> </blockquote> <p>Except that postgres uses multiple processes. And works on a lot of architectures. If we started to fsync all opened files on process exit our users would <em>lynch</em> us. We'd need a complicated scheme that sends processes across sockets between processes, then deduplicate them on the receiving side, somehow figuring out which is the oldest filedescriptors (handling clockdrift safely).</p> <p>Note that it'd be perfectly fine that we've &quot;thrown away&quot; the buffer contents if we'd get notified that the fsync failed. We could just do WAL replay, and restore the contents (just was we do after crashes and/or for replication).</p> <blockquote> <p>That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.</p> </blockquote> <p>There's already a per-process cache of open files.</p> <blockquote> <blockquote> <p>You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says &quot;I died&quot; and set that instead of keeping per inode/whatever information.</p> </blockquote> <p>The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:</p> <p>int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { if (unlikely(ext4_forced_shutdown(EXT4_SB(inode-&gt;i_sb)))) return -EIO;</p> </blockquote> <p>Well, I'm making that argument because several people argued that throwing away buffer contents in this case is the only way to not cause OOMs, and that that's incompatible with reporting errors. 
It's clearly not...</p> <blockquote> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.</p> </blockquote> <p>I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.</p> </blockquote> <p>Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others.</p> </blockquote> <p>Sure.</p> <blockquote> <p>Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable &quot;keep dirty pages in RAM on IO failure&quot;, I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.</p> </blockquote> <p>I don't think this is that PG specific, as explained above.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Wed, 11 Apr 2018 19:32:21 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do [complex things PG needs]&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> <blockquote> <p>In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. 
This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.</p> <p>Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing it's own threading infrastructure to prevent blocking.</p> <p>This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....</p> </blockquote> <p>Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Wed, 11 Apr 2018 19:51:13 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-11 19:32:21 -0700, Andres Freund wrote:</p> <blockquote> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>And before somebody argues that that's a too small window to trigger the problem realistically: Restoring large databases happens pretty commonly (for new replicas, testcases, or actual fatal issues), takes time, and it's where a lot of storage is actually written to for the first time in a while, so it's far from unlikely to trigger bad block errors or such.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Wed, 11 Apr 2018 20:02:48 -0700 </code></pre> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. 
We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 01:09:24 -0400 </code></pre> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.</p> </blockquote> <p>I know it's not what PG has chosen, but realistically all of the other major databases and userspace based storage systems have used DIO precisely <em>because</em> it's the way to avoid OS-specific behavior or require OS-specific code. DIO is simple, and pretty much the same everywhere.</p> <p>In contrast, the exact details of how buffered I/O workrs can be quite different on different OS's. This is especially true if you take performance related details (e.g., the cleaning algorithm, how pages get chosen for eviction, etc.)</p> <p>As I read the PG-hackers thread, I thought I saw acknowledgement that some of the behaviors you don't like with Linux also show up on other Unix or Unix-like systems?</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 01:34:45 -0400 </code></pre> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> <p>If there is no open file descriptor, and in many cases, no process (because it has already exited), it may be horrible, but what the h*ll else do you expect the OS to do?</p> <p>The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. If it detects errors on a particular hard drive, it tells the cluster file system to stop using that disk, and to reconstruct from erasure code all of the data chunks on that disk onto other disks in the cluster. 
We then run a series of disk diagnostics to make sure we find all of the bad sectors (every often, where there is one bad sector, there are several more waiting to be found), and then afterwards, put the disk back into service.</p> <p>By making it be a separate health monitoring process, we can have HDD experts write much more sophisticated code that can ask the disk firmware for more information (e.g., SMART, the grown defect list), do much more careful scrubbing of the disk media, etc., before returning the disk back to service.</p> <blockquote> <blockquote> <p>Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable &quot;keep dirty pages in RAM on IO failure&quot;, I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.</p> </blockquote> <p>I don't think this is that PG specific, as explained above.</p> </blockquote> <p>The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything everything on your wish list. They haven't, and that should tell you something.</p> <p>The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (incluing some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.</p> <p>If you have tens or hundreds of thousands of disk drives, then you will need to do tsomething automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 12 Apr 2018 15:45:36 +1000 </code></pre> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do <complex things PG needs>&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. 
I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.</p> </blockquote> <p>Yes it is.</p> <p>This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.</p> <blockquote> <p>Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>No, Just saying fsyncing individual files and directories is about the most inefficient way you could possible go about doing this.</p> <hr> <pre><code>From: Lukas Czerner &lt;lczerner@...hat.com&gt; Date: Thu, 12 Apr 2018 12:19:26 +0200 </code></pre> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 07:09:14 -0400 </code></pre> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> </blockquote> <p>What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? 
Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?</p> <blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> </blockquote> <p>I hate it :). We could do that, but....yecchhhh.</p> <p>Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.</p> <p>I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:</p> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 04:19:48 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> </blockquote> <p>I hate it :). We could do that, but....yecchhhh.</p> <p>Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.</p> </blockquote> <p>Yeah, it's definitely half-arsed. We could make further changes to improve the situation, but they'd have wider impact. For example, we can tell if the error has been sampled by any existing fd, so we could bias our inode reaping to have inodes with unreported errors stick around in the cache for longer.</p> <blockquote> <p>I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:</p> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. 
Is that the right thing to do?</p> </blockquote> <p>I suspect it isn't. If there's a transient error then we should reattempt the write. OTOH if the error is permanent then reattempting the write isn't going to do any good and it's just going to cause the drive to go through the whole error handling dance again. And what do we do if we're low on memory and need these pages back to avoid going OOM? There's a lot of options here, all of them bad in one situation or another.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>I don't know either. It'd force the application to face up to the fact that the data is gone immediately rather than only finding it out after a reboot. Again though that might cause more problems than it solves. It's hard to know what the right thing to do is.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 07:24:12 -0400 </code></pre> <p>On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do <complex things PG needs>&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.</p> </blockquote> <p>Yes it is.</p> <p>This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.</p> </blockquote> <p>Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.</p> <blockquote> <blockquote> <p>Or even just cp -r ing it, and then starting up a copy of the database. 
What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>No, Just saying fsyncing individual files and directories is about the most inefficient way you could possible go about doing this.</p> </blockquote> <p>You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 12 Apr 2018 22:01:22 +1000 </code></pre> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> <blockquote> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. The is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who cares that need to know about the completion status of every single write IO....</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 11:16:46 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 10:01:22PM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> </blockquote> <p>That's the problem. The best that could be done (and it's not enough) would be to have a mode which does with the PG folks want (or what they <em>think</em> they want). It seems what they want is to have an error result in the page being marked clean. When they discover the outcome (OOM-city and the unability to unmount a file system on a failed drive), then they will complain to us <em>again</em>, at which point we can tell them that want they really want is another variation on O_PONIES, and welcome to the real world and real life.</p> <p>Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distro's might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distro's. 
But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...</p> <p>We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors). Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.</p> <p>It's certainly that's what <em>I</em> would do if I didn't decide to use a hosted cloud solution, such as Cloud SQL for Postgres, and let someone else solve the really hard problems of dealing with real-world HDD failures. :-)</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 11:08:50 -0400 </code></pre> <p>On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> </blockquote> <p>I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?</p> <p>Given that the pages are clean after these failures, we aren't doing this even today:</p> <p>Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.</p> <p>That can even happen before you get the chance to call fsync, so even a write()+read()+fsync() is not guaranteed to be safe in this regard today, given sufficient memory pressure.</p> <p>I think the current situation is fine from a &quot;let's not OOM at all costs&quot; standpoint, but not so good for application predictability. We should really consider ways to do better here.</p> <blockquote> <blockquote> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. The is no magic bullet here, so I'm not sure we should worry too much. 
There's direct IO for anyone who cares that need to know about the completion status of every single write IO....</p> </blockquote> <p>I think we we have an opportunity here to come up with better defined and hopefully more useful behavior for buffered I/O in the face of writeback errors. The first step would be to hash out what we'd want it to look like.</p> <p>Maybe we need a plenary session at LSF/MM?</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 12:46:27 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 12:19:26 +0200, Lukas Czerner wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.</p> </blockquote> <p>That doesn't really help, unless you want to sync() and then re-read all the data to make sure it's the same. Rereading multi-TB backups just to know whether there was an error that the OS knew about isn't particularly fun. Without verifying after sync it's not going to improve the situation measurably, you're still only going to discover that $data isn't available when it's needed.</p> <p>What you're saying here is that there's no way to use standard linux tools to manipulate files and know whether it failed, without filtering kernel logs for IO errors. Or am I missing something?</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 12:55:36 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 01:34:45 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon.</p> </blockquote> <p>Any pointers to that the underling netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly hard through dm etc?), but this'd be much better than nothing.</p> <blockquote> <p>The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything everything on your wish list. They haven't, and that should tell you something.</p> </blockquote> <p>The problem really isn't about <em>recovering</em> from disk errors. <em>Knowing</em> about them is the crucial part. 
We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that &quot;we&quot; get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.</p> <blockquote> <p>The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (incluing some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.</p> <p>If you have tens or hundreds of thousands of disk drives, then you will need to do tsomething automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.</p> </blockquote> <p>Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 13:13:22 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>That's the problem. The best that could be done (and it's not enough) would be to have a mode which does with the PG folks want (or what they <em>think</em> they want). It seems what they want is to have an error result in the page being marked clean. When they discover the outcome (OOM-city and the unability to unmount a file system on a failed drive), then they will complain to us <em>again</em>, at which point we can tell them that want they really want is another variation on O_PONIES, and welcome to the real world and real life.</p> </blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient. I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem. If the drive is entirely gone there's obviously no point in keeping per-file information around, so per-blockdev/fs information suffices entirely to return an error on fsync (which at least on ext4 appears to happen if the underlying blockdev is gone).</p> <p>Have fun making up things we want, but I'm not sure it's particularly productive.</p> <blockquote> <p>Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distro's might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distro's. 
But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...</p> </blockquote> <p>Well, that's why I'm discussing here so we can figure out what's acceptable before considering wasting money and revew cycles doing or paying somebody to do some crazy useless shit.</p> <blockquote> <p>We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors).</p> </blockquote> <p>Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd be upstream. And yes, that'd be perfect.</p> <blockquote> <p>Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.</p> </blockquote> <p>As I said earlier, I think there's good reason to move to DIO for postgres. But to keep that performant is going to need some serious work.</p> <p>But afaict such a solution wouldn't really depend on applications using DIO or not. Before finishing a checkpoint (logging it persistently and allowing to throw older data away), we could check if any errors have been reported and give up if there have been any. And after starting postgres on a directory restored from backup using $tool, we can fsync the directory recursively, check for such errors, and give up if there've been any.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 13:24:57 -0700 </code></pre> <p>On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> </blockquote> <p>What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?</p> </blockquote> <p>Yes, I'd hope for a read error after a writeback failure. 
I think that's sane behaviour. But I don't really care <em>that</em> much.</p> <p>At the very least <em>some</em> way to <em>know</em> that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.</p> <p>If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.</p> <p>Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now.</p> <pre><code>1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test &quot;$oldcount&quot; -eq &quot;$newcount&quot; || die-with-horrible-message
</code></pre> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 13:28:30 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 14:11:45 -0700 </code></pre> <p>On 2018-04-12 07:24:12 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do <complex things PG needs>&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) 
etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.</p> </blockquote> <p>Yes it is.</p> <p>This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.</p> </blockquote> </blockquote> <p>syncfs isn't standardized, it operates on an entire filesystem (thus writing out unnecessary stuff), it has no meaningful documentation of it's return codes. Yes, using syncfs() might better performancewise, but it doesn't seem like it actually solves anything, performance aside:</p> <blockquote> <p>Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.</p> <p>You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.</p> </blockquote> <p>Which again doesn't allow one to use any non-bespoke tooling (like tar or whatnot). And it means you'll have to call syncfs() every few hundred files, because you'll obviously run into filehandle limitations.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 17:14:54 -0400 </code></pre> <p>On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.</p> <p>We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file-&gt;f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?</p> <blockquote> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> </blockquote> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 17:21:44 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. 
That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>When or how would the per-superblock wb_err flag get cleared?</p> <p>Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?</p> <blockquote> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> </blockquote> <p>Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.</p> <p>If it's only as single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 14:24:32 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 05:21:44PM -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>When or how would the per-superblock wb_err flag get cleared?</p> </blockquote> <p>That's not how errseq works, Ted ;-)</p> <blockquote> <p>Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?</p> </blockquote> <p>Only ones which occur after the last sampling get reported through this particular file descriptor.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 17:27:54 -0400 </code></pre> <p>On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. 
That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> </blockquote> <p>What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?</p> </blockquote> <p>Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care <em>that</em> much.</p> </blockquote> <p>I'll have to respectfully disagree. Why should I interpret an error on a read() syscall to mean that writeback failed? Note that the data is still potentially intact.</p> <p>What <em>might</em> make sense, IMO, is to just invalidate the pages that failed to be written back. Then you could potentially do a read to fault them in again (i.e. sync the pagecache and the backing store) and possibly redirty them for another try.</p> <p>Note that you can detect this situation by checking the return code from fsync. It should report the latest error once per file description.</p> <blockquote> <p>At the very least <em>some</em> way to <em>know</em> that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.</p> <p>If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.</p> </blockquote> <p>syncfs could use some work.</p> <p>I'm warming to willy's idea to add a per-sb errseq_t. I think that might be a simple way to get better semantics here. Not sure how we want to handle the reporting end yet though...</p> <p>We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). 
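<p>(An aside to make the fsync()-checking pattern Jeff describes above concrete: below is a minimal sketch, not from the thread, of what a program relying on errseq-style reporting does. The file name is made up and error handling is abbreviated; the comments flag the assumptions.)</p> <pre><code>/* Sketch: durability checking against a long-lived descriptor.
 * &quot;datafile&quot; is a made-up path for illustration. */
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    const char buf[] = &quot;some record\n&quot;;
    if (write(fd, buf, sizeof buf - 1) &lt; 0) { perror(&quot;write&quot;); return 1; }

    /* With errseq-style reporting, this descriptor is told once about a
     * writeback error that happened since its last check.  A descriptor
     * opened only after the error may never see it, which is the gap
     * being discussed in the thread. */
    if (fsync(fd) &lt; 0) {
        /* EIO/ENOSPC here means anything written since the last
         * successful fsync() on this fd may not be on stable storage.
         * A retry will usually return 0, so a clean retry is not
         * evidence that the data made it out. */
        fprintf(stderr, &quot;fsync: %s -- treating recent writes as lost\n&quot;,
                strerror(errno));
        return 1;
    }
    close(fd);
    return 0;
}
</code></pre>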
We don't really do that properly at quite yet either.</p> <blockquote> <p>Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now.</p> <p>1) figure out /sys/$whatnot $directory belongs to 2) oldcount=$(cat /sys/$whatnot/unreported_errors) 3) filesystem operations in $directory 4) sync;sync; 5) newcount=$(cat /sys/$whatnot/unreported_errors) 6) test &quot;$oldcount&quot; -eq &quot;$newcount&quot; || die-with-horrible-message</p> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> </blockquote> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 14:31:10 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.</p> <p>We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file-&gt;f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?</p> </blockquote> <p>Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.</p> <p>Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 14:37:56 -0700 </code></pre> <p>On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>When or how would the per-superblock wb_err flag get cleared?</p> </blockquote> <p>I don't think unmount + resettable via /sys would be an insane approach. Requiring explicit action to acknowledge data loss isn't a crazy concept. But I think that's something reasonable minds could disagree with.</p> <blockquote> <p>Would all subsequent fsync() calls on that file system now return EIO? 
Or would only all subsequent syncfs() calls return EIO?</p> </blockquote> <p>If it were tied to syncfs, I wonder if there's a way to have some errseq type logic. Store a per superblock (or whatever equivalent thing) errseq value of errors. For each fd calling syncfs() report the error once, but then store the current value in a separate per-fd field. And if that's considered too weird, only report the errors to fds that have been opened from before the error occurred.</p> <p>I can see writing a tool 'pg_run_and_sync /directo /ries -- command' which opens an fd for each of the filesystems the directories reside on, and calls syncfs() after. That'd allow to use backup/restore tools at least semi safely.</p> <blockquote> <blockquote> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> </blockquote> <p>Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.</p> </blockquote> <p>Well, it's an open list, everyone can argue. And initially people at first didn't know the OOM explanation, and then it takes some time to revise ones priors :). I think it's a design question that reasonable people can disagree upon (if &quot;hot&quot; removed devices are handled by throwing data away regardless, at least). But as it's clearly not something viable, we can move on to something that can solve the problem.</p> <blockquote> <p>If it's only as single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.</p> </blockquote> <p>Yea, I agree that that's not obvious. One way would be to say that it's only automatically cleared when you unlink the file. A bit heavyhanded, but not too crazy.</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 17:52:52 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote:</p> <blockquote> <p>Any pointers to that the underling netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly hard through dm etc?), but this'd be much better than nothing.</p> </blockquote> <p>Yeah, sorry, it never got upstreamed. It's not really all that complicated, it was just that there were some other folks who wanted to do something similar, and there was a round of bike-shedding several years ago, and nothing ever went upstream. Part of the problem was that our original scheme sent up information about file system-level corruption reports --- e.g, those stemming from calls to ext4_error() --- and lots of people had different ideas about how to get all of the possible information up in some structured format. 
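<p>(Andres's hypothetical pg_run_and_sync wrapper, sketched a few messages up, might look roughly like this. It is illustrative only: it assumes a single target directory, and, as Jeff notes elsewhere in the thread, whether syncfs() actually reports data writeback errors is iffy on the kernels being discussed, so a clean return here proves less than you'd like.)</p> <pre><code>/* Sketch of a &quot;run a command, then flush the whole filesystem&quot; wrapper.
 * Assumes one directory argument; real usage would handle several
 * filesystems and actually run the wrapped command. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(int argc, char **argv)
{
    const char *dir = argc &gt; 1 ? argv[1] : &quot;.&quot;;  /* directory on the target fs */

    int fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    /* ... run the restore / untar / cp -r step here ... */

    if (syncfs(fd) &lt; 0) {   /* push the whole filesystem to stable storage */
        perror(&quot;syncfs&quot;);
        return 1;           /* don't trust the restored data */
    }
    close(fd);
    return 0;
}
</code></pre>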
(Think something like uerf from Digtial's OSF/1.)</p> <p>We did something <em>really</em> simple/stupid. We just sent essentially an ascii test string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmssg).</p> <p>That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.</p> <blockquote> <p>The problem really isn't about <em>recovering</em> from disk errors. <em>Knowing</em> about them is the crucial part. We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that &quot;we&quot; get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.</p> </blockquote> <p>Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't <em>that</em> much variability.</p> <blockquote> <p>Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.</p> </blockquote> <p>Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow suceed. It's <em>possible</em> that might help, but it's not likely to be helpful in my experience.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 14:53:19 -0700 </code></pre> <p>On 2018-04-12 17:27:54 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:</p> <blockquote> <p>At the very least <em>some</em> way to <em>know</em> that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.</p> <p>If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.</p> </blockquote> <p>syncfs could use some work.</p> </blockquote> <p>It's really too bad that it doesn't have a flags argument.</p> <blockquote> <p>We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). 
We don't really do that properly at quite yet either.</p> <blockquote> <p>Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now.</p> <p>1) figure out /sys/$whatnot $directory belongs to 2) oldcount=$(cat /sys/$whatnot/unreported_errors) 3) filesystem operations in $directory 4) sync;sync; 5) newcount=$(cat /sys/$whatnot/unreported_errors) 6) test &quot;$oldcount&quot; -eq &quot;$newcount&quot; || die-with-horrible-message</p> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> </blockquote> <p>ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places. By my reading XFS doesn't seem to have something similar.</p> <p>Wouldn't be bad to standardize...</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 17:57:56 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 02:53:19PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> </blockquote> </blockquote> <p>ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places.</p> </blockquote> <p>This is only for file system inconsistencies noticed by the kernel. We don't bump that count for data block I/O errors.</p> <p>The same idea could be used on a block device level. It would be pretty simple to maintain a counter for I/O errors, and when the last error was detected on a particular device. You could even break out and track read errors and write errors separately if that would be useful.</p> <p>If you don't care what block was bad, but just that <em>some</em> I/O error had happened, a counter is definitely the simplest approach, and less hair to implement and use than something like a netlink channel or scraping dmesg....</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 15:03:59 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>We did something <em>really</em> simple/stupid. We just sent essentially an ascii test string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmssg).</p> <p>That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.</p> </blockquote> <p>The worst part of that is, as you mention below, needing to handle a lot of different error message formats. I guess it's reasonable enough if you control your hardware, but no such luck.</p> <p>Aren't there quite realistic scenarios where one could miss kmsg style messages due to it being a ringbuffer?</p> <blockquote> <p>Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. 
Fortunately nearly everything is a SCSI/SATA device these days, so there isn't <em>that</em> much variability.</p> </blockquote> <p>There's also SAN / NAS type stuff - not all of that presents as a SCSI/SATA device, right?</p> <blockquote> <blockquote> <p>Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.</p> </blockquote> <p>Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow suceed. It's <em>possible</em> that might help, but it's not likely to be helpful in my experience.</p> </blockquote> <p>Depends on the type of error and storage. ENOSPC, especially over NFS, has some reasonable chances of being cleared up. And for networked block storage it's also not impossible to think of scenarios where that'd work for EIO.</p> <p>But I think besides hope of clearing up itself, it has the advantage that it trivially can give <em>some</em> feedback to the user. The user'll get back strerror(ENOSPC) with some decent SQL error code, which'll hopefully cause them to investigate (well, once monitoring detects high error rates). It's much nicer for the user to type COMMIT; get an appropriate error back etc, than if the database just commits suicide.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Fri, 13 Apr 2018 08:44:04 +1000 </code></pre> <p>On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> </blockquote> <p>I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?</p> </blockquote> <p>Posix says this about write():</p> <pre><code> After a write() to a regular file has successfully returned: Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified. </code></pre> <p>IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.</p> <blockquote> <p>Given that the pages are clean after these failures, we aren't doing this even today:</p> <p>Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. 
Poof -- your write just disappeared.</p> </blockquote> <p>Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and it's surrounding context on page writeback submission errors?</p> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Fri, 13 Apr 2018 08:56:38 -0400 </code></pre> <p>On Thu, 2018-04-12 at 14:31 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.</p> <p>We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file-&gt;f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?</p> </blockquote> <p>Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.</p> <p>Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.</p> </blockquote> <p>That might work.</p> <p>It'd be a syscall behavioral change so we'd need to document that well. 
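<p>(For readers following the O_PATH idea: the sketch below is the userspace usage the proposal seems to have in mind. This is a proposal under discussion, not behavior current kernels promise, and the mount point is made up.)</p> <pre><code>/* Sketch of the proposed per-superblock error cursor via O_PATH.
 * /srv/pgdata is a hypothetical mount point. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int cursor = open(&quot;/srv/pgdata&quot;, O_PATH | O_DIRECTORY);
    if (cursor &lt; 0) { perror(&quot;open&quot;); return 1; }

    /* ... buffered writes happen elsewhere on this filesystem ... */

    /* Under the proposal, -EIO/-ENOSPC here would mean &quot;some writeback
     * on this filesystem failed since this fd was opened&quot;.  A kernel
     * without the change may simply reject an O_PATH fd, so treat this
     * purely as an illustration of the interface being discussed. */
    if (syncfs(cursor) &lt; 0)
        perror(&quot;syncfs&quot;);

    close(cursor);
    return 0;
}
</code></pre>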
It's probably innocuous though -- I doubt we have a lot of callers in the field opening files with O_PATH and calling syncfs on them.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Fri, 13 Apr 2018 09:18:56 -0400 </code></pre> <p>On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> </blockquote> <p>I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?</p> </blockquote> <p>Posix says this about write():</p> <p>After a write() to a regular file has successfully returned:</p> <pre><code> Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified. </code></pre> <p>IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.</p> <blockquote> <p>Given that the pages are clean after these failures, we aren't doing this even today:</p> <p>Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.</p> </blockquote> <p>Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and it's surrounding context on page writeback submission errors?</p> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. 
Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> </blockquote> <p>Got it, thanks.</p> <p>Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</li> <li>invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Fri, 13 Apr 2018 06:25:35 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-13 09:18:56 -0400, Jeff Layton wrote:</p> <blockquote> <p>Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</li> <li>invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> </blockquote> <p>It's not perfect, but I think the amount of hacky OS specific code should be acceptable. And it does allow for a wrapper tool that can be used around backup restores etc to syncfs all the necessary filesystems. Let me mull with others for a bit.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Fri, 13 Apr 2018 07:02:32 -0700 </code></pre> <p>On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:</p> <blockquote> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. 
It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> </blockquote> <p>Got it, thanks.</p> <p>Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</li> <li>invalidate or clear uptodate flag on pages that experience writebackerrors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> </blockquote> <p>I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)</p> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> <li>Background writebacks should skip pages which are PageError.</li> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> <p>I think kupdate writes are the same as for_background writes. for_reclaim is tougher. I don't want to see us getting into OOM because we're hanging onto stale data, but we don't necessarily have an open fd to report the error on. I think I'm leaning towards behaving the same for for_reclaim as for_sync, but this is probably a subject on which reasonable people can disagree.</p> <p>And this logic all needs to be on one place, although invoked from each filesystem.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Fri, 13 Apr 2018 07:48:07 -0700 </code></pre> <p>On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:</p> <blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. 
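</p> <p>That pattern -- write a batch of files without fsync, then make a single pass that opens and fsyncs each one -- is roughly the sketch below (a hypothetical helper with abbreviated error handling). Note that every descriptor handed to fsync() here was opened after the writes, which is the case at issue.</p> <pre><code>#include &lt;dirent.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

/* fsync every entry in dir; sketch only (no recursion, minimal checks). */
static int fsync_dir_contents(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int err = 0;

    if (!d)
        return -1;
    while ((de = readdir(d)) != NULL) {
        if (de-&gt;d_name[0] == '.')
            continue;
        int fd = openat(dirfd(d), de-&gt;d_name, O_RDONLY);
        if (fd &lt; 0 || fsync(fd) &lt; 0)
            err = -1;   /* an error that predates this open() may never show up here */
        if (fd &gt;= 0)
            close(fd);
    }
    closedir(d);
    return err;
}
</code></pre> <p>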
But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> </blockquote> <p>While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.</p> <p>Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.</p> <p>If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.</p> <p>B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether <em>anyone</em> has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:</p> <p>Signed-off-by: Matthew Wilcox <a href="mailto:mawilcox@...rosoft.com">mawilcox@...rosoft.com</a></p> <pre><code>diff --git a/lib/errseq.c b/lib/errseq.c index df782418b333..093f1fba4ee0 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); errseq_t errseq_sample(errseq_t *eseq) { errseq_t old = READ_ONCE(*eseq); - errseq_t new = old; - /* - * For the common case of no errors ever having been set, we can skip - * marking the SEEN bit. Once an error has been set, the value will - * never go back to zero. - */ - if (old != 0) { - new |= ERRSEQ_SEEN; - if (old != new) - cmpxchg(eseq, old, new); - } - return new; + /* If nobody has seen this error yet, then we can be the first. */ + if (!(old &amp; ERRSEQ_SEEN)) + old = 0; + return old; } EXPORT_SYMBOL(errseq_sample); </code></pre> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Sat, 14 Apr 2018 11:47:52 +1000 </code></pre> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:</p> <blockquote> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> </blockquote> <p>Got it, thanks.</p> <p>Yes, I think we ought to probably do the same thing globally. 
It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li><p>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</p></li> <li><p>invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</p></li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> </blockquote> <p>I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)</p> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> <p>e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.</p> <p>This is a recipe for &quot;I lost data that I wrote /days/ before the system crashed&quot; bug reports.</p> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> <p>We used to do this with XFS metadata. We'd just keep trying to write metadata and keep the filesystem running (because it's consistent in memory and it might be a transient error) rather than shutting down the filesystem after a couple of retries. the result was that users wouldn't notice there were problems until unmount, and the most common sympton of that was &quot;why is system shutdown hanging?&quot;.</p> <p>We now don't hang at unmount by default:</p> <pre><code>$ cat /sys/fs/xfs/dm-0/error/fail_at_unmount 1 $ </code></pre> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. 
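</p> <p>These knobs are ordinary sysfs files, so they can be read or tuned by anything that can open a file; the sketch below reads the fail_at_unmount setting shown earlier in this message (the dm-0 path is the example from the thread, and the per-error knobs live in sibling directories whose exact names vary by kernel version).</p> <pre><code>#include &lt;stdio.h&gt;

int main(void)
{
    char val[16];
    FILE *f = fopen(&quot;/sys/fs/xfs/dm-0/error/fail_at_unmount&quot;, &quot;r&quot;);

    if (f &amp;&amp; fgets(val, sizeof(val), f))
        printf(&quot;fail_at_unmount = %s&quot;, val);
    if (f)
        fclose(f);
    return 0;
}
</code></pre> <p>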
this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> <blockquote> <p>And this logic all needs to be on one place, although invoked from each filesystem.</p> </blockquote> <p>Perhaps so, but as there's no &quot;one-size-fits-all&quot; behaviour, I really want to extend the XFS error config infrastructure to control what the filesystem does on error here.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Fri, 13 Apr 2018 19:04:33 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:</p> <blockquote> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> </blockquote> <p>Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Fri, 13 Apr 2018 19:38:14 -0700 </code></pre> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> <blockquote> <p>e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.</p> <p>This is a recipe for &quot;I lost data that I wrote /days/ before the system crashed&quot; bug reports.</p> </blockquote> <p>So ... exponential backoff on retries?</p> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. 
The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> <hr> <hr> <pre><code>From: bfields@...ldses.org (J. Bruce Fields) Date: Wed, 18 Apr 2018 12:52:19 -0400 </code></pre> <blockquote> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> </blockquote> <p>Pointers to documentation or papers or anything? The only google results I can find for &quot;power fail certified&quot; are your posts.</p> <p>I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.</p> <hr> <pre><code>From: bfields@...ldses.org (J. Bruce Fields) Date: Wed, 18 Apr 2018 14:09:03 -0400 </code></pre> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:</p> <blockquote> <p>On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote:</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> </blockquote> <p>Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.</p> </blockquote> <p>It's not just postgres. dpkg (underlying apt, on debian derived distros) to take an example I just randomly guessed, does too: /* We want to guarantee the extracted files are on the disk, so that the * subsequent renames to the info database do not end up with old or zero * length files in case of a system crash. As neither dpkg-deb nor tar do * explicit fsync()s, we have to do them here. * XXX: This could be avoided by switching to an internal tar extractor. */ dir_sync_contents(cidir);</p> <p>(a bunch of other places too)</p> <p>Especially on ext3 but also on newer filesystems it's performancewise entirely infeasible to fsync() every single file individually - the performance becomes entirely attrocious if you do that.</p> </blockquote> <p>Is that still true if you're able to use some kind of parallelism? 
(async io, or fsync from multiple processes?)</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 19 Apr 2018 09:59:50 +1000 </code></pre> <p>On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:</p> <blockquote> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> </blockquote> <p>Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?</p> </blockquote> <p>That's for metadata writeback error behaviour, not data writeback IO errors.</p> <p>We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways).</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 19 Apr 2018 10:13:43 +1000 </code></pre> <p>On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> </blockquote> <p>And if it's getting IO errors because of USB stick pull? What then?</p> <blockquote> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> </blockquote> <p>So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?</p> <blockquote> <blockquote> <p>e.g. 
XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.</p> <p>This is a recipe for &quot;I lost data that I wrote /days/ before the system crashed&quot; bug reports.</p> </blockquote> <p>So ... exponential backoff on retries?</p> </blockquote> <p>Maybe, but I don't think that actually helps anything and adds yet more &quot;when should we write this&quot; complication to inode writeback....</p> <blockquote> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> </blockquote> <p>But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....</p> <hr> <pre><code>From: Eric Sandeen &lt;esandeen@...hat.com&gt; Date: Wed, 18 Apr 2018 19:23:46 -0500 </code></pre> <p>On 4/18/18 6:59 PM, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:</p> <blockquote> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> </blockquote> <p>Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?</p> </blockquote> <p>That's for metadata writeback error behaviour, not data writeback IO errors.</p> </blockquote> <p>/me points casually at data_err=abort &amp; data_err=ignore in ext4...</p> <pre><code> data_err=ignore Just print an error message if an error occurs in a file data buffer in ordered mode. data_err=abort Abort the journal if an error occurs in a file data buffer in ordered mode. </code></pre> <p>Just sayin'</p> <blockquote> <p>We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be be extended to data writeback and other operational error handling in the filesystem (e.g. 
dealing with ENOMEM in different ways).</p> </blockquote> <p>I don't disagree, but there are already mount-option knobs in ext4, FWIW.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Wed, 18 Apr 2018 17:40:37 -0700 </code></pre> <p>On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> </blockquote> <p>And if it's getting IO errors because of USB stick pull? What then?</p> </blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.</p> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> </blockquote> <p>So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?</p> </blockquote> <p>I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).</p> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> </blockquote> <p>But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....</p> </blockquote> <p>Right. But then that's on the application.</p> <hr> <pre><code>From: &quot;Theodore Y. 
Ts'o&quot; &lt;tytso@....edu&gt; Date: Wed, 18 Apr 2018 21:08:19 -0400 </code></pre> <p>On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.o</p> </blockquote> <p>Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to igure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.</p> <hr> <hr> <pre><code>From: Christoph Hellwig &lt;hch@...radead.org&gt; Date: Thu, 19 Apr 2018 01:39:04 -0700 </code></pre> <p>On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:</p> <blockquote> <blockquote> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> </blockquote> <p>Pointers to documentation or papers or anything? The only google results I can find for &quot;power fail certified&quot; are your posts.</p> <p>I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.</p> </blockquote> <p>Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive eterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot driver SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.</p> <hr> <pre><code>From: &quot;J. Bruce Fields&quot; &lt;bfields@...ldses.org&gt; Date: Thu, 19 Apr 2018 10:10:16 -0400 </code></pre> <p>On Thu, Apr 19, 2018 at 01:39:04AM -0700, Christoph Hellwig wrote:</p> <blockquote> <p>On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:</p> <blockquote> <blockquote> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. 
What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> </blockquote> <p>Pointers to documentation or papers or anything? The only google results I can find for &quot;power fail certified&quot; are your posts.</p> <p>I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.</p> </blockquote> <p>Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive eterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot driver SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.</p> </blockquote> <p>Thanks! That was my understanding too. But then the name is terrible. As is all the vendor documentation I can find:</p> <blockquote> <p><a href="https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/">https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/</a></p> <p>&quot;Power loss protection is a critical aspect of ensuring data integrity, especially in servers or data centers.&quot;</p> <p><a href="https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf">https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf</a></p> <p>&quot;Data safety features prepare for unexpected power-loss and protect system and user data.&quot;</p> </blockquote> <p>Why do they all neglect to mention that their consumer drives are also perfectly capable of well-defined behavior after power loss, just at the expense of flush performance? It's ridiculously confusing.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 19 Apr 2018 10:40:10 -0700 </code></pre> <p>On Wed, Apr 18, 2018 at 09:08:19PM -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.o</p> </blockquote> <p>Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? 
Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to igure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.</p> </blockquote> <p>The device mapper target (dm-removable?) was my first idea too, but I kept thinking through use cases and I think we end up wanting this functionality in the block layer. Let's try a story.</p> <p>Stephen the PFY goes into the data centre looking to hotswap a failed drive. Due to the eight pints of lager he had for lunch, he pulls out the root drive instead of the failed drive. The air raid siren warbles and he realises his mistake, shoving the drive back in.</p> <p>CYOA:</p> <p>Currently: All writes are lost, calamities ensue. The PFY is fired.</p> <p>With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.</p> <p>Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 19 Apr 2018 19:27:15 -0400 </code></pre> <p>On Thu, Apr 19, 2018 at 10:40:10AM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.</p> <p>Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.</p> </blockquote> <p>What you're talking about is a deployment issue, though. Ultimately the distribution will set up dm-removable automatically if the user requests it, much like it sets up dm-crypt automatically for laptop users upon request.</p> <p>My concern is that not all removable devices have a globally unique id number available in hardware so the kernel can tell whether or not it's the same device that has been plugged in. There are hueristics you could use -- for example, you could look at the file system uuid plus the last fsck time. But they tend to be very file system specific, and not things we would want ot have in the kernel.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Fri, 20 Apr 2018 09:28:59 +1000 </code></pre> <p>On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> </blockquote> <p>And if it's getting IO errors because of USB stick pull? What then?</p> </blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. 
Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.</p> </blockquote> <p><em>nod</em></p> <p>But in the meantime, device unplug (should give ENODEV, not EIO) is a fatal error and we need to toss away the data.</p> <blockquote> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> </blockquote> <p>So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?</p> </blockquote> <p>I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).</p> </blockquote> <p>But if it's still failing, then we should be still trying to report the error. i.e. if fsync fails and the page remains dirty, then the next attmept to write it is a new error and fsync should report that. IOWs, I think we should be returning errors at every occasion errors need to be reported if we have a persistent writeback failure...</p> <blockquote> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> </blockquote> <p>But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....</p> </blockquote> <p>Right. But then that's on the application.</p> </blockquote> <p>Which we know don't do the right thing. Seems like a lot of hoops to jump through given it still won't work if the appliction isn't changed to support linux specific error handling requirements...</p> <hr> <pre><code>From: Jan Kara &lt;jack@...e.cz&gt; Date: Sat, 21 Apr 2018 18:59:54 +0200 </code></pre> <p>On Fri 13-04-18 07:48:07, Matthew Wilcox wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:</p> <blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. 
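</p> <p>Spelled out as code, the sequence being described reduces to something like the sketch below (hypothetical file name; the comment on fsync() reflects the pre-fix behaviour being complained about, where the error predates the reopen).</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/* Process A: write without fsync, then close. */
static void task_a(void)
{
    const char buf[] = &quot;important update&quot;;
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT, 0644);

    write(fd, buf, sizeof(buf) - 1);  /* data is only in the page cache */
    close(fd);
}

/* ... later, kernel writeback of A's dirty pages fails with EIO ... */

/* Process B: reopen the same file and fsync on A's behalf. */
static int task_b(void)
{
    int fd = open(&quot;datafile&quot;, O_WRONLY);
    int ret = fsync(fd);    /* reports success: the failure happened
                             * before this open(), so B never sees it */
    close(fd);
    return ret;
}

int main(void)
{
    task_a();
    return task_b() ? 1 : 0;
}
</code></pre> <p>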
The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> </blockquote> <p>While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.</p> <p>Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.</p> <p>If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.</p> <p>B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether <em>anyone</em> has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:</p> <p>Signed-off-by: Matthew Wilcox <a href="mailto:mawilcox@...rosoft.com">mawilcox@...rosoft.com</a></p> </blockquote> <p>So I agree with going to the old semantics of reporting errors from before a file was open at least once to someone. As the PG case shows apps are indeed relying on the old behavior. As much as it is unreliable, it ends up doing the right thing for these apps in 99% of cases and we shouldn't break them (BTW IMO the changelog should contain a note that this fixes a regression of PostgreSQL, a reference to this thread and CC to stable). Anyway feel free to add:</p> <p>Reviewed-by: Jan Kara <a href="mailto:jack@...e.cz">jack@...e.cz</a></p> <p>Oh, and to make myself clear I do think we need to find a better way of reporting IO errors. I consider this just an immediate band-aid to avoid userspace regressions.</p> <blockquote> <p>diff --git a/lib/errseq.c b/lib/errseq.c index df782418b333..093f1fba4ee0 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); errseq_t errseq_sample(errseq_t *eseq) { errseq_t old = READ_ONCE(*eseq); - errseq_t new = old;</p> <ul> <li>/*</li> <li>* For the common case of no errors ever having been set, we can skip</li> <li>* marking the SEEN bit. Once an error has been set, the value will</li> <li>* never go back to zero.</li> <li>*/</li> <li>if (old != 0) {</li> <li>new |= ERRSEQ_SEEN;</li> <li>if (old != new)</li> <li>cmpxchg(eseq, old, new);</li> <li>}</li> <li>return new;</li> <li>/* If nobody has seen this error yet, then we can be the first. 
*/</li> <li>if (!(old &amp; ERRSEQ_SEEN))</li> <li>old = 0;</li> <li>return old;</li> </ul> </blockquote> <hr> <pre><code>From: Jan Kara &lt;jack@...e.cz&gt; Date: Sat, 21 Apr 2018 20:14:29 +0200 </code></pre> <p>On Thu 12-04-18 07:09:14, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> </blockquote> <p>I hate it :). We could do that, but....yecchhhh.</p> <p>Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.</p> </blockquote> <p>So this is never going to be perfect but I think we could do good enough by: 1) Mark inodes that hit IO error. 2) If the inode gets evicted from memory we store the fact that we hit an error for this IO in a more space efficient data structure (sparse bitmap, radix tree, extent tree, whatever). 3) If the underlying device gets destroyed, we can just switch the whole SB to an error state and forget per inode info. 4) If there's too much of per-inode error info (probably per-fs configurable limit in terms of number of inodes), we would yell in the kernel log, switch the whole fs to the error state and forget per inode info.</p> <p>This way there won't be silent loss of IO errors. Memory usage would be reasonably limited. It could happen the whole fs would switch to error state &quot;prematurely&quot; but if that's a problem for the machine, admin could tune the limit for number of inodes to keep IO errors for...</p> <blockquote> <p>I think the crux of the matter here is not really about error reporting, per-se.</p> </blockquote> <p>I think this is related but a different question.</p> <blockquote> <p>I asked this at LSF last year, and got no real answer:</p> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>I can see the admin wanting to rather kill the machine with OOM than having to deal with data loss due to IO errors (e.g. if he has HA server fail over set up). Or retry for some time before dropping the dirty data. Or do what we do now (possibly with invalidating pages as you say). As Dave said elsewhere there's not one strategy that's going to please everybody. 
So it might be beneficial to have this configurable like XFS has it for metadata.</p> <p>OTOH if I look at the problem from application developer POV, most apps will just declare game over at the face of IO errors (if they take care to check for them at all). And the sophisticated apps that will try some kind of error recovery have to be prepared that the data is just gone (as depending on what exactly the kernel does is rather fragile) so I'm not sure how much practical value the configurable behavior on writeback errors would bring.</p> <hr> <p><link rel="prefetch" href="//danluu.com/deconstruct-files/"></p> Computer latency: 1977-2017 input-lag/ Sun, 24 Dec 2017 00:00:00 +0000 input-lag/ <p>I've had this nagging feeling that the computers I use today feel slower than the computers I used as a kid. As a rule, I don’t trust this kind of feeling because human perception has been shown to be unreliable in empirical studies, so I carried around a high-speed camera and measured the response latency of devices I’ve run into in the past few months. Here are the results:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}td.l {text-align:left;}</style> <table> <tr> <th>computer</th><th>latency<br>(ms)</th><th>year</th><th>clock</th><th># T</th></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Apple_IIe">apple 2e</a></td><td bgcolor=#ffffcc><font color=black>30</td><td bgcolor=#54278f><font color=white>1983</font></td><td bgcolor=#08306b><font color=white>1 MHz</font></td><td bgcolor=#08306b><font color=white>3.5k</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B004GYBZBE/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B004GYBZBE&linkId=123a7b9e283914f633fbdc9b001f255b">ti 99/4a</a></td><td bgcolor=#ffffcc><font color=black>40</td><td bgcolor=#3f007d><font color=white>1981</font></td><td bgcolor=#08519c><font color=white>3 MHz</font></td><td bgcolor=#08306b><font color=white>8k</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>165Hz</i></a></td><td bgcolor=#ffeda0><font color=black>50</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Commodore_PET">commodore pet 4016</a></td><td bgcolor=#ffeda0><font color=black>60</td><td bgcolor=#3f007d><font color=white>1977</font></td><td bgcolor=#08306b><font color=white>1 MHz</font></td><td bgcolor=#08306b><font color=white>3.5k</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/SGI_Indy">sgi indy</a></td><td bgcolor=#ffeda0><font color=black>60</td><td bgcolor=#807dba><font color=black>1993</font></td><td bgcolor=#6baed6><font color=black>.1 GHz</font></td><td bgcolor=#4292c6><font color=black>1.2M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>120Hz</i></a></td><td bgcolor=#ffeda0><font color=black>60</td><td bgcolor=#efedf5><font color=black>2014</font></td><td 
bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LWNVFIM/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LWNVFIM&linkId=d4f3e06001021e2bfc95eb2c8c3a9805">thinkpad 13 <b>chromeos</b></a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#deebf7><font color=black>2.3 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B007O35FI8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B007O35FI8&linkId=dc8b3670da16e98e718b157661037e4b">imac g4 <b>os 9</b></a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#bcbddc><font color=black>2002</font></td><td bgcolor=#c6dbef><font color=black>.8 GHz</font></td><td bgcolor=#6baed6><font color=black>11M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>60Hz</i></a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Macintosh_Color_Classic">mac color classic</a></td><td bgcolor=#feb24c><font color=black>90</td><td bgcolor=#807dba><font color=black>1993</font></td><td bgcolor=#2171b5><font color=white>16 MHz</font></td><td bgcolor=#2171b5><font color=white>273k</font></td></tr> <tr> <td class="l"><a href="http://www.powerspec.com/systems/system_specs.phtml?selection=G405">powerspec g405 <b>linux</b> <i>60Hz</i></a></td><td bgcolor=#feb24c><font color=black>90</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#f7fbff><font color=black>4.2 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0096VDM8G/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0096VDM8G&linkId=d134e4e812fd9e68ddb6191e03f78a72">macbook pro 2014</a></td><td bgcolor=#feb24c><font color=black>100</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>2.6 GHz</font></td><td bgcolor=#deebf7><font color=black>700M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LWNVFIM/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LWNVFIM&linkId=d4f3e06001021e2bfc95eb2c8c3a9805">thinkpad 13 <b>linux chroot</b></a></td><td bgcolor=#feb24c><font color=black>100</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#deebf7><font color=black>2.3 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072BDGLBC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072BDGLBC&linkId=b316192e2edacb439be22ef01a6bb779">lenovo x1 carbon 4g <b>linux</b></a></td><td bgcolor=#fd8d3c><font color=black>110</td><td bgcolor=#efedf5><font color=black>2016</font></td><td 
bgcolor=#deebf7><font color=black>2.6 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B007O35FI8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B007O35FI8&linkId=dc8b3670da16e98e718b157661037e4b">imac g4 <b>os x</b></a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#bcbddc><font color=black>2002</font></td><td bgcolor=#c6dbef><font color=black>.8 GHz</font></td><td bgcolor=#6baed6><font color=black>11M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>24Hz</i></a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072BDGLBC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072BDGLBC&linkId=b316192e2edacb439be22ef01a6bb779">lenovo x1 carbon 4g <b>win</b></a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font color=black>2016</font></td><td bgcolor=#deebf7><font color=black>2.6 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/NeXTcube">next cube</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#6a51a3><font color=white>1988</font></td><td bgcolor=#4292c6><font color=black>25 MHz</font></td><td bgcolor=#4292c6><font color=black>1.2M</font></td></tr> <tr> <td class="l"><a href="http://www.powerspec.com/systems/system_specs.phtml?selection=G405">powerspec g405 <b>linux</b></a></td><td bgcolor=#bd0026><font color=black>170</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#f7fbff><font color=black>4.2 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l" bgcolor="silver">packet around the world</td><td bgcolor=silver><font color=black>190</td><td bgcolor=silver><font color=black></font></td><td bgcolor=silver><font color=black></font></td><td bgcolor=silver><font color=black></font></td></tr> <tr> <td class="l"><a href="http://www.powerspec.com/systems/system_specs.phtml?selection=G405">powerspec g405 <b>win</b></a></td><td bgcolor=black><font color=white>200</font></td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#f7fbff><font color=black>4.2 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Symbolics">symbolics 3620</a></td><td bgcolor=black><font color=white>300</font></td><td bgcolor=#54278f><font color=white>1986</font></td><td bgcolor=#08519c><font color=white>5 MHz</font></td><td bgcolor=#2171b5><font color=white>390k</font></td></tr> </table></p> <p>These are tests of the latency between a keypress and the display of a character in a terminal (see appendix for more details). The results are sorted from quickest to slowest. In the latency column, the background goes from green to yellow to red to black as devices get slower and the background gets darker as devices get slower. No devices are green. 
When multiple OSes were tested on the same machine, the os is <b>in bold</b>. When multiple refresh rates were tested on the same machine, the refresh rate is <i>in italics</i>.</p> <p>In the year column, the background gets darker and purple-er as devices get older. If older devices were slower, we’d see the year column get darker as we read down the chart.</p> <p>The next two columns show the clock speed and number of transistors in the processor. Smaller numbers are darker and blue-er. As above, if slower clocked and smaller chips correlated with longer latency, the columns would get darker as we go down the table, but it, if anything, seems to be the other way around.</p> <p>For reference, the latency of a packet going around the world through fiber from NYC back to NYC via <a href="https://www.extremetech.com/extreme/122989-1-5-billion-the-cost-of-cutting-london-toyko-latency-by-60ms">Tokyo and London</a> is inserted in the table.</p> <p>If we look at overall results, the fastest machines are ancient. Newer machines are all over the place. Fancy gaming rigs with unusually high refresh-rate displays are almost competitive with machines from the late 70s and early 80s, but “normal” modern computers can’t compete with thirty to forty year old machines.</p> <p>We can also look at mobile devices. In this case, we’ll look at scroll latency in the browser:</p> <table> <tr> <th>device</th><th>latency<br>(ms)</th><th>year</th></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072V4HK9F/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072V4HK9F&linkId=5679f47f9a699d00e721eff9592bed11">ipad pro 10.5" pencil</a></td><td bgcolor=#ffffcc><font color=black>30</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072V4HK9F/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072V4HK9F&linkId=5679f47f9a699d00e721eff9592bed11">ipad pro 10.5"</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B006FMDVDK/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B006FMDVDK&linkId=9ad1ee1c78ed24a00d7aeaa7acba466b">iphone 4s</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#dadaeb><font color=black>2011</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00YD547Q6/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00YD547Q6&linkId=8760bcadabac2a1a34059f5695523d18">iphone 6s</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B008VUNRZQ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B008VUNRZQ&linkId=f9f973947891d7eb01a251146c8a17ed">iphone 3gs</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#dadaeb><font color=black>2009</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B075QN8NDJ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B075QN8NDJ&linkId=17f3a63a6d672e453c1bb28c26f1e680">iphone x</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#fcfbfd><font 
color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B075QJSQLT/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B075QJSQLT&linkId=0e9df41aaed6dc80988793dcbc08a175">iphone 8</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0743HK992/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0743HK992&linkId=ad1f03111643f0e52f245f120f8efb2b">iphone 7</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00YD545CC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00YD545CC&linkId=cc899d2e45b87f032042af8eb666cef1">iphone 6</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#efedf5><font color=black>2014</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B077629NJ2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B077629NJ2&linkId=cadd23b2ef84a2b9c6abe268754e6cb4">gameboy color</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#9e9ac8><font color=black>1998</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00WZR5URO/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00WZR5URO&linkId=f781b3a6200f81f6a0137a69ab1b4c0f">iphone 5</a></td><td bgcolor=#feb24c><font color=black>90</td><td bgcolor=#efedf5><font color=black>2012</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00CES5B96/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00CES5B96&linkId=fab4357d277c9716f691d15431aace90">blackberry q10</a></td><td bgcolor=#feb24c><font color=black>100</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01IVV7W8M/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01IVV7W8M&linkId=be83bfddcd1ff8029f548810b405a281">huawei honor 8</a></td><td bgcolor=#fd8d3c><font color=black>110</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0766TPHSH/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0766TPHSH&linkId=774b61b9fe3692df21821844dfedc18b">google pixel 2 xl</a></td><td bgcolor=#fd8d3c><font color=black>110</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01F48QLFA/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01F48QLFA&linkId=00e8a7505d84e546c50e3d8761944d05">galaxy s7</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00TJ4FFHQ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00TJ4FFHQ&linkId=f74214ac520214bcb199ce2710ed470f">galaxy note 3</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font 
color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00EP2CLII/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00EP2CLII&linkId=f3ad24ebb82fb4b8282ae94deede1319">moto x</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B076KTGJPG/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B076KTGJPG&linkId=9164d7895c51ee31fd8db00c3d1597e7">nexus 5x</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01MXR13TZ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01MXR13TZ&linkId=924c9f9d8f2413c64f0f295d505f09da">oneplus 3t</a></td><td bgcolor=#fc4e2a><font color=black>130</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B075QBLBN8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B075QBLBN8&linkId=439fabf5b71beb2b4820618432366e72">blackberry key one</a></td><td bgcolor=#fc4e2a><font color=black>130</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B016QP7SKC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B016QP7SKC&linkId=207d8847bf6e6740694ea85cc97a8363">moto e (2g)</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B071YC3G5V/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B071YC3G5V&linkId=d5c16509db8a503bfd0e9fa4480bd92f">moto g4 play</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01DZJFYLC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01DZJFYLC&linkId=d173832b8e78e184d0ec5f801c582403">moto g4 plus</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0731KJVGG/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0731KJVGG&linkId=a52d46d36b6136641937b82c5f99fa1d">google pixel</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00N3FSWG8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00N3FSWG8&linkId=a4df2be4f14002eed6fdbd267b8dca91">samsung galaxy avant</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font color=black>2014</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LZ8516T/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LZ8516T&linkId=16070e26fe5b1eee184b3fcfa6a7f373">asus zenfone3 max</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font 
color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LACBMV2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LACBMV2&linkId=6a38fa7701b76f17fc1ba183e2bac551">sony xperia z5 compact</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00F618DXE/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00F618DXE&linkId=481f35c95f57b730ab417f680ff1cdf3">htc one m4</a></td><td bgcolor=#e31a1c><font color=black>160</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00UZ7QJ6W/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00UZ7QJ6W&linkId=a37479092aca4085fc7bee582232fbfe">galaxy s4 mini</a></td><td bgcolor=#bd0026><font color=black>170</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072K1BY62/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072K1BY62&linkId=7a9dc6d3a3400a24cba9141b09d85f32">lg k4</a></td><td bgcolor=#800026><font color=white>180</font></td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l" bgcolor="silver">packet</td><td bgcolor=silver><font color=black>190</td><td bgcolor=silver><font color=black></font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0073YBGCC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0073YBGCC&linkId=75785e57c1c5a2318dcd95a6178af60c">htc rezound</a></td><td bgcolor=black><font color=white>240</font></td><td bgcolor=#dadaeb><font color=black>2011</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Pilot_1000">palm pilot 1000</a></td><td bgcolor=black><font color=white>490</font></td><td bgcolor=#807dba><font color=black>1996</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Amazon_Kindle">kindle oasis 2</a></td><td bgcolor=black><font color=white>570</font></td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Amazon_Kindle">kindle paperwhite 3</a></td><td bgcolor=black><font color=white>630</font></td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Amazon_Kindle">kindle 4</a></td><td bgcolor=black><font color=white>860</font></td><td bgcolor=#dadaeb><font color=black>2011</font></td></tr> </table> <p>As above, the results are sorted by latency and color-coded from green to yellow to red to black as devices get slower. Also as above, the year gets purple-er (and darker) as the device gets older.</p> <p>If we exclude the <code>game boy color</code>, which is a different class of device than the rest, all of the quickest devices are Apple phones or tablets. The next quickest device is the <code>blackberry q10</code>. Although we don’t have enough data to really tell why the <code>blackberry q10</code> is unusually quick for a non-Apple device, one plausible guess is that it’s helped by having actual buttons, which are easier to implement with low latency than a touchscreen. 
The other two devices with actual buttons are the <code>gameboy color</code> and the <code>kindle 4</code>.</p> <p>After the <code>iphones</code> and non-kindle button devices, we have a variety of Android devices of various ages. At the bottom, we have the ancient <code>palm pilot 1000</code> followed by the kindles. The <code>palm</code> is hamstrung by a touchscreen and display created in an era with much slower touchscreen technology and the <code>kindles</code> use <a href="https://en.wikipedia.org/wiki/E_Ink">e-ink</a> displays, which are much slower than the displays used on modern phones, so it’s not surprising to see those devices at the bottom.</p> <h3 id="why-is-the-apple-2e-so-fast">Why is the <code>apple 2e</code> so fast?</h3> <p>Compared to a modern computer that’s not the latest <code>ipad pro</code>, the <code>apple 2</code> has significant advantages on both the input and the output, and it also has an advantage between the input and the output for all but the most carefully written code since the <code>apple 2</code> doesn’t have to deal with context switches, buffers involved in handoffs between different processes, etc.</p> <p>On the input, if we look at modern keyboards, it’s common to see them scan their inputs at <code>100 Hz</code> to <code>200 Hz</code> (e.g., <a href="https://github.com/benblazak/ergodox-firmware">the ergodox claims to scan at <code>167 Hz</code></a>). By comparison, the <code>apple 2e</code> effectively scans at <code>556 Hz</code>. See appendix for details.</p> <p>If we look at the other end of the pipeline, the display, we can also find latency bloat there. I have a display that advertises <code>1 ms</code> switching on the box, but if we look at how long it takes for the display to actually show a character from when you can first see the trace of it on the screen until the character is solid, it can easily be <code>10 ms</code>. You can even see this effect with some high-refresh-rate displays that are sold on their allegedly good latency.</p> <p>At <code>144 Hz</code>, each frame takes <code>7 ms</code>. A change to the screen will have <code>0 ms</code> to <code>7 ms</code> of extra latency as it waits for the next frame boundary before getting rendered (on average, we expect half of the maximum latency, or <code>3.5 ms</code>). On top of that, even though my display at home advertises a <code>1 ms</code> switching time, it actually appears to take <code>10 ms</code> to fully change color once the display has started changing color. When we add up the latency from waiting for the next frame to the latency of an actual color change, we get an expected latency of <code>7/2 + 10 = 13.5 ms</code>.</p> <p>With the old CRT in the <code>apple 2e</code>, we’d expect half of a <code>60 Hz</code> refresh (<code>16.7 ms / 2</code>) plus a negligible delay, or <code>8.3 ms</code>. That’s hard to beat today: a state of the art “gaming monitor” can get the total display latency down into the same range, but in terms of marketshare, very few people have such displays, and even displays that are advertised as being fast aren’t always actually fast.</p> <h3 id="ios-rendering-pipeline">iOS rendering pipeline</h3> <p>If we look at what’s happening between the input and the output, the differences between a modern system and an <code>apple 2e</code> are too many to describe without writing an entire book.
To get a sense of the situation in modern machines, here’s former iOS/UIKit engineer <a href="https://andymatuschak.org/">Andy Matuschak</a>’s high-level sketch of what happens on iOS, which he says should be presented with the disclaimer that “this is my out of date memory of out of date information”:</p> <ul> <li>hardware has its own scanrate (e.g. <code>120 Hz</code> for recent touch panels), so that can introduce up to <code>8 ms</code> latency</li> <li>events are delivered to the kernel through firmware; this is relatively quick but system scheduling concerns may introduce a couple <code>ms</code> here</li> <li>the kernel delivers those events to privileged subscribers (here, <code>backboardd</code>) over a mach port; more scheduling loss possible</li> <li><code>backboardd</code> must determine which process should receive the event; this requires taking a lock against the window server, which shares that information (a trip back into the kernel, more scheduling delay)</li> <li><code>backboardd</code> sends that event to the process in question; more scheduling delay possible before it is processed</li> <li>those events are only dequeued on the main thread; something else may be happening on the main thread (e.g. as result of a timer or network activity), so some more latency may result, depending on that work</li> <li>UIKit introduced <code>1-2 ms</code> event processing overhead, CPU-bound</li> <li>application decides what to do with the event; apps are poorly written, so usually this takes many <code>ms</code>. the consequences are batched up in a data-driven update which is sent to the render server over IPC <ul> <li>If the app needs a new shared-memory video buffer as a consequence of the event, which will happen anytime something non-trivial is happening, that will require round-trip IPC to the render server; more scheduling delays</li> <li>(trivial changes are things which the render server can incorporate itself, like affine transformation changes or color changes to layers; non-trivial changes include anything that has to do with text, most raster and vector operations)</li> <li>These kinds of updates often end up being triple-buffered: the GPU might be using one buffer to render right now; the render server might have another buffer queued up for its next frame; and you want to draw into another. More (cross-process) locking here; more trips into kernel-land.</li> </ul></li> <li>the render server applies those updates to its render tree (a few <code>ms</code>)</li> <li>every <code>N Hz</code>, the render tree is flushed to the GPU, which is asked to fill a video buffer <ul> <li>Actually, though, there’s often triple-buffering for the screen buffer, for the same reason I described above: the GPU’s drawing into one now; another might be being read from in preparation for another frame</li> </ul></li> <li>every <code>N Hz</code>, that video buffer is swapped with another video buffer, and the display is driven directly from that memory <ul> <li>(this <code>N Hz</code> isn’t necessarily ideally aligned with the preceding step’s <code>N Hz</code>)</li> </ul></li> </ul> <p>Andy says “the actual amount of <em>work</em> happening here is typically quite small. A few <code>ms</code> of CPU time. 
Key overhead comes from:”</p> <ul> <li>periodic scanrates (input device, render server, display) imperfectly aligned</li> <li>many handoffs across process boundaries, each an opportunity for something else to get scheduled instead of the consequences of the input event</li> <li>lots of locking, especially across process boundaries, necessitating trips into kernel-land</li> </ul> <p>By comparison, on the Apple 2e, there basically aren’t handoffs, locks, or process boundaries. Some very simple code runs and writes the result to the display memory, which causes the display to get updated on the next scan.</p> <h3 id="refresh-rate-vs-latency">Refresh rate vs. latency</h3> <p>One thing that’s curious about the computer results is the impact of refresh rate. We get a <code>90 ms</code> improvement from going from <code>24 Hz</code> to <code>165 Hz</code>. At <code>24 Hz</code> each frame takes <code>41.67 ms</code> and at <code>165 Hz</code> each frame takes <code>6.061 ms</code>. As we saw above, if there weren’t any buffering, we’d expect the average latency added by frame refreshes to be <code>20.8 ms</code> in the former case and <code>3.03 ms</code> in the latter case (because we’d expect to arrive at a uniform random point in the frame and have to wait between <code>0 ms</code> and the full frame time), which is a difference of about <code>18 ms</code>. But the difference is actually <code>90 ms</code>, implying we have latency equivalent to <code>(90 - 18) / (41.67 - 6.061) = 2</code> buffered frames.</p> <p>If we plot the results from the other refresh rates on the same machine (not shown), we can see that they’re roughly in line with a “best fit” curve that we get if we assume that, for that machine running powershell, we get 2.5 frames worth of latency regardless of refresh rate. This lets us estimate what the latency would be if we equipped this low latency gaming machine with an <code>infinity Hz</code> display -- we’d expect latency to be <code>140 - 2.5 * 41.67 = 36 ms</code>, almost as fast as quick but standard machines from the 70s and 80s.</p>
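<p>To make that estimate concrete, here’s a minimal sketch in Python of the “fixed cost plus buffered frames times frame time” model, using the <code>165 Hz</code> and <code>24 Hz</code> <code>custom haswell-e</code> numbers from the table above (the function and variable names are just illustrative):</p> <pre><code>def frame_ms(hz):
    # Duration of one frame in milliseconds.
    return 1000.0 / hz

measured = {165: 50.0, 24: 140.0}  # end-to-end latency at two refresh rates

# Without buffering, only the average wait for the next frame boundary
# (half a frame) should change with refresh rate.
expected_delta = frame_ms(24) / 2 - frame_ms(165) / 2  # about 18 ms
actual_delta = measured[24] - measured[165]            # 90 ms

buffered = (actual_delta - expected_delta) / (frame_ms(24) - frame_ms(165))
print(buffered)  # about 2 buffered frames

# With roughly 2.5 frames of frame-rate-dependent latency (buffering plus
# the half-frame wait), an infinity Hz display would be expected to give:
print(measured[24] - 2.5 * frame_ms(24))  # about 36 ms
</code></pre>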
<h3 id="complexity">Complexity</h3> <p>Almost every computer and mobile device that people buy today is slower than common models of computers from the 70s and 80s. Low-latency gaming desktops and the <code>ipad pro</code> can get into the same range as quick machines from thirty to forty years ago, but most off-the-shelf devices aren’t even close.</p> <p>If we had to pick one root cause of latency bloat, we might say that it’s because of “complexity”. Of course, we all know that complexity is bad. If you’ve been to a non-academic non-enterprise tech conference in the past decade, there’s a good chance that there was at least one talk on how complexity is the root of all evil and we should aspire to reduce complexity.</p> <p>Unfortunately, it's a lot harder to remove complexity than to give a talk saying that we should remove complexity. A lot of the complexity buys us something, either directly or indirectly. When we looked at the input of a fancy modern keyboard vs. the <code>apple 2</code> keyboard, we saw that using a relatively powerful and expensive general purpose processor to handle keyboard inputs can be slower than dedicated logic for the keyboard, which would both be simpler and cheaper. However, using the processor gives people the ability to easily customize the keyboard, and also pushes the problem of “programming” the keyboard from hardware into software, which reduces the cost of making the keyboard. The more expensive chip increases the manufacturing cost, but considering how much of the cost of these small-batch artisanal keyboards is the design cost, it seems like a net win to trade manufacturing cost for ease of programming.</p> <p>We see this kind of tradeoff in every part of the pipeline. One of the biggest examples of this is the OS you might run on a modern desktop vs. the loop that’s running on the <code>apple 2</code>. Modern OSes let programmers write generic code that can deal with having other programs simultaneously running on the same machine, and do so with pretty reasonable general performance, but we pay a huge complexity cost for this and the handoffs involved in making this easy result in a significant latency penalty.</p> <p>A lot of the complexity might be called <a href="http://wiki.c2.com/?AccidentalComplexity">accidental complexity</a>, but most of that accidental complexity is there because it’s so convenient. At every level from the hardware architecture to the syscall interface to the I/O framework we use, we take on complexity, much of which could be eliminated if we could sit down and re-write all of the systems and their interfaces today, but it’s too inconvenient to re-invent the universe to reduce complexity and we get benefits from economies of scale, so we live with what we have.</p> <p>For those reasons and more, in practice, the solution to poor performance caused by “excess” complexity is often to add more complexity. In particular, the gains we’ve seen that get us back to the quickness of the quickest machines from thirty to forty years ago have come not from listening to exhortations to reduce complexity, but from piling on more complexity.</p> <p>The <code>ipad pro</code> is a feat of modern engineering; the engineering that went into increasing the refresh rate on both the input and the output as well as making sure the software pipeline doesn’t have unnecessary buffering is complex! The design and manufacture of high-refresh-rate displays that can push system latency down is also non-trivially complex in ways that aren’t necessary for bog standard <code>60 Hz</code> displays.</p> <p>This is actually a common theme when working on latency reduction. A common trick to reduce latency is to add a cache, but adding a cache to a system makes it more complex. For systems that generate new data and can’t tolerate a cache, the solutions are often even more complex. An example of this might be <a href="https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet">large scale RoCE deployments</a>. These can push remote data access latency from the millisecond range down to the microsecond range, <a href="//danluu.com/infinite-disk/">which enables new classes of applications</a>. However, this has come at a large cost in complexity. Early large-scale RoCE deployments easily took tens of person years of effort to get right and also came with a tremendous operational burden.</p> <h3 id="conclusion">Conclusion</h3> <p>It’s a bit absurd that a modern gaming machine running at <code>4,000x</code> the speed of an <code>apple 2</code>, with a CPU that has <code>500,000x</code> as many transistors (with a GPU that has <code>2,000,000x</code> as many transistors) can maybe manage the same latency as an <code>apple 2</code> in very carefully coded applications if we have a monitor with nearly <code>3x</code> the refresh rate.
It’s perhaps even more absurd that the default configuration of the <code>powerspec g405</code>, which had the fastest single-threaded performance you could get until October 2017, had more latency from keyboard-to-screen (approximately <code>3 feet</code>, maybe <code>10 feet</code> of actual cabling) than sending a packet around the world (<code>16187 mi</code> from NYC to Tokyo to London back to NYC; the actual route is longer than that because it’s too expensive to run fiber along the shortest possible path).</p> <p>On the bright side, we’re arguably emerging from the latency dark ages and it’s now possible to assemble a computer or buy a tablet with latency that’s in the same range as you could get off-the-shelf in the 70s and 80s. This reminds me a bit of the screen resolution &amp; density dark ages, where CRTs from the 90s offered better resolution and higher pixel density than affordable non-laptop LCDs until relatively recently. 4k displays have now become normal and affordable 8k displays are on the horizon, blowing past anything we saw on consumer CRTs. I don’t know that we’ll see the same kind of improvement with respect to latency, but one can hope. There are individual developers improving the experience for people who use certain, very carefully coded, applications, but it's not clear what force could cause a significant improvement in the default experience most users see.</p> <h3 id="other-posts-on-latency-measurement">Other posts on latency measurement</h3> <ul> <li><a href="//danluu.com/term-latency/">Terminal latency</a></li> <li><a href="//danluu.com/keyboard-latency/">Keyboard latency</a></li> <li><a href="//danluu.com/keyboard-v-mouse/">Mouse vs. keyboard latency</a> (human factors, not device latency)</li> <li><a href="https://pavelfatin.com/typing-with-pleasure/">Editor latency</a> (by Pavel Fatin)</li> <li><a href="http://www.lofibucket.com/articles/dwm_latency.html">Windows 10 compositing latency</a> (by Pekka Vaananen)</li> <li><a href="http://blogs.valvesoftware.com/abrash/latency-the-sine-qua-non-of-ar-and-vr/">AR/VR latency</a> (by Michael Abrash)</li> <li><a href="//danluu.com/latency-mitigation/">Latency mitigation strategies</a> (by John Carmack)</li> </ul> <h3 id="appendix-why-measure-latency">Appendix: why measure latency?</h3> <p>Latency matters! For very simple tasks, <a href="https://pdfs.semanticscholar.org/386a/15fd85c162b8e4ebb6023acdce9df2bd43ee.pdf">people can perceive latencies down to <code>2 ms</code> or less</a>. Moreover, increasing latency is not only noticeable to users, <a href="http://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf">it causes users to execute simple tasks less accurately</a>. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, <a href="https://www.youtube.com/watch?v=vOvQCPLkPt4">check out this MSR demo on touchscreen latency</a>.</p> <p>The most commonly cited document on response time is the nielsen group article on response times, which claims that latencies below <code>100ms</code> feel equivalent and are perceived as instantaneous. One easy way to see that this is false is to go into your terminal and try <code>sleep 0; echo &quot;pong&quot;</code> vs. <code>sleep 0.1; echo &quot;pong&quot;</code> (or for that matter, try playing an old game that doesn't have latency compensation, like quake 1, with <code>100 ms</code> ping, or even <code>30 ms</code> ping, or try typing in a terminal with <code>30 ms</code> ping).
For more info on this and other latency fallacies, <a href="//danluu.com/keyboard-latency/#appendix-counter-arguments-to-common-arguments-that-latency-doesn-t-matter">see this document on common misconceptions about latency</a>.</p> <p><a href="https://stackoverflow.com/a/39187441/334816">Throughput</a> also matters, but this is widely understood and measured. If you go to pretty much any mainstream review or benchmarking site, you can find a wide variety of throughput measurements, so there’s less value in writing up additional throughput measurements.</p> <h3 id="appendix-apple-2-keyboard">Appendix: apple 2 keyboard</h3> <p>The <code>apple 2e</code>, instead of using a programmed microcontroller to read the keyboard, uses a much simpler custom chip designed for reading keyboard input, the AY 3600. If we look at <a href="//danluu.com/AY3600.pdf">the AY 3600 datasheet</a>, we can see that the scan time is <code>(90 * 1/f)</code> and the debounce time is listed as <code>strobe_delay</code>. These quantities are determined by some capacitors and a resistor, which appear to be <code>47pf</code>, <code>100k ohms</code>, and <code>0.022uf</code> for the Apple 2e. Plugging these numbers into <a href="//danluu.com/AY3600.pdf">the AY3600 datasheet</a>, we can see that <code>f = 50 kHz</code>, giving us a <code>1.8 ms</code> scan delay and a <code>6.8 ms</code> debounce delay (assuming the values are accurate -- <a href="http://www.kemet.com/Lists/TechnicalArticles/Attachments/191/Why%2047%20uF%20capacitor%20drops%20to%2037%20uF-%2030%20uF-%20or%20lower.pdf">capacitors can degrade over time</a>, so we should expect the real delays to be shorter on our old Apple 2e), giving us less than <code>8.6 ms</code> for the internal keyboard logic.</p> <p>Comparing to a keyboard with a <code>167 Hz</code> scan rate that <a href="https://www.arduino.cc/en/Tutorial/Debounce">scans two extra times to debounce</a>, the equivalent figure is <code>3 * 6 ms = 18 ms</code>. With a <code>100 Hz</code> scan rate, that becomes <code>3 * 10 ms = 30 ms</code>. <code>18 ms</code> to <code>30 ms</code> of keyboard scan plus debounce latency is in line with <a href="//danluu.com/keyboard-latency/">what we saw when we did some preliminary keyboard latency measurements</a>.</p> <p>For reference, the ergodox uses a <code>16 MHz</code> microcontroller with ~80k transistors and the <code>apple 2e</code> CPU is a <code>1 MHz</code> chip with 3.5k transistors.</p>
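<p>The keyboard arithmetic above is simple enough to sanity check in a few lines of Python (a sketch using the datasheet-derived AY3600 values and the scan rates quoted above; the helper function is just illustrative):</p> <pre><code># AY3600 (apple 2e): scan time is 90 clock periods at f = 50 kHz, plus the
# datasheet-derived debounce (strobe) delay quoted above.
ay3600_scan_ms = 90 / 50e3 * 1000   # 1.8 ms
ay3600_debounce_ms = 6.8
print(ay3600_scan_ms + ay3600_debounce_ms)   # 8.6 ms upper bound for keyboard logic

# A modern keyboard that scans at some rate and takes two extra scans to
# debounce sees roughly three scan periods of latency.
def modern_keyboard_ms(scan_hz, scans_to_register=3):
    return scans_to_register * 1000.0 / scan_hz

print(modern_keyboard_ms(167))   # about 18 ms
print(modern_keyboard_ms(100))   # 30 ms
</code></pre>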
<h3 id="appendix-why-should-android-phones-have-higher-latency-than-old-apple-phones">Appendix: why should android phones have higher latency than old apple phones?</h3> <p>As we've seen, raw processing power doesn't help much with many of the causes of latency in the pipeline, like handoffs between different processes, so an android phone with a 10x more powerful processor than an ancient iphone isn't guaranteed to be quicker to respond, even if it can render javascript heavy pages faster.</p> <p>If you talk to people who work on non-Apple mobile CPUs, you'll find that they run benchmarks like dhrystone (a synthetic benchmark that was irrelevant even when it was created, in 1984) and SPEC2006 (an updated version of a workstation benchmark that was relevant in the 90s and perhaps even as late as the early 2000s if you care about workstation workloads, which are completely different from mobile workloads). This is a problem where the vendor who makes the component has <a href="//danluu.com/percentile-latency/">an intermediate target</a> that's only weakly correlated with the actual user experience. I've heard that there are people working on the pixel phones who care about end-to-end latency, but it's difficult to get good latency when you have to use components that are optimized for things like dhrystone and SPEC2006.</p> <p>If you talk to people at Apple, you'll find that they're quite cagey, but that they've been targeting the end-to-end user experience for quite a long time and they can do &quot;full stack&quot; optimizations that are difficult for android vendors to pull off. They're not literally impossible, but making a change to a chip that has to be threaded up through the OS is something you're very unlikely to see unless google is doing the optimization, and google hasn't really been serious about the end-to-end experience until recently.</p> <p>Having relatively poor performance in aspects that aren't measured is a common theme and one we saw when we looked at <a href="//danluu.com/term-latency/">terminal latency</a>. Prior to examining terminal latency, public benchmarks were all throughput oriented and the terminals that prioritized performance worked on increasing throughput, even though increasing terminal throughput isn't really useful. After those terminal latency benchmarks, some terminal authors looked into their latency and found places they could trim down buffering and remove latency. You get what you measure.</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>Most measurements were taken with the 240fps camera (<code>4.167 ms</code> resolution) in <a href="https://www.amazon.com/gp/product/B071W3DDM7/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B071W3DDM7&amp;linkId=fde44f1346bdf4bc60896580c2f3dffa">the iPhone SE</a>. Devices with response times below <code>40 ms</code> were re-measured with a 1000fps camera (<code>1 ms</code> resolution), the <a href="https://www.amazon.com/gp/product/B01M62I8Q7/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01M62I8Q7&amp;linkId=d5f90ebe0b220a8c6e550761f663fb19">Sony RX100 V</a> in PAL mode. Results in the tables are the results of multiple runs and are rounded to the nearest <code>10 ms</code> to avoid the impression of false precision. For desktop results, results are measured from when the key started moving until the screen finished updating. Note that this is different from most key-to-screen-update measurements you can find online, which typically use a setup that effectively removes much or all of the keyboard latency, which, as an end-to-end measurement, is only realistic if you have a psychic link to your computer (this isn't to say the measurements aren't useful -- if, as a programmer, you want a reproducible benchmark, it's nice to reduce measurement noise from sources that are beyond your control, but that's not relevant to end users). People often advocate measuring from one of: {the key bottoming out, the tactile feel of the switch}. Other than for measurement convenience, there appears to be no reason to do either of these, but people often claim that's when the user expects the keyboard to &quot;really&quot; work. But these are independent of when the switch actually fires.
Both the distance between the key bottoming out and activation as well as the distance between feeling feedback and activation are arbitrary and can be tuned. See <a href="//danluu.com/keyboard-latency/">this post on keyboard latency measurements for more info on keyboard fallacies</a>.</p> <p>Another significant difference is that measurements were done with settings as close to the default OS settings as possible since approximately 0% of users will futz around with display settings to reduce buffering, disable the compositor, etc. Waiting until the screen has finished updating is also different from what most end-to-end measurements do -- most consider the update &quot;done&quot; when any movement has been detected on the screen. Waiting until the screen is finished changing is analogous to webpagetest's &quot;visually complete&quot; time.</p> <p>Computer results were taken using the “default” terminal for the system (e.g., powershell on windows, lxterminal on lubuntu), <a href="//danluu.com/term-latency/">which could easily cause <code>20 ms</code> to <code>30 ms</code> difference between a fast terminal and a slow terminal</a>. Between measuring time in a terminal and measuring the full end-to-end time, measurements in this article should be slower than measurements in other, similar, articles (which tend to measure time to first change in games).</p> <p>The <code>powerspec g405</code> baseline result is using integrated graphics (the machine doesn’t come with a graphics card) and the <code>60 Hz</code> result is with a cheap video card. The baseline result was at <code>30 Hz</code> because the integrated graphics only supports <code>hdmi</code> output and the display it was attached to only runs at <code>30 Hz</code> over <code>hdmi</code>.</p> <p>Mobile results were done by using the default browser, browsing to <a href="https://danluu.com">https://danluu.com</a>, and measuring the latency from finger movement until the screen first updates to indicate that scrolling has occurred. In the cases where this didn’t make sense (kindles, gameboy color, etc.), some action that makes sense for the platform was taken (changing pages on the kindle, pressing the joypad on the gameboy color in a game, etc.). Unlike with the desktop/laptop measurements, the end time for the measurement was the first visual change, to avoid including many frames of scrolling. To make the measurement easy, the measurement was taken with a finger on the touchscreen and the timer was started when the finger started moving (to avoid having to determine when the finger first contacted the screen).</p> <p>In the case of “ties”, results are ordered by the unrounded latency as a tiebreaker, but this shouldn’t be considered significant. Differences of <code>10 ms</code> should probably also not be considered significant.</p> <p>The <code>custom haswell-e</code> was tested with <code>gsync</code> on and there was no observable difference. The year for that box is somewhat arbitrary, since the CPU is from <code>2014</code>, but the display is newer (I believe you couldn’t get a <code>165 Hz</code> display until <code>2015</code>).</p> <p>The number of transistors for some modern machines is a rough estimate because exact numbers aren’t public.
Feel free to ping me if you have a better estimate!</p> <p>The color scales for latency and year are linear and the color scales for clock speed and number of transistors are log scale.</p> <p>All Linux results were done with a <a href="https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/">pre-KPTI</a> kernel. It's possible that KPTI will impact user perceivable latency.</p> <p>Measurements were done as cleanly as possible (without other things running on the machine/device when possible, with a device that was nearly full on battery for devices with batteries). Latencies when other software is running on the device or when devices are low on battery might be much higher.</p> <p>If you want a reference to compare the kindle against, a moderately quick page turn in a physical book appears to be about <code>200 ms</code>.</p> <p><i>This is a work in progress. I expect to get benchmarks from a lot more old computers the next time I visit Seattle. If you know of old computers I can test in the NYC area (that have their original displays or something like them), let me know! If you have a device you’d like to donate for testing, feel free to mail it to</i></p> <p>Dan Luu<br> Recurse Center<br> 455 Broadway, 2nd Floor<br> New York, NY 10013</p> <p><i>Thanks to <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, David Albert, Bert Muthalaly, Christian Ternus, Kate Murphy, Ikhwan Lee, Peter Bhat Harkins, Leah Hanson, Alicia Thilani Singham Goodwin, Amy Huang, Dan Bentley, Jacquin Mininger, Rob, Susan Steinman, Raph Levien, Max McCrea, Peter Town, Jon Cinque, Anonymous, and Jonathan Dahan for donating devices to test and thanks to Leah Hanson, Andy Matuschak, Milosz Danczak, amos (@fasterthanlime), @emitter_coupled, Josh Jordan, mrob, and David Albert for comments/corrections/discussion.</i></p> How good are decisions? Evaluating decision quality in domains where evaluation is easy bad-decisions/ Tue, 21 Nov 2017 00:00:00 +0000 bad-decisions/ <p>A statement I commonly hear in tech-utopian circles is that some seeming inefficiency can’t actually be inefficient because the market is efficient and inefficiencies will quickly be eliminated. A contentious example of this is <a href="//danluu.com/tech-discrimination/">the claim that companies can’t be discriminating because the market is too competitive to tolerate discrimination</a>. A less contentious example is that when you see a big company doing something that seems bizarrely inefficient, maybe it’s not inefficient and you just lack the information necessary to understand why the decision was efficient. These kinds of statements are often accompanied by statements about how &quot;incentives matter&quot; or the CEO has &quot;skin in the game&quot; whereas the commentator does not.</p> <p>Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.</p> <p>In this post, we'll look at two classes of examples where we can see how good people's decisions are and how they respond to easy-to-obtain data showing that people are making bad decisions.
Both classes of examples are from domains where the people making or discussing the decision seem to care a lot about the decision and the data clearly show that the decisions are very poor.</p> <p>The first class of example comes from sports and the second comes from board games. One nice thing about sports is that they often have detailed play-by-play data and well-defined win criteria which lets us tell, on average, what the expected value of a decision is. In this post, we’ll look at the cost of bad decision making in one sport and then briefly discuss why decision quality in sports might be the same as or better than decision quality in other fields. Sports are fertile ground because decision making was non-data driven and generally terrible until fairly recently, so we have over a century of information for major U.S. sports and, for a decent fraction of that time period, fans would write analyses about how poor decision making was and how much it cost teams, which teams would ignore (this has since changed and basically every team has a staff of stats-PhDs or the equivalent looking at data).</p> <h4 id="baseball">Baseball</h4> <p>In <a href="talent/">another post, we looked at how &quot;hiring&quot; decisions in sports were total nonsense</a>. In this post, because one of the top &quot;rationality community&quot; thought leaders gave the common excuse that in-game baseball decision making by coaches isn't that costly (&quot;Do bad in-game decisions cost games? Absolutely. But not <em>that many</em> games. Maybe they lose you 4 a year out of 162.&quot;; the entire post implies this isn't a big deal and it's fine to throw away 4 games), we'll look at how costly bad decision making is and how much teams spend to buy an equivalent number of wins in other ways. However, you could do the same kind of analysis for football, hockey, basketball, etc., and my understanding is that you’d get a roughly similar result in all of those cases.</p> <p>We’re going to model baseball as a state machine, both because that makes it easy to understand the expected value of particular decisions and because this lets us talk about the value of decisions without having to go over most of the rules of baseball.</p> <p>We can treat each baseball game as an independent event. In each game, two teams play against each other and the team that scores more runs (points) wins. Each game is split into 9 “innings” and in each inning each team will get one set of chances on offense. In each inning, each team will play until it gets 3 “outs”. Any given play may or may not result in an out.</p> <p>One chunk of state in our state machine is the number of outs and the inning. The other chunks of state we’re going to track are who’s “on base” and which player is “at bat”. Each team defines some order of batters for their active players and after each player bats once this repeats in a loop until the team collects 3 outs and the inning is over. The state of who is at bat is saved between innings. Just for example, you might see batters 1-5 bat in the first inning, 6-9 and then 1 again in the second inning, 2- … etc.</p> <p>When a player is at bat, the player may advance to a base and players who are on base may also advance, depending on what happens. When a player advances 4 bases (that is, through 1B, 2B, 3B, to what would be 4B except that it isn’t called that) a run is scored and the player is removed from the base.
As mentioned above, various events may cause a player to be out, in which case they also stop being on base.</p> <p>An example state from our state machine is:</p> <pre><code>{1B, 3B; 2 outs} </code></pre> <p>This says that there’s a player on 1B, a player on 3B, there are two outs. Note that this is independent of the score, who’s actually playing, and the inning.</p> <p>Another state is:</p> <pre><code>{--; 0 outs} </code></pre> <p>With a model like this, if we want to determine the expected value of the above state, we just need to look up the total number of runs across all innings played in a season divided by the number of innings to find the expected number of runs from the state above (ignoring the 9th inning because a quirk of baseball rules distorts statistics from the 9th inning). If we do this, we find that, from the above state, a team will score .555 runs in expectation.</p> <p>We can then compute the expected number of runs for all of the other states:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <tr> <th></th><th>0</th><th>1</th><th>2</th></tr> <tr><th>bases</th><th colspan="3">outs</th></tr> <tr> <th>--</th><td bgcolor=#f7f7f7><font color=black>.555</font></td><td bgcolor=#d6604d><font color=black>.297</font></td><td bgcolor=#67001f><font color=white>.117</font></td></tr> <tr> <th>1B</th><td bgcolor=#d1e5f0><font color=black>.953</font></td><td bgcolor=#f7f7f7><font color=black>.573</font></td><td bgcolor=#b2182b><font color=black>.251</font></td></tr> <tr> <th>2B</th><td bgcolor=#d1e5f0><font color=black>1.189</font></td><td bgcolor=#f7f7f7><font color=black>.725</font></td><td bgcolor=#d6604d><font color=black>.344</font></td></tr> <tr> <th>3B</th><td bgcolor=#92c5de><font color=black>1.482</font></td><td bgcolor=#d1e5f0><font color=black>.983</font></td><td bgcolor=#f4a582><font color=black>.387</font></td></tr> <tr> <th>1B,2B</th><td bgcolor=#92c5de><font color=black>1.573</font></td><td bgcolor=#d1e5f0><font color=black>.971</font></td><td bgcolor=#f4a582><font color=black>.466</font></td></tr> <tr> <th>1B,3B</th><td bgcolor=#4393c3><font color=black>1.904</font></td><td bgcolor=#d1e5f0><font color=black>1.243</font></td><td bgcolor=#fddbc7><font color=black>.538</font></td></tr> <tr> <th>2B,3B</th><td bgcolor=#2166ac><font color=black>2.052</font></td><td bgcolor=#92c5de><font color=black>1.467</font></td><td bgcolor=#f7f7f7><font color=black>.634</font></td></tr> <tr> <th>1B,2B,3B</th><td bgcolor=#053061><font color=white>2.417</font></td><td bgcolor=#92c5de><font color=black>1.650</font></td><td bgcolor=#f7f7f7><font color=black>.815</font></td></tr> <tr> </tr> </table></p> <p>In this table, each entry is the expected number of runs from the remainder of the inning from some particular state. Each column shows the number of outs and each row shows the state of the bases. The color coding scheme is: the starting state (<code>.555</code> runs) has a white background. States with higher run expectation are more blue and states with lower run expectation are more red.</p> <p>This table and the other stats in this post come from <a href="https://www.amazon.com/gp/product/1494260174/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1494260174&amp;linkId=1d8e432457b958c87f46820d5aa51539">The Book by Tango et al.</a>, which mostly discussed baseball between 1999 and 2002. 
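If you want to play with these run values yourself, here’s a minimal sketch in Python (the state labels and helper function are just illustrative; the numbers are transcribed from the table above) that computes the value of the transitions discussed next: <pre><code># Expected runs scored in the rest of the inning, by (bases occupied, outs),
# transcribed from the table above.
RUN_EXP = {
    ('--', 0): 0.555, ('--', 1): 0.297, ('--', 2): 0.117,
    ('1B', 0): 0.953, ('1B', 1): 0.573, ('1B', 2): 0.251,
    ('2B', 0): 1.189, ('2B', 1): 0.725, ('2B', 2): 0.344,
    ('3B', 0): 1.482, ('3B', 1): 0.983, ('3B', 2): 0.387,
    ('1B,2B', 0): 1.573, ('1B,2B', 1): 0.971, ('1B,2B', 2): 0.466,
    ('1B,3B', 0): 1.904, ('1B,3B', 1): 1.243, ('1B,3B', 2): 0.538,
    ('2B,3B', 0): 2.052, ('2B,3B', 1): 1.467, ('2B,3B', 2): 0.634,
    ('1B,2B,3B', 0): 2.417, ('1B,2B,3B', 1): 1.650, ('1B,2B,3B', 2): 0.815,
}

def transition_value(before, after, runs_scored=0):
    # Run value of moving from one state to another, plus any runs scored.
    return runs_scored + RUN_EXP[after] - RUN_EXP[before]

# Putting someone on first with no outs:
print(transition_value(('--', 0), ('1B', 0)))    # 0.398 runs

# Stealing 2B with a runner on 1B and no outs:
gain = transition_value(('1B', 0), ('2B', 0))    # +0.236 runs on success
loss = -transition_value(('1B', 0), ('--', 1))   # 0.656 runs lost on failure
print(loss / (loss + gain))                      # ~0.735 break-even success rate
</code></pre>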
See the appendix if you're curious about how things change if we use a more detailed model.</p> <p>The state we’re tracking for an inning here is who’s on base and the number of outs. Innings start with nobody on base and no outs.</p> <p>As above, we see that we start the inning with <code>.555</code> runs in expectation. If a play puts someone on 1B without getting an out, we now have <code>.953</code> runs in expectation, i.e., putting someone on first without an out is worth <code>.953 - .555 = .398</code> runs.</p> <p>This immediately gives us the value of some decisions, e.g., trying to “steal” <code>2B</code> with no outs and someone on first. If we look at cases where the batter’s state doesn’t change, a successful steal moves us to the <code>{2B, 0 outs}</code> state, i.e., it gives us <code>1.189 - .953 = .236</code> runs. A failed steal moves us to the <code>{--, 1 out}</code> state, i.e., it gives us <code>.297 - .953 = -.656</code> runs. To break even, we need to succeed <code>.656 / .236 = 2.78x</code> more often than we fail, i.e., we need a <code>.735</code> success rate to break even. If we want to compute the average value of a stolen base, we can compute the weighted sum over all states, but for now, let’s just say that it’s possible to do so and that you need something like a <code>.735</code> success rate for stolen bases to make sense.</p> <p>We can then look at the stolen base success rate of teams to see that, in any given season, maybe 5-10 teams are doing better than breakeven, leaving 20-25 teams at breakeven or below (mostly below). If we look at a bad but not historically bad stolen-base team of that era, they might have a .6 success rate. It wouldn’t be unusual for a team from that era to make between 100 and 200 attempts. Just so we can compute an approximation, if we assume they were all attempts from the {1B, 0 outs} state, the average run value per attempt would be .4 * (-.656) + .6 * .236 = -0.12 runs per attempt. Another first-order approximation is that a delta of 10 runs is worth 1 win, so at 100 attempts we have -1.2 wins and at 200 attempts we have -2.4 wins.</p> <p>If we run the math across actual states instead of using the first order approximation, we see that the average failed steal is worth -.467 runs and the average successful steal is worth .175 runs. In that case, a steal attempt with a .6 success rate is worth .4 * (-.467) + .6 * .175 = -0.082 runs. With this new approximation, our estimate for the approximate cost in wins of stealing “as normal” vs. having a “no stealing” rule for a team that steals badly and often is .82 to 1.64 wins per season. Note that this underestimates the cost of stealing since getting into position to steal increases the odds of a successful “pickoff”, which we haven’t accounted for. From our state-machine standpoint, a pickoff is almost equivalent to a failed steal, but the analysis necessary to compute the difference in pickoff probability is beyond the scope of this post.</p> <p>We can also do this for other plays coaches can cause (or prevent). For the “intentional walk”, we see that an intentional walk appears to be worth .102 runs for the opposing team. In 2002, a team that issued “a lot” of intentional walks might have issued 50, resulting in 50 * .102 runs for the opposing team, giving a loss of roughly 5 runs or .5 wins.</p>
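<p>Putting the per-attempt values above together, the season-level arithmetic for steals and intentional walks is only a few more lines (again a sketch, using the illustrative attempt counts from this section and the first-order approximation that 10 runs is worth 1 win):</p> <pre><code>RUNS_PER_WIN = 10.0   # first-order approximation used above

def steal_attempt_runs(success_rate, success_value=0.175, failure_value=-0.467):
    # Average run value of a steal attempt, using the average values above.
    return success_rate * success_value + (1 - success_rate) * failure_value

per_attempt = steal_attempt_runs(0.6)              # about -0.082 runs per attempt
for attempts in (100, 200):
    print(attempts * per_attempt / RUNS_PER_WIN)   # about -0.8 to -1.6 wins

# 50 intentional walks at .102 runs each, given to the opposing team:
print(50 * 0.102 / RUNS_PER_WIN)                   # about 0.5 wins handed to the opponent
</code></pre>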
If we look at the league where pitchers don’t bat, a team that was heavy on sac bunts might’ve done 49 sac bunts (we do this to avoid “pitcher” bunts, which add complexity to the approximation), costing a total of 49 * .027 = 1.32 runs or .132 wins.</p> <p>Another decision that’s made by a coach is setting the batting order. Players bat (take a turn) in order, 1-9, mod 9. That is, when the 10th “player” is up, we actually go back around and the 1st player bats. At some point the game ends, so not everyone on the team ends up with the same number of “at bats”.</p> <p>There’s a just-so story that justifies putting the fastest player first, someone with a high “batting average” second, someone pretty good third, your best batter fourth, etc. This story, or something like it, has been standard for over 100 years.</p> <p>I’m not going to walk through the math for computing a better batting order because I don’t think there’s a short, easy-to-describe approximation. It turns out that if we compute the difference between an “optimal” order and a “typical” order justified by the story in the previous paragraph, using an optimal order appears to be worth between 1 and 2 wins per season.</p> <p>These approximations all leave out important information. In three out of the four cases, we assumed an average player at all times and didn’t look at who was at bat. The information above actually takes this into account to some extent, but not fully. How exactly this differs from a better approximation is a long story and probably too much detail for a post that’s using baseball to talk about decisions outside of baseball, so let’s just say that we have a pretty decent but not amazing approximation that says that a coach who makes bad decisions following conventional wisdom, in the normal range of bad decisions during a baseball season, might cost their team something like 1 + 1.2 + .5 + .132 = 2.83 wins on these four decisions alone vs. a decision rule that says “never do these actions that, on average, have negative value”. If we compare to a better decision rule such as “do these actions when they have positive value and not when they have negative value” or a manager that generally makes good decisions, let’s conservatively estimate that’s maybe worth 3 wins.</p> <p>We’ve looked at four decisions (sac bunt, steal, intentional walk, and batting order). But there are a lot of other decisions! Let’s arbitrarily say that if we look at all decisions and not just these four decisions, having a better heuristic for all decisions might be worth 4 or 5 wins per season.</p> <p>What does 4 or 5 wins per season really mean? One way to look at it is that baseball teams play 162 games, so an “average” team wins 81 games. If we look at the seasons covered, the number of wins that teams that made the playoffs had was {103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93, 93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98, 95, 97, 94}. Because of the structure of the system, we can’t name a single number for a season and say that N wins are necessary to make the playoffs and that teams with fewer than N wins won’t make the playoffs, but we can say that 95 wins gives a team decent odds of making the playoffs. 95 - 81 = 14. 5 wins is more than a third of the difference between an average team and a team that makes the playoffs.
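</p> <p>As a quick sanity check on that arithmetic, here’s a small Python sketch using the win totals listed above (the 5-win figure is the rough estimate from earlier in this post):</p> <pre><code>
# Win totals of the playoff teams from the seasons discussed above.
playoff_wins = [103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93,
                93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98,
                95, 97, 94]

average_team_wins = 162 / 2      # 81 wins for a .500 team
playoff_threshold = 95           # "decent odds of making the playoffs"
decision_value = 5               # rough estimate for better in-game decisions

gap = playoff_threshold - average_team_wins   # 14 wins
print(min(playoff_wins))                      # lowest win total among these playoff teams
print(decision_value / gap)                   # ~0.36, i.e., more than a third of the gap
</code></pre> <p>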
This is a huge deal both in terms of prestige and also direct economic value.</p> <p>If we want to look at it at the margin instead of on average, the smallest delta in wins between teams that made the playoffs and teams that didn’t in each league was {1, 7, 8, 1, 6, 2, 6, 3}. For teams that are on the edge, a delta of 5 wins wouldn’t always be the difference between a successful season (making playoffs) and an unsuccessful season (not making playoffs), but there are teams within a 5 win delta of making the playoffs in most seasons. If we were actually running a baseball team, we’d want to use a much more fine-grained model, but as a first approximation we can say that in-game decisions are a significant factor in team performance and that, using some kind of computation, we can determine the expected cost of non-optimal decisions.</p> <p>Another way to look at what 5 wins is worth is to look at what it costs to get a non-pitcher who’s 5 wins above average (WAA) (we look at non-pitchers because non-pitchers tend to play in every game and pitchers tend to play in parts of some games, making a comparison between pitchers and non-pitchers more complicated). Of the 8 non-pitcher positions (we look at non-pitcher positions because it makes comparisons simpler), there are 30 teams, so we have 240 team-position pairs. In 2002, of these 240 team-position pairs, <a href="https://www.baseball-reference.com/leagues/team_compare.cgi?request=1&amp;year=2002&amp;lg=MLB&amp;stat=WAA">there are two that were &gt;= 5 WAA</a>, Texas-SS (Alex Rodriguez, paid $22m) and SF-LF (Barry Bonds, paid $15m). If we look at the other seasons in the range of dates we’re looking at, there are either 2 or 3 team-position pairs where a team is able to get &gt;= 5 WAA in a season. These aren’t stable across seasons because player performance is volatile, so it’s not as easy as finding someone great and paying them $15m. For example, in 2002, there were 7 non-pitchers paid $14m or more and only two of them were worth 5 WAA or more. For reference, the average total team payroll (teams have 26 players each) in 2002 was $67m, with a minimum of $34m and a max of $126m. At the time a $1m salary for a manager would’ve been considered generous, making a 5 WAA manager an incredible deal.</p> <p>5 WAA assumes typical decision making lining up with events in a bad, but not worst-case way. A more typical case might be that a manager costs a team 3 wins. In that case, in 2002, there were 25 team-position pairs out of 240 where a single player could make up for the loss caused by managing according to conventional wisdom. Players who provide that much value and who aren’t locked up in artificially cheap deals with particular teams due to the mechanics of player transfers are still much more expensive than managers.</p> <p>If we look at how teams have adopted data analysis in order to improve both in-game decision making and team-composition decisions, it’s been a slow, multi-decade process. <a href="https://www.amazon.com/gp/product/0393324818/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0393324818&amp;linkId=391fa6566a95327d7f06f533fb3e2daf">Moneyball</a> describes part of the shift from using intuition and observation to select players to incorporating statistics into the process.
Stats nerds had been talking about how you could do this since at least <a href="https://en.wikipedia.org/wiki/Sabermetrics">1971</a>, but no team really took it seriously until the 90s, and the ideas didn’t really become mainstream until the mid 2000s, after a bestseller had been published.</p> <p>If we examine how much teams have improved at the in-game decisions we looked at here, the process has been even slower. It’s still true today that statistics-driven decisions aren’t mainstream. Things are getting better, and the aggregate cost of the non-optimal decisions mentioned here has been getting lower over the past couple decades as intuition-driven decisions slowly converge to more closely match what stats nerds have been saying for decades. For example, if we look at the total number of sac bunts recorded across all teams from 1999 until now, we see:</p> <div style="overflow-x:auto;"> <table> <tr> <th>1999</th><th>2000</th><th>2001</th><th>2002</th><th>2003</th><th>2004</th><th>2005</th><th>2006</th><th>2007</th><th>2008</th><th>2009</th><th>2010</th><th>2011</th><th>2012</th><th>2013</th><th>2014</th><th>2015</th><th>2016</th><th>2017</th></tr> <tr> <td bgcolor=#e31a1c><font color=black>1604</font></td><td bgcolor=#e31a1c><font color=black>1628</font></td><td bgcolor=#e31a1c><font color=black>1607</font></td><td bgcolor=#bd0026><font color=black>1633</font></td><td bgcolor=#e31a1c><font color=black>1626</font></td><td bgcolor=#800026><font color=white>1731</font></td><td bgcolor=#e31a1c><font color=black>1620</font></td><td bgcolor=#bd0026><font color=black>1651</font></td><td bgcolor=#e31a1c><font color=black>1540</font></td><td bgcolor=#fc4e2a><font color=black>1526</font></td><td bgcolor=#bd0026><font color=black>1635</font></td><td bgcolor=#e31a1c><font color=black>1544</font></td><td bgcolor=#bd0026><font color=black>1667</font></td><td bgcolor=#fc4e2a><font color=black>1479</font></td><td bgcolor=#fd8d3c><font color=black>1383</font></td><td bgcolor=#fd8d3c><font color=black>1343</font></td><td bgcolor=#fed976><font color=black>1200</font></td><td bgcolor=#ffffcc><font color=black>1025</font></td><td bgcolor=#ffffcc><font color=black>925</font></td></tr> </table> </div> <p>Despite decades of statistical evidence that sac bunts are overused, we didn’t really see a decline across all teams until 2012 or so. Why this is varies on a team-by-team and case-by-case basis, but the fundamental story that’s been repeated over and over again, both for statistically-driven team composition and statistically-driven in-game decisions, is that the people who have the power to make decisions often stick to conventional wisdom instead of using “radical” statistically-driven ideas. There are a number of reasons why this happens. One high-level reason is that the change we’re talking about was a cultural change and cultural change is slow. Even as this change was happening and teams that were more data-driven were outperforming relative to their budgets, anti-data folks ridiculed anyone who was using data. If you were one of the early data folks, you'd have to be <a href="look-stupid/">willing to tolerate a lot of the biggest names in the game calling you stupid, as well as fans, friends, etc.</a>.
It doesn’t surprise people when it takes a generation for scientific consensus to shift in the face of this kind of opposition, so why should baseball be any different?</p> <p>One specific lower-level reason “obviously” non-optimal decisions can persist for so long is that there’s a lot of noise in team results. You sometimes see a manager make some radical decisions (not necessarily statistics-driven), followed by some poor results, causing management to fire the manager. There’s so much volatility that you can’t really judge players or managers based on small samples, but this doesn’t stop people from doing so. The combination of volatility and skepticism of radical ideas heavily disincentivizes going against conventional wisdom.</p> <p>Among the many consequences of this noise is the fact that the winner of the &quot;world series&quot; (the baseball championship) is heavily determined by randomness. Whether or not a team makes the playoffs is determined over 162 games, which isn't enough to remove all randomness, but is enough that the result isn't mostly determined by randomness. This isn't true of the playoffs, which are too short for the outcome to be primarily determined by the difference in the quality of teams. Once a team wins the world series, people come up with all kinds of just-so stories to justify why the team should've won, but if we look across all games, we can see that the stories are just stories. This is, perhaps, not so different from listening to people tell you why their startup was successful.</p> <p>There are metrics we can use that are better predictors of future wins and losses (i.e., are less volatile than wins and losses), but, until recently, convincing people that those metrics were meaningful was also a radical idea.</p> <h4 id="board-games">Board games</h4> <p>That's the baseball example. Now on to the board game example. In this example, we'll look at people who make comments on &quot;modern&quot; board game strategy, by which I mean they comment on strategy for games like Catan, Puerto Rico, Ark Nova, etc.</p> <p>People often vehemently disagree about what works and what doesn't work. Today, most online discussions of this sort happen on boardgamegeek (BGG), which is, by far, the largest forum for discussing board games. A quirk of these discussions is that people often use the same username on BGG as on boardgamearena (BGA), an online boardgame site where people's ratings (Elos) are tracked and publicly visible.</p> <p>So, in these discussions, you'll see someone saying that strategy X is dominant. Then someone else will come in and say, no, strategy Y beats strategy X, I win with strategy Y all the time when people do strategy X, etc. If you understand the game, you'll see that the person arguing for X is correct and the person arguing for Y is wrong, and then you'll look up these people's Elos and find that the X-player is a high-ranked player and the Y-player is a low-ranked player.</p> <p>The thing that's odd about this is, how come the low-ranked players so confidently argue that their position is correct? Not only do they get per-game information indicating that they're wrong (because they often lose), they have a rating that aggregates all of their gameplay and tells them, roughly, how good they are.
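</p> <p>For readers who haven't played on a rated game site: an Elo-style rating is just a running aggregate of game results, where each win or loss shifts your rating by an amount that depends on how strong your opponent was. The exact formulas vary by site and by game, so the sketch below is the generic textbook version of Elo, not any particular site's implementation:</p> <pre><code>
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the classic Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, a_won, k=20):
    """Return A's new rating after one game against B."""
    actual = 1.0 if a_won else 0.0
    return rating_a + k * (actual - expected_score(rating_a, rating_b))

# A player who keeps losing will see their rating drift down, no matter
# how convinced they are that their strategy works.
r = 1300
for _ in range(10):
    r = update(r, 1600, a_won=False)
</code></pre> <p>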
Despite this rating telling them that they don't know what they're doing in the game, they're utterly convinced that they're strong players who are playing well and that they not only have good strategies, but that their strategies are good enough that they should be advising much higher rated players on how to play.</p> <p>When people correct these folks, they often get offended because they're sure that they're good and they'll say things like &quot;I'm a good [game name] player. I win a lot of games&quot;, followed by some indignation that their advice isn't taken seriously and/or huffy comments about how people who think strategy X works are all engaging in group think, even when these people are playing in the same pool of competitive online players where, if it were true that strategy X players were engaging in incorrect group think, strategy Y players would beat them and have higher ratings. And, as we noted <a href="p95-skill/">when we looked at video game skill</a>, players often express great frustration and anger at losing and not being better at the game, so it's clear that they want to do better and win. But even having a rating that pretty accurately sums up your skill displayed on your screen at all times doesn't seem to be enough to get people to realize that they're, on average, making poor decisions and could easily make better decisions by taking advice from higher-rated players instead of insisting that their losing strategies work.</p> <p>When looking at the video game Overwatch, we noted that people often overestimated their own skill and blamed teammates for losses. But in these kinds of boardgames, people are generally not playing on teams, so there's no one else to blame. And not only is there no teammate to blame, in most games, the most serious rated game format is 1v1 and not some kind of multi-player FFA, so you can't even blame a random person who's not on your team. In general, someone's rating in a 1v1 game is about as accurate a metric as you're going to get for someone's domain-specific decision making skill in any domain.</p> <p>And yet, people are extremely confident about their own skills despite their low ratings. If you look at board game strategy commentary today, almost all of it is wrong and, when you look up people's ratings, almost all of it comes from people who are low rated in every game they play, who don't appear to understand how to play any game well. Of course there's nothing inherently wrong with playing a game poorly if that's what someone enjoys. The incongruity here comes from people playing poorly, having a well-defined rating that shows that they're playing poorly, being convinced that they're playing well, and taking offence when people note that the strategies they advocate for don't work.</p> <h4 id="life-outside-of-games">Life outside of games</h4> <p>In the world, it's rare to get evidence of the quality of our decision making that's as clear as we see in sports and board games. When making an engineering decision, you almost never have data that's as clean as you do in baseball, nor do you ever have an Elo rating that can basically accurately sum up how good your past decision making is. This makes it much easier to adjust to feedback and make good decisions in sports and board games and yet, we can observe that most decision making in sports and board games is poor.
This was true basically forever in sports despite a huge amount of money being on the line, and is true in board games despite people getting quite worked up over them and seeming to care a lot.</p> <p>If we think about the general version of the baseball decision we examined, what’s happening is that decisions have probabilistic payoffs. There’s very high variance in actual outcomes (wins and losses), so it’s possible to make good decisions and not see the direct effect of them for a long time. Even if there are metrics that give us a better idea of what the “true” value of a decision is, if you’re operating in an environment where your management doesn’t believe in those metrics, you’re going to have a hard time keeping your job (or getting a job in the first place) if you want to do something radical whose value is only demonstrated by some obscure-sounding metric unless they take a chance on you for a year or two. There have been some major phase changes in what metrics are accepted, but they’ve taken decades.</p> <p>If we look at business or engineering decisions, the situation is much messier. If we look at product or infrastructure success as a “win”, there seems to be much more noise in whether or not a team gets a “win”. Moreover, unlike in baseball, the sort of play-by-play or even game data that would let someone analyze “wins” and “losses” to determine the underlying cause isn’t recorded, so it’s impossible to determine the true value of decisions. And even if the data were available, there are so many more factors that determine whether or not something is a “win” that it’s not clear if we’d be able to determine the expected value of decisions even if we had the data.</p> <p>We’ve seen that in a field where one can sit down and determine the expected value of decisions, it can take decades for this kind of analysis to influence some important decisions. If we look at fields where it’s more difficult to determine the true value of decisions, how long should we expect it to take for “good” decision making to surface? It seems like it would be a while, perhaps forever, unless there’s something about the structure of baseball and other sports that makes it particularly difficult to remove a poor decision maker and insert a better decision maker.</p> <p>One might argue that baseball is different because there are a fixed number of teams and it’s quite unusual for a new team to enter the market, but if you look at things like public clouds, operating systems, search engines, car manufacturers, etc., the situation doesn’t look that different. If anything, it appears to be much cheaper to take over a baseball team and replace management (you sometimes see baseball teams sell for roughly a billion dollars) and there are more baseball teams than there are competitive products in the markets we just discussed, at least in the U.S. One might also argue that, if you look at the structure of baseball teams, it’s clear that positions are typically not handed out based on decision-making merit and that <a href="https://thezvi.wordpress.com/2017/10/29/leaders-of-men/">other factors tend to dominate</a>, but this doesn’t seem obviously more true in baseball than in engineering fields.</p> <p>This isn’t to say that we expect obviously bad decisions everywhere. 
You might get that idea if you hung out on baseball stats nerd forums before <a href="https://www.amazon.com/gp/product/0393324818/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0393324818&amp;linkId=3ec395b394d2f8caaa30df2b2fc15d5d">Moneyball</a> was published (and for quite some time after), but if you looked at formula 1 (F1) around the same time, you’d see teams employing PhDs who are experts in economics and game theory to make sure they were making reasonable decisions. This doesn’t mean that F1 teams always make perfect decisions, but they at least avoided making decisions that interested amateurs could identify as inefficient for decades. There are some fields where competition is cutthroat and you have to do rigorous analysis to survive and there are some fields where competition is more sedate. In living memory, <a href="https://repositories.lib.utexas.edu/bitstream/handle/2152/17760/bournen.pdf?%852">there was a time when training for sports was considered ungentlemanly</a> and someone who trained with anything resembling modern training techniques would’ve had a huge advantage. Over the past decade or so, we’re seeing the same kind of shift, but for statistical techniques in baseball instead of training in various sports.</p> <p>If we want to look at the quality of decision making, it's too simplistic to say that we expect a firm to make good decisions because they're exposed to markets and there's economic value in making good decisions and people within the firm will probably be rewarded greatly if they make good decisions. You can't even tell if this is happening by asking people if they're making rigorous, data-driven decisions. If you'd asked people in baseball whether they were using data in their decisions, they would've said yes throughout the 70s and 80s. Baseball has long been known as a sport where people track all kinds of numbers and then use those numbers. It's just that people didn't backtest their predictions, let alone backtest their predictions with holdouts.</p> <p>The <a href="https://www.uky.edu/~eushe2/Pajares/Kuhn.html">paradigm shift</a> of using data effectively to drive decisions has been hitting different fields at different rates over the past few decades, both inside and outside of sports. Why this change happened in F1 before it happened in baseball comes down to a combination of the difference in incentive structure in F1 teams vs. baseball teams and the difference in <a href="wat/">institutional culture</a>. We may take a look at this in a future post, but this turns out to be a fairly complicated issue that requires a lot more background.</p> <p>Looking at the overall picture, we could view this glass as being half empty (wow, people suck at making easy decisions that they consider very important, so they must be absolutely horrible at making non-easy decisions) or the glass as being half full (wow, you can find good opportunities for improvement in many places, even in areas that people claim must be hard, where <a href="cocktail-ideas/">econ 101 reasoning</a> like &quot;they must be making the right call because they're highly incentivized&quot; could trick one into thinking that there aren't easy opportunities available).</p> <h3 id="appendix-non-idealities-in-our-baseball-analysis">Appendix: non-idealities in our baseball analysis</h3> <p>In order to make this a short blog post and not a book, there are a lot of simplifications in the approximation we discussed.
One major simplification is the idea that all runs are equivalent. This is close enough to true to be a decent approximation. But there are situations where the approximation isn’t very good, such as when it’s the 9th inning and the game is tied. In that case, a decision that increases the probability of scoring 1 run but decreases the probability of scoring multiple runs is actually the right choice.</p> <p>This is often given as a justification for a relatively late-game sac bunt. But if we look at the probability of a successful sac bunt, we see that it goes down in later innings. We didn’t talk about how the defense is set up, but defenses can set up in ways that reduce the probability of a successful sac bunt but increase the probability of success of non-bunts and vice versa. Before the last inning, this actually makes the sac bunt worse late in the game, not better! If we take all of that into account in the last inning of a tie game, whether a sac bunt is a good idea then depends on something else we haven’t discussed: the batter at the plate.</p> <p>In our simplified model, we computed the expected value in runs across all batters. But at any given time, a particular player is batting. A successful sac bunt advances runners and increases the number of outs by one. The alternative is to let the batter “swing away”, which will result in some random outcome. The better the batter, the higher the probability of an outcome that’s better than the outcome of a sac bunt. To determine the optimal decision, we not only need to know how good the current batter is but how good the subsequent batters are. One common justification for the sac bunt is that pitchers are terrible hitters and they’re not bad at sac bunting because they have so much practice doing it (because they’re terrible hitters), but it turns out that pitchers are also below average sac bunters and that the argument that we should expect pitchers to sac because they’re bad hitters doesn’t hold up if we look at the data in detail.</p> <p>Another reason to sac bunt (or bunt in general) is that the tendency to sometimes do this induces changes in defense which make non-bunt plays work better.</p> <p>A full computation should also take into account the number of balls and strikes the current batter has (a piece of state we haven’t discussed at all), as well as the speed of the batter and the players on base, the particular stadium the game is being played in, the opposing pitcher, and the quality of their defense. All of this can be done, even on a laptop -- this is all “small data” as far as computers are concerned, but walking through the analysis even for one particular decision would be substantially longer than everything in this post combined, including this disclaimer. It’s perhaps a little surprising that taking all of these non-idealities into account doesn’t overturn the general result, but it turns out that it doesn’t (it finds that there are many situations in which sac bunts have positive expected value, but that sac bunts were still heavily overused for decades).</p> <p>There’s a similar situation for intentional walks, where the non-idealities in our analysis appear to support issuing intentional walks. In particular, the two main conventional justifications for an intentional walk are</p> <ol> <li>By walking the current batter, we can set up a “force” or a “double play” (increase the probability of getting one out or two outs in one play).
If the game is tied in the last inning, putting another player on base has little downside and has the upside of increasing the probability of allowing zero runs and continuing the tie.</li> <li>By walking the current batter, we can get to the next, worse batter.</li> </ol> <p>An example situation where people apply the justification in (1) is in the <code>{1B, 3B; 2 out}</code> state. The team that’s on defense will lose if the player at <code>3B</code> advances one base. The reasoning goes, walking a player and changing the state to <code>{1B, 2B, 3B; 2 out}</code> won’t increase the probability that the player at <code>3B</code> will score and end the game if the current batter “puts the ball into play”, and putting another player on base increases the probability that the defense will be able to get an out.</p> <p>The hole in this reasoning is that the batter won’t necessarily put the ball into play. After the state is <code>{1B, 2B, 3B; 2 out}</code>, the pitcher may issue an unintentional walk, causing each runner to advance and losing the game. It turns out that being in this state doesn’t affect the probability of an unintentional walk very much. The pitcher tries very hard to avoid a walk but, at the same time, the batter tries very hard to induce a walk!</p> <p>On (2), the two situations where the justification tends to be applied are when the current player at bat is good or great, or the current player is batting just before the pitcher. Let’s look at these two separately.</p> <p>Barry Bonds’s seasons from 2001, 2002, and 2004 were some of the statistically best seasons of all time and are as extreme a case as one can find in modern baseball. If we run our same analysis and account for the quality of the players batting after Bonds, we find that it’s sometimes the correct decision for the opposing team to intentionally walk Bonds, but it was still the case that most situations do not warrant an intentional walk and that Bonds was often intentionally walked in a situation that didn’t warrant an intentional walk. In the case of a batter who is not having one of the statistically best seasons on record in modern baseball, intentional walks are even less good.</p> <p>In the case of the pitcher batting, doing the same kind of analysis as above also reveals that there are situations where an intentional walk is appropriate (not late in the game, {1B, 2B; 2 out}, when the pitcher is not a significantly above average batter for a pitcher). Even though it’s not always the wrong decision to issue an intentional walk, the intentional walk is still grossly overused.</p> <p>One might argue that the fact that our simple analysis has all of these non-idealities that could have invalidated it is a sign that decision making in baseball wasn’t so bad after all, but I don’t think that holds. A first-order approximation that someone could do in an hour or two finds that decision making seems quite bad, on average. If a team was interested in looking at data, that ought to lead them into doing a more detailed analysis that takes into account the conventional-wisdom based critiques of the obvious one-hour analysis. It appears that this wasn’t done, at least not for decades.</p> <p>The problem is that before people started running the data, all we had to go by were stories. Someone would say &quot;with 2 outs, you should walk the batter before the pitcher [in some situations] to get to the pitcher and get the guaranteed out&quot;.
Someone else might respond &quot;we obviously shouldn't do that late game because the pitcher will get subbed out for a pinch hitter, and early game, we shouldn't do it because even if it works and we get the easy out, it sets the other team up to lead off the next inning with their #1 hitter instead of an easy out&quot;. Which of these stories is the right story turns out to be an empirical question. The thing that I find most unfortunate is that, after people started running the numbers and the argument became one of stories vs. data, people persisted in sticking with the story-based argument for decades. We see the same thing in business and engineering, but it's arguably more excusable there because decisions in those areas tend to be harder to quantify. Even if you can reduce something to a simple engineering equation, someone can always argue that the engineering decision isn't what really matters and this other business concern that's hard to quantify is the most important thing.</p> <h3 id="appendix-possession">Appendix: possession</h3> <p>Something I find interesting is that statistical analysis in football, baseball, and basketball has found that teams have overwhelmingly undervalued possessions for decades. Baseball doesn't have the concept of possession per se, but if you look at being on offense as &quot;having possession&quot; and getting 3 outs as &quot;losing possession&quot;, it's quite similar.</p> <p>In football, we see that <a href="https://eml.berkeley.edu/~dromer/papers/JPE_April06.pdf">maintaining possession</a> is such a big deal that it is usually an error to punt on 4th down, but this hasn't stopped teams from punting by default basically forever. And in basketball, players who shoot a lot with a low shooting percentage were (and arguably still are) overrated.</p> <p>I don't think this is fundamental -- that possessions are as valuable as they are comes out of the rules of each game. It's arbitrary. I still find it interesting, though.</p> <h3 id="appendix-other-analysis-of-management-decisions">Appendix: other analysis of management decisions</h3> <p><a href="https://people.stanford.edu/nbloom/sites/default/files/dmm.pdf">Bloom et al., Does management matter? Evidence from India</a> looks at the impact of management interventions and the effect on productivity.</p> <p><a href="https://people.stanford.edu/nbloom/research">Other work by Bloom</a>.</p> <p><a href="http://web.stanford.edu/~gentzkow/research/uniform-pricing.pdf">DellaVigna et al., Uniform pricing in US retail chains</a> allegedly finds a significant amount of money left on the table by retail chains (seven percent of profits) and explores why that might happen and what the impacts are.</p> <p>The upside of work like this vs. sports work is that it attempts to quantify the impact of things outside of a contrived game. The downside is that the studies are on things that are quite messy and it's hard to tell what the study actually means. Just for example, if you look at studies on innovation, economists often use patents as a proxy for innovation and then come to some conclusion based on some variable vs. number of patents. But if you're familiar with engineering patents, you'll know that number of patents is an incredibly poor proxy for innovation.
In the hardware world, IBM is known for cranking out a very large number of useless patents (both in the sense of useless for innovation and also in the narrow sense of being useless as a counter-attack in patent lawsuits) and there are some companies that get much more mileage out of filing many fewer patents.</p> <p>AFAICT, our options here are to know a lot about decisions in a context that's arguably completely irrelevant, or to have ambiguous information and probably know very little about a context that seems relevant to the real world. I'd love to hear about more studies in either camp (or even better, studies that don't have either problem).</p> <p><i>Thanks to Leah Hanson, David Turner, Milosz Dan, Andrew Nichols, Justin Blank, @hoverbikes, Kate Murphy, Ben Kuhn, Patrick Collison, and an anonymous commenter for comments/corrections/discussion.</i></p> How out of date are Android devices? android-updates/ Sun, 12 Nov 2017 00:00:00 +0000 android-updates/ <p>It's common knowledge that Android devices tend to be more out of date than iOS devices, but what does this actually mean? Let’s look at Android marketshare data to see how old devices in the wild are. The x axis of the plot below is date, and the y axis is Android marketshare. The share of all devices sums to 100% (with some artifacts because the public data Google provides is low precision).</p> <p>Color indicates age:</p> <ul> <li>blue: current (API major version)</li> <li>yellow: 6 months</li> <li>orange: 1 year</li> <li>dark red: 2 years</li> <li>bright red/white: 3 years</li> <li>light grey: 4 years</li> <li>grey: 5 years</li> <li>black: 6 years or more</li> </ul> <p>If we look at the graph, we see a number of reverse-S shaped contours; between each pair of contours, devices get older as we go from left to right. Each contour corresponds to the release of a new Android version and the associated devices running that Android version. As time passes, devices on that version get older. When a device is upgraded, it’s effectively moved from one contour into a new contour and the color changes to a less outdated color.</p> <p><img src="images/android-updates/android-age.svg" width="1152" height="1152" alt="Marketshare of outdated Android devices is increasing"></p> <p>There are three major ways in which this graph understates the number of outdated devices:</p> <p>First, we’re using API version data for this and don’t have access to the marketshare of point releases and minor updates, so we assume that all devices on the same API version are up to date until the moment a new API version is released, but many (and perhaps most) devices won’t receive updates within an API version.</p> <p>Second, this graph shows marketshare, but the number of Android devices has dramatically increased over time. For example, if we look at the 80%-ile most outdated devices (i.e., draw a line 20% up from the bottom), the 80%-ile device today is a few months more outdated than it was in 2014. The huge growth of Android means that there are many, many more outdated devices now than there were in 2014.</p> <p>Third, this data comes from scraping Google Play Store marketshare info. That data shows marketshare of devices that have visited the Play Store in the last 7 days. In general, it seems reasonable to believe that devices that visit the Play Store are more up to date than devices that don’t, so we should expect an unknown amount of bias in this data that causes the graph to show that devices are newer than they actually are.
This seems plausible both for devices that are used as conventional mobile devices and for mobile devices that have replaced things like traditional embedded devices, PoS boxes, etc.</p> <p>If we're looking at this from a security standpoint, some devices will receive security updates without updating their major version, which would skew the data to make devices look more outdated than they actually are. However, when researchers have used more fine-grained data to see which devices are taking updates, <a href="https://www.cl.cam.ac.uk/~drt24/dissertation.pdf">they found that this was not a large effect</a>.</p> <p>One thing we can see from that graph is that, as time goes on, the world accumulates a larger fraction of old devices. This makes sense and we could have figured this out without looking at the data. After all, back at the beginning of 2010, Android phones couldn’t be much more than a year old, and now it’s possible to have Android devices that are nearly a decade old.</p> <p>Something that wouldn’t have been obvious without looking at the data is that the uptake of new versions seems to be slowing down -- we can see this by looking at the last few contour lines at the top right of the graph, corresponding to the most recent Android releases. These lines have a shallower slope than the contour lines for previous releases. Unfortunately, with this data alone, we can’t tell why the slope is shallower. Some possible reasons might be:</p> <ul> <li>Android growth is slowing down</li> <li>Android device turnover (device upgrade rate) is slowing down</li> <li>Fewer devices are receiving updates</li> </ul> <p>Without more data, it’s impossible to tell how much each of these is contributing to the problem. BTW, <a href="https://twitter.com/danluu/">let me know</a> if you know of a reasonable source for the active number of Android devices going back to 2010! I’d love to produce a companion graph of the total number of outdated devices.</p> <p>But even with the data we have, we can take a guess at how many outdated devices are in use. In May 2017, Google announced that there are over two billion active Android devices. If we look at the latest stats (the far right edge), we can see that nearly half of these devices are two years out of date. At this point, we should expect that there are more than one billion devices that are two years out of date! Given Android's update model, we should expect approximately 0% of those devices to ever get updated to a modern version of Android.</p> <h4 id="percentiles">Percentiles</h4> <p>Since there’s a lot going on in the graph, we might be able to see something if we look at some subparts of the graph. If we look at a single horizontal line across the graph, that corresponds to the device age at a certain percentile:</p> <p><img src="images/android-updates/android-percentile-age.svg" width="1152" height="1152" alt="Over time, the Nth percentile out of date device is getting more out of date"></p> <p>In this graph, the date is on the x axis and the age in months is on the y axis. Each line corresponds to a different percentile (higher percentile is older), which corresponds to a horizontal slice of the top graph at that percentile.</p> <p>Each individual line seems to have two large phases (with some other stuff, too). There’s one phase where devices for that percentile get older as quickly as time is passing, followed by a phase where, on average, devices only get slightly older.
In the second phase, devices sometimes get younger as new releases push younger versions into a certain percentile, but this doesn’t happen often enough to counteract the general aging of devices. Taken as a whole, this graph indicates that, if current trends continue, we should expect to see proportionally more old Android devices as time goes on, which is exactly what we’d expect from the first, busier, graph.</p> <h4 id="dates">Dates</h4> <p>Another way to look at the graph is to look at a vertical slice instead of a horizontal slice. In that case, each slice corresponds to looking at the ages of devices at one particular date:</p> <p><img src="images/android-updates/android-age-over-time.svg" width="1152" height="1152"></p> <p>In this plot, the x axis indicates the age percentile and the y axis indicates the raw age in months. Each line is one particular date, with older dates being lighter / yellower and newer dates being darker / greener.</p> <p>As with the other views of the same data, we can see that Android devices appear to be getting more out of date as time goes on. This graph would be too busy to read if we plotted data for all of the dates that are available, but we can see it as an animation:</p> <p><img src="images/android-updates/android-age.gif" width="480" height="480"></p> <h4 id="ios">iOS</h4> <p>For reference, iOS 11 was released two months ago and it now has just under 50% iOS marketshare despite November’s numbers coming before the release of the iPhone X (this is compared to &lt; 1% marketshare for the latest Android version, which was released in August). It’s overwhelmingly likely that, by the start of next year, iOS 11 will have more than 50% marketshare and there’s an outside chance that it will have 75% marketshare, i.e., it’s likely that the corresponding plot for iOS would have the 50%-ile (red) line in the second plot at age = 0 and it’s not implausible that the 75%-ile (orange) line would sometimes dip down to 0. As is the case with Android, there are some older devices that stubbornly refuse to update; iOS 9.3, released a bit over two years ago, sits at just a bit above 5% marketshare. This means that, in the iOS version of the plot, it’s plausible that we’d see the corresponding 99%-ile (green) line in the second plot at a bit over two years (half of what we see for the Android plot).</p> <h4 id="windows-xp">Windows XP</h4> <p>People sometimes compare Android to Windows XP because there are a large number of both in the wild and in both cases, most devices will not get security updates. However, this is tremendously unfair to Windows XP, which was released on 10/2001 and got security updates until 4/2014, twelve and a half years later. Additionally, Microsoft has released at least one security update after the official support period (there was an update in 5/2017 in response to the WannaCry ransomware). It's unfortunate that Microsoft decided to end support for XP while there are still so many XP boxes in the wild, but supporting an old OS for over twelve years and then issuing an emergency security patch after more than fifteen years puts Microsoft into a completely different league than Google and Apple when it comes to device support.</p> <p>Another difference between Android and Windows is that Android's scale is unprecedented in the desktop world. There were roughly 200 million PCs sold in 2017. Samsung alone has been selling that many mobile devices per year since 2008.
Of course, those weren't Android devices in 2008, but Android's dominance in the non-iOS mobile space means that, overall, those have mostly been Android devices. Today, <a href="https://twitter.com/danluu/status/864194210752344064">we still see nearly 50 year old PDP-11 devices in use</a>. There are few enough PDPs around that running into one is a cute, quaint surprise (0.6 million PDP-11s were sold). Desktop boxes age out of service more quickly than PDPs and mobile devices age out of service even more quickly, but the sheer difference in number of devices caused by the ubiquity of modern computing devices means that we're going to see many more XP-era PCs in use 50 years after the release of XP and it's plausible we'll see even more mobile devices around 50 years from now. Many of these ancient PDP, VAX, DOS, etc. boxes are basically safe because they're run in non-networked configurations, but it looks like the same thing is not going to be true for many of these old XP and Android boxes that are going to stay in service for decades.</p> <h3 id="conclusion">Conclusion</h3> <p>We’ve seen that Android devices appear to be getting more out of date over time. This makes it difficult for developers to target “new” Android API features, where new means anything introduced in the past few years. It also means that there are <em>a lot</em> of Android devices out there that are behind in terms of security. This is true both in absolute terms and also relative to iOS.</p> <p>Until recently, Android was directly tied to the hardware it ran on, making it very painful to keep old devices up to date because that requires a custom Android build with phone-specific (or at least SoC-specific) work. <a href="https://android-developers.googleblog.com/2017/05/here-comes-treble-modular-base-for.html">Google claims that this problem is fixed in the latest Android version (8.0, Oreo)</a>. People who remember Google's <a href="https://www.google.com/search?client=ubuntu&amp;hs=xVc&amp;channel=fs&amp;ei=laMJWombGJSvjwSmu5TQCA&amp;q=android+update+alliance&amp;oq=android+update+alliance&amp;gs_l=psy-ab.3..35i39k1l2.41466.41684.0.42014.2.2.0.0.0.0.111.193.1j1.2.0....0...1.1.64.psy-ab..0.2.192...0i22i30k1.0.SNY-JCWhD9M">&quot;Android update alliance&quot; announcement in 2011</a> may be a bit skeptical of the more recent announcement. In 2011, Google and U.S. carriers announced that they'd keep devices up to date for 18 months, <a href="https://www.pcmag.com/article2/0,2817,2397729,00.asp">which mostly didn't happen</a>. However, even if the current announcement isn't smoke and mirrors and the latest version of Android solves the update problem, we've seen that it takes years for Android releases to get adopted and we've also seen that the last few Android releases have significantly slower uptake than previous releases. Additionally, even though this is supposed to make updates easier, it looks like Android is still likely to stay behind iOS in terms of updates for a while. Google has promised that its latest phone (Pixel 2, 10/2017) will get updates for three years. That seems like a step in the right direction, but as we’ve seen from the graphs above, extending support by a year isn’t nearly enough to keep most Android devices up to date.
But if you have an iPhone, the latest version of iOS (released 9/2017) works on devices back to the iPhone 5S (released 9/2013).</p> <p>If we look at the newest Android release (8.0, 8/2017), it looks like you’re quite lucky if you have a two year old device that will get the latest update. The oldest “Google” phone supported is the Nexus 6P (9/2015), giving it just under two years of support.</p> <p>If you look back at devices that were released around the same time as the iPhone 5S, the situation looks even worse. Back then, I got a free Moto X for working at Google; the Moto X was about as close to an official Google phone as you could get at the time (this was back when Google owned Moto). The Moto X was released on 8/2013 (a month before the iPhone 5S) and the latest version of Android it supports is 5.1, which was released on 2/2015, a little more than a year and a half later. For an Android phone of its era, the Moto X was supported for an unusually long time. It's a good sign that things look worse as we look further back in time, but at the rate things are improving, it will be years before there's a decently supported Android device released and then years beyond those years before that Android version is in widespread use. It's possible that <a href="https://lwn.net/Articles/718267/">Fuchsia</a> will fix this, but Fuchsia is also many years away from widespread use.</p> <p><a href="input-lag/">In a future post, we'll look at Android response latency</a>, which is much higher than iPhone and iPad latency.</p> <p><i>Thanks to Leah Hanson, Kate Murphy, Daniel Thomas, Marek Majkowski, @zofrex, @Aissn, Chris Palmer, JonLuca De Caro, and an anonymous person for comments/corrections/related discussion.</i></p> <p><i>Also, thanks to Victorien Villard for making the data these graphs were based on available!</i></p> UI backwards compatibility ui-compatibility/ Thu, 09 Nov 2017 00:00:00 +0000 ui-compatibility/ <p>About once a month, an app that I regularly use will change its UI in a way that breaks muscle memory, basically tricking the user into doing things they don’t want.</p> <h3 id="zulip">Zulip</h3> <p>In recent memory, Zulip (a Slack competitor) changed its newline behavior so that <code>ctrl + enter</code> sends a message instead of inserting a new line. After this change, I sent a number of half-baked messages and it seemed like some other people did too.</p> <p>Around the time they made that change, they made another change such that a series of clicks that would cause you to send a private message to someone would instead cause you to send a private message to the alphabetically first person who was online. Most people didn’t notice that this was a change, but when I mentioned that this had happened to me a few times in the past couple weeks, multiple people immediately said that the exact same thing happened to them. Some people also mentioned that the behavior of navigation shortcut keys was changed in a way that could cause people to broadcast a message instead of sending a private message.
In both cases, some people blamed themselves and didn’t know why they’d just started making mistakes that caused them to send messages to the wrong place.</p> <h3 id="doors">Doors</h3> <p>A while back, I was at Black Seed Bagel, which has a door that <a href="https://www.google.com/search?tbm=isch&amp;source=hp&amp;biw=1739&amp;bih=1689&amp;ei=ADIFWq7SLcTPmwH8g6y4Dg&amp;q=black+seed+bagel&amp;oq=black+se&amp;gs_l=img.3.0.35i39k1l2j0l8.761.2250.0.3272.9.9.0.0.0.0.165.806.4j4.8.0....0...1.1.64.img..1.8.805.0...0.cicb-SDN7SU#imgrc=zwf_w60g8r1uWM:">looks 75% like a “push” door from both sides</a> when it’s actually a push door from the outside and a pull door from the inside. An additional clue that makes it seem even more like a &quot;push&quot; door from the inside is that most businesses have outward opening doors (this is required for exit doors in the U.S. when the room occupancy is above 50 and many businesses in smaller spaces voluntarily follow the same convention). During the course of an hour-long conversation, I saw a lot of people go in and out and my guess is that ten people failed on their first attempt to use the door while exiting. When people were travelling in pairs or groups, the person in front would often say something like “I’m dumb. We just used this door a minute ago”. But the people were not, in fact, acting dumb. If anything is dumb, it’s designing doors such that users have to memorize which doors act like “normal” doors and which doors have their cues reversed.</p> <p>If you’re interested in the physical world, <a href="https://www.amazon.com/gp/product/0465050654/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0465050654&amp;linkId=671c3ae6f7143212c71afe93c7c66951">The Design of Everyday Things</a> gives many real-world examples where users are subtly nudged into doing the wrong thing. It also discusses general principles in a way that allows you to see the general ideas and apply them to avoid the same issues when designing software.</p> <h3 id="facebook">Facebook</h3> <p>Last week, FB changed its interface so that my normal sequence of clicks to hide a story saves the story instead of hiding it. Saving is pretty much the opposite of hiding! It’s the opposite both from the perspective of the user and also as a ranking signal to the feed ranker. The really “great” thing about a change like this is that it A/B tests incredibly well if you measure new feature “engagement” by number of clicks because many users will accidentally save a story when they meant to hide it. Earlier this year, Twitter did something similar by swapping the location of “moments” and “notifications”.</p> <p>Even if the people making the change didn’t create the tricky interface in order to juice their engagement numbers, this kind of change is still problematic because it poisons analytics data. While it’s technically possible to build a model to separate out accidental clicks vs.
purposeful clicks, that’s quite rare (I don’t know of any A/B tests where people have done that) and even in cases where it’s clear that users are going to accidentally trigger an action, I still see devs and PMs justify a feature because of how great it looks on naive statistics like <a href="https://en.wikipedia.org/wiki/Daily_active_users">DAU</a>/MAU.</p> <h3 id="api-backwards-compatibility">API backwards compatibility</h3> <p>When it comes to software APIs, there’s a school of thought that says that you should never break backwards compatibility for some classes of widely used software. A well-known example is <a href="http://lkml.iu.edu/hypermail/linux/kernel/1710.3/02487.html">Linus Torvalds</a>:</p> <blockquote> <p>People should basically always feel like they can update their kernel and simply not have to worry about it.</p> <p>I refuse to introduce &quot;you can only update the kernel if you also update that other program&quot; kind of limitations. If the kernel used to work for you, the rule is that it continues to work for you. … I have seen, and can point to, lots of projects that go &quot;We need to break that use case in order to make progress&quot; or &quot;you relied on undocumented behavior, it sucks to be you&quot; or &quot;there's a better way to do what you want to do, and you have to change to that new better way&quot;, and I simply don't think that's acceptable outside of very early alpha releases that have experimental users that know what they signed up for. The kernel hasn't been in that situation for the last two decades. ... We do API breakage <em>inside</em> the kernel all the time. We will fix internal problems by saying &quot;you now need to do XYZ&quot;, but then it's about internal kernel API's, and the people who do that then also obviously have to fix up all the in-kernel users of that API. Nobody can say &quot;I now broke the API you used, and now <em>you</em> need to fix it up&quot;. Whoever broke something gets to fix it too. ... And we simply do not break user space.</p> </blockquote> <p><a href="https://blogs.msdn.microsoft.com/oldnewthing/20031223-00/?p=41373">Raymond Chen quoting Colen</a>:</p> <blockquote> <p>Look at the scenario from the customer’s standpoint. You bought programs X, Y and Z. You then upgraded to Windows XP. Your computer now crashes randomly, and program Z doesn’t work at all. You’re going to tell your friends, &quot;Don’t upgrade to Windows XP. It crashes randomly, and it’s not compatible with program Z.&quot; Are you going to debug your system to determine that program X is causing the crashes, and that program Z doesn’t work because it is using undocumented window messages? Of course not. You’re going to return the Windows XP box for a refund. (You bought programs X, Y, and Z some months ago. The 30-day return policy no longer applies to them. The only thing you can return is Windows XP.)</p> </blockquote> <p>While this school of thought is a minority, it’s a vocal minority with a lot of influence. It’s much rarer to hear this kind of case made for UI backwards compatibility. You might argue that this is fine -- people are forced to upgrade nowadays, so it doesn’t matter if stuff breaks. But even if users can’t escape, it’s still a bad user experience.</p> <p>The counterargument to this school of thought is that maintaining compatibility creates technical debt. It’s true! Just for example, Linux is full of slightly to moderately wonky APIs due to the “do not break user space” dictum. 
One example is <a href="http://man7.org/linux/man-pages/man2/recvmmsg.2.html"><code>int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags, struct timespec *timeout); </code></a>. You might expect the timeout to fire if you don’t receive a packet, but the manpage reads:</p> <blockquote> <p>The <em>timeout</em> argument points to a <em>struct timespec</em> (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (<em>but see BUGS!</em>).</p> </blockquote> <p>The BUGS section reads:</p> <blockquote> <p>The <em>timeout</em> argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to <em>vlen-1</em> datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.</p> </blockquote> <p>This is arguably not even the worst mis-feature of <code>recvmmsg</code>, <a href="http://www.openwall.com/lists/musl/2014/06/07/5">which returns an <code>ssize_t</code> into a field of size <code>int</code></a>.</p> <p>If you have a policy like “we simply do not break user space”, this sort of technical debt sticks around forever. But it seems to me that it’s not a coincidence that the most widely used desktop, laptop, and server operating systems in the world bend over backwards to maintain backwards compatibility.</p> <p>The case for UI backwards compatibility is arguably stronger than the case for API backwards compatibility because breaking API changes can be mechanically fixed and, <a href="//danluu.com/monorepo/">with the proper environment</a>, all callers can be fixed at the same time as the API changes. There's no equivalent way to reach into people's brains and change user habits, so a breaking UI change inevitably results in pain for some users.</p> <p>The case for UI backwards compatibility is arguably weaker than the case for API backwards compatibility because API backwards compatibility has a lower cost -- if some API is problematic, you can make a new API and then document the old API as something that shouldn’t be used (you’ll see lots of these if you look at Linux syscalls). This doesn’t really work with GUIs since UI elements compete with each other for a small amount of screen real-estate. An argument that I think is underrated is that changing UIs isn’t as great as most companies seem to think -- very dated-looking UIs that haven’t been refreshed to keep up with trends can be successful (e.g., plentyoffish and craigslist). Companies can even become wildly successful without any significant UI updates, let alone UI redesigns -- a large fraction of linkedin’s rocketship growth happened in a period where the UI was basically frozen. I’m told that freezing the UI wasn’t a deliberate design decision; instead, it was a side effect of severe technical debt, and that the UI was unfrozen the moment a re-write allowed people to confidently change the UI.
Linkedin has managed to <a href="https://darkpatterns.org/">add a lot of dark patterns</a> since they unfroze their front-end, but the previous UI seemed to work just fine in terms of growth.</p> <p>Despite the success of a number of UIs which aren’t always updated to track the latest trends, at most companies, it’s basically impossible to make the case that UIs shouldn’t be arbitrarily changed without adding functionality, let alone make the case that UIs shouldn’t push out old functionality with new functionality.</p> <h3 id="ui-deprecation">UI deprecation</h3> <p>A case that might be easier to make is that shortcuts and shortcut-like UI elements can be deprecated before removal, similar to the way evolving APIs will add deprecation warnings before making breaking changes. Instead of regularly changing UIs so that users’ muscle memory is used against them and causes users to do the opposite of what they want, UIs can be changed so that doing the previously trained set of actions causes nothing to happen. For example, FB could have moved “hide post” down and inserted a no-op item in the old location, and then after people had gotten used to not clicking in the old “hide post” location for “hide post”, they could have then put “save post” in the old location for “hide post”.</p> <p>Zulip could’ve done something similar and caused the series of actions that used to let you send a private message to the person you want to cause no message to be sent instead of sending a private message to the alphabetically first person on the <code>online</code> list.</p> <p>These solutions aren’t ideal because the user still has to retrain their muscle memory on the new thing, but it’s still a lot better than the current situation, where many UIs regularly introduce arbitrary-seeming changes that sow confusion and chaos.</p> <p>In some cases (e.g., the no-op menu item), this presents a pretty strange interface to new users. Users don’t expect to see a menu item that does nothing with an arrow that says to click elsewhere on the menu instead. This can be fixed by only rolling out deprecation “warnings” to users who regularly use the old shortcut or shortcut-like path. If there are multiple changes being deprecated, this results in a combinatorial explosion of possibilities, but if you're regularly deprecating multiple independent items, that's pretty extreme and users are probably going to be confused regardless of how it's handled. Given <a href="https://twitter.com/danluu/status/891508449414197248">the amount of effort made to avoid user-hostile changes</a> and the dominance of the “move fast and break things” mindset, the case for adding this kind of complexity just to avoid giving users a bad experience probably won’t hold at most companies, but this at least seems plausible in principle.</p> <p>Breaking existing user workflows arguably doesn’t matter for an app like FB, which is relatively sticky as a result of its dominance in its area, but most applications are more like Zulip than FB. Back when Zulip and Slack were both young, Zulip messages couldn’t be edited or deleted. This was on purpose -- messages were immutable and everyone I know who suggested allowing edits was shot down because mutable messages didn’t fit into the immutable model. Back then, if there was a UI change or bug that caused users to accidentally send a public message instead of a private message, that was basically permanent.
I saw people accidentally send public messages often enough that I got into the habit of moving private message conversations to another medium. That didn’t bother me too much since I’m used to <a href="everything-is-broken/">quirky software</a>, but I know people who tried Zulip back then and, to this day, still refuse to use Zulip due to UI issues they hit back then. That’s a bit of an extreme case, but the general idea that users will tend to avoid apps that repeatedly cause them pain isn’t much of a stretch.</p> <p>In studies on user retention, it appears to be the case that an additional 500ms of page-load latency negatively impacts retention. If that's the case, it seems like switching the UI around so that the user has to spend 5s undoing an action or broadcasts a private message publicly in a way that can't be undone should have a noticeable impact on retention, although I don't know of any public studies that look at this.</p> <h3 id="conclusion">Conclusion</h3> <p>If I worked on UI, I might have some suggestions or a call to action. But as an outsider, I’m wary of making actual suggestions -- programmers seem especially prone to coming into an area they’re not familiar with and telling experts how they should solve their problems. While this occasionally works, the most likely outcome is that the outsider either re-invents something that’s been known for decades or completely misses the most important parts of the problem.</p> <p>It sure would be nice if shortcuts didn’t break so often that I spend as much time consciously stopping myself from using shortcuts as I do actually using the app. But there are probably reasons this is difficult to test/enforce. The huge number of platforms that need to be tested for robust UI testing makes testing hard even without adding this extra kind of test. And, even when we’re talking about functional correctness problems, “move fast and break things” is much trendier than “try to break relatively few things”. Since UI “correctness” often has even lower priority than functional correctness, it’s not clear how someone could successfully make a case for spending more effort on it.</p> <p>On the other hand, despite all these disclaimers, Google sometimes does the exact things described in this post. Chrome recently removed <code>backspace</code> to go backwards; if you hit <code>backspace</code>, you get a note telling you to use <code>alt+left</code> instead, and when maps moved some items around a while back, they put in no-op placeholders that pointed people to the new location. This doesn't mean that Google always does this well -- on April fools day of 2016, gmail replaced <code>send and archive</code> with <code>send and attach a gif that's offensive in some contexts</code> -- but these examples indicate that maintaining backwards compatibility through significant changes isn't just a hypothetical idea; it can and has been done.</p> <p><i> Thanks to Leah Hanson, Allie Jones, Randall Koutnik, Kevin Lynagh, David Turner, Christian Ternus, Ted Unangst, Michael Bryc, Tony Finch, Stephen Tigner, Steven McCarthy, Julia Evans, @BaudDev, and an anonymous person who has a moral objection to public acknowledgements for comments/corrections/discussion.</p> <p>If you're curious why &quot;anon&quot; is against acknowledgements, it's because they first saw these in Paul Graham's writing, whose acknowledgements are sort of a who's who of SV. anon's belief is that these sorts of lists serve as a kind of signalling.
I won't claim that's wrong, but I get a lot of help with my writing both from people reading drafts and also from the occasional helpful public internet comment and I think it's important to make it clear that this isn't a one-person effort to combat <a href="https://www.bunniestudios.com/blog/?p=5046">what Bunnie Huang calls &quot;the idol effect&quot;</a>.</p> <p>In a future post, we'll look at empirical work on how line length affects readability. I've read every study I could find, but I might be missing some. If you know of a good study you think I should include, <a href="https://twitter.com/danluu/">please let me know</a>. </i></p> Filesystem error handling filesystem-errors/ Mon, 23 Oct 2017 00:00:00 +0000 filesystem-errors/ <p>We’re going to reproduce some <a href="//danluu.com/file-consistency/">results from papers on filesystem robustness that were written up roughly a decade ago</a>: <a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al. SOSP 05 paper</a>, which injected errors below the filesystem and <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. FAST 08</a>, which looked at how often filesystems failed to check return codes of functions that can return errors.</p> <p><a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al.</a> injected errors at the block device level (just underneath the filesystem) and found that <code>ext3</code>, <code>reiserfs</code>, <code>ntfs</code>, and <code>jfs</code> mostly handled read errors reasonably but <code>ext3</code>, <code>ntfs</code>, and <code>jfs</code> mostly ignored write errors. While the paper is interesting, someone installing Linux on a system today is much more likely to use <code>ext4</code> than any of the now-dated filesystems tested by Prabhakaran et al. We’ll try to reproduce some of the basic results from the paper on more modern filesystems like <code>ext4</code> and <code>btrfs</code>, some legacy filesystems like <code>exfat</code>, <code>ext3</code>, and <code>jfs</code>, as well as on <code>overlayfs</code>.</p> <p><a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. </a> found that errors weren’t checked most of the time. After we look at error injection on modern filesystems, we’ll look at how much (or little) filesystems have improved their error handling code.</p> <h3 id="error-injection">Error injection</h3> <p>A cartoon view of a file read might be: <code>pread syscall -&gt; OS generic filesystem code -&gt; filesystem specific code -&gt; block device code -&gt; device driver -&gt; device controller -&gt; disk</code>. Once the disk gets the request, it sends the data back up: <code>disk -&gt; device controller -&gt; device driver -&gt; block device code -&gt; filesystem specific code -&gt; OS generic filesystem code -&gt; pread</code>. We’re going to look at error injection at the block device level, right below the file system.</p> <p>Let’s look at what happened when we injected errors in 2017 vs. what Prabhakaran et al.
found in 2005.</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <tr><th rowspan="3"></td><th colspan="3">2005</td><th colspan="6">2017</td></tr> <tr> <th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th></tr> <tr><th colspan="6">file</td><th colspan="3">mmap</td></tr> <tr> <th>btrfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td></tr> <tr> <th>exfat</th><td colspan="3" bgcolor=gainsboro></td></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext3</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext4</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>fat</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>jfs</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>reiserfs</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td colspan="6" bgcolor=gainsboro></td></tr> <tr> <th>xfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> </table></p> <p>Each row shows results for one filesystem. <code>read</code> and <code>write</code> indicating reading and writing data, respectively, where the block device returns an error indicating that the operation failed. <code>silent</code> indicates a read failure (incorrect data) where the block device didn’t indicate an error. This could happen if there’s disk corruption, a transient read failure, or a transient write failure silently caused bad data to be written. <code>file</code> indicates that the operation was done on a file opened with <code>open</code> and <code>mmap</code> indicates that the test was done on a file mapped with <code>mmap</code>. <code>ignore</code> (red) indicates that the error was ignored, <code>prop</code> (yellow) indicates that the error was propagated and that the <code>pread</code> or <code>pwrite</code> syscall returned an error code, and <code>fix</code> (green) indicates that the error was corrected. 
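</p> <p>To make the &quot;prop&quot; case concrete, here’s a minimal sketch (not the test code actually used here -- that’s in the repo linked in the appendix -- and the file path below is made up) of how a propagated error looks to a userspace program: a failed read on a file opened with <code>open</code> comes back as <code>EIO</code> from <code>pread</code>, while a failed read through <code>mmap</code> is delivered as a <code>SIGBUS</code> signal.</p> <pre><code>/* Hypothetical illustration of how propagated block-device errors surface
   to userspace; the path below is made up. */
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;setjmp.h&gt;
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

static sigjmp_buf bus_jmp;
static void on_sigbus(int sig) { (void)sig; siglongjmp(bus_jmp, 1); }

int main(void)
{
    char buf[4096];
    int fd = open("/mnt/fserror_test/test.txt", O_RDONLY);
    if (fd &lt; 0) { perror("open"); return 1; }

    /* "prop" for a file read: pread returns -1 and errno is EIO. */
    if (pread(fd, buf, sizeof(buf), 0) &lt; 0 &amp;&amp; errno == EIO)
        puts("read error propagated (EIO)");

    /* "prop" for an mmap read: touching the bad page raises SIGBUS. */
    signal(SIGBUS, on_sigbus);
    char *p = mmap(NULL, sizeof(buf), PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        if (sigsetjmp(bus_jmp, 1) == 0)
            printf("first byte: %d\n", p[0]);
        else
            puts("mmap read error propagated (SIGBUS)");
        munmap(p, sizeof(buf));
    }
    close(fd);
    return 0;
}
</code></pre> <p>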
No errors were corrected. Grey entries indicate configurations that weren’t tested.</p> <p>From the table, we can see that, in 2005, <code>ext3</code> and <code>jfs</code> ignored write errors even when the block device indicated that the write failed and that things have improved, and that any filesystem you’re likely to use will correctly tell you that a write failed. <code>jfs</code> hasn’t improved, but <code>jfs</code> is now rarely used outside of legacy installations.</p> <p>No tested filesystem other than <code>btrfs</code> handled silent failures correctly. The other filesystems tested neither duplicate nor checksum data, making it impossible for them to detect silent failures. <code>zfs</code> would probably also handle silent failures correctly but wasn’t tested. <code>apfs</code>, despite post-dating <code>btrfs</code> and <code>zfs</code>, made the explicit decision to not checksum data and silently fail on silent block device errors. We’ll discuss this more later.</p> <p>In all cases tested where errors were propagated, file reads and writes returned <code>EIO</code> from <code>pread</code> or <code>pwrite</code>, respectively; <code>mmap</code> reads and writes caused the process to receive a <code>SIGBUS</code> signal.</p> <p>The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test. The table below tests the same thing, but with a 445 byte file instead of an 8k file. The choice of 445 was arbitrary.</p> <table> <tr><th rowspan="3"></td><th colspan="3">2005</td><th colspan="6">2017</td></tr> <tr> <th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th></tr> <tr><th colspan="6">file</td><th colspan="3">mmap</td></tr> <tr> <th>btrfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td></tr> <tr> <th>exfat</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext3</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext4</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>fat</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>jfs</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>reiserfs</th><td 
bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td colspan="6" bgcolor=gainsboro></td></tr> <tr> <th>xfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> </table> <p>In the small file test table, all the results are the same, except for <code>btrfs</code>, which returns correct data in every case tested. What’s happening here is that the filesystem was created on a rotational disk and, by default, <code>btrfs</code> duplicates filesystem metadata on rotational disks (it can be configured to do so on SSDs, but that’s not the default). Since the file was tiny, <code>btrfs</code> packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure.</p> <h4 id="overlay">Overlay</h4> <p><a href="https://en.wikipedia.org/wiki/OverlayFS">Overlayfs</a> allows one file system to be “overlaid” on another. <a href="https://github.com/torvalds/linux/commit/e9be9d5e76e34872f0c37d72e25bc27fe9e2c54c">As explained in the initial commit</a>, one use case might be to put an (upper) read-write directory tree on top of a (lower) read-only directory tree, where all modifications go to the upper, writable layer.</p> <p>Although not listed on the tables, we also tested every filesystem other than <code>fat</code> as the lower filesystem with overlayfs (ext4 was the upper filesystem for all tests). Every filesystem tested showed the same results when used as the bottom layer in <code>overlay</code> as when used alone. <code>fat</code> wasn’t tested because mounting <code>fat</code> resulted in a <code>filesystem not supported</code> error.</p> <h4 id="error-correction">Error correction</h4> <p><code>btrfs</code> doesn’t, by default, duplicate metadata on SSDs because the developers believe that redundancy wouldn’t provide protection against errors on SSD (which is the same reason <code>apfs</code> doesn’t have redundancy). SSDs do a kind of write coalescing, which is likely to cause writes which happen consecutively to fall into the same block. If that block has a total failure, the redundant copies would all be lost, so redundancy doesn’t provide as much protection against failure as it would on a rotational drive.</p> <p>I’m not sure that this means that redundancy wouldn’t help -- individual flash cells degrade with operation and lose charge as they age. SSDs have built-in <a href="https://en.wikipedia.org/wiki/Wear_leveling">wear-leveling</a> and <a href="https://en.wikipedia.org/wiki/Low-density_parity-check_code">error-correction</a> that’s designed to reduce the probability that a block returns bad data, but over time, some blocks will develop so many errors that the error-correction won’t be able to fix the error and the block will return bad data. In that case, a read should return some bad bits along with mostly good bits. AFAICT, the publicly available data on SSD error rates seems to line up with this view.</p> <h4 id="error-detection">Error detection</h4> <p>Relatedly, it appears that <a href="http://dtrace.org/blogs/ahl/2016/06/19/apfs-part5/"><code>apfs</code> doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”</a>.
Publicly available studies on SSD reliability have not found that there’s a model that doesn’t sometimes return bad data. It’s a common conception that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:</p> <blockquote> <p>The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.</p> </blockquote> <p>While there is one sense in which SSDs are more reliable than rotational disks, there’s also a sense in which they appear to be less reliable. It’s not impossible that Apple uses some kind of custom firmware on its drive that devotes more bits to error correction than you can get in publicly available disks, but even if that’s the case, you might plug a non-apple drive into your apple computer and want some kind of protection against data corruption.</p> <h3 id="internal-error-handling">Internal error handling</h3> <p>Now that we’ve reproduced some tests from Prabhakaran et al., we’re going to move on to <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al.</a>. Since the paper is fairly involved, we’re just going to look at one small part of the paper, the part where they examined three function calls, <code>filemap_fdatawait</code>, <code>filemap_fdatawrite</code>, and <code>sync_blockdev</code>, to see how often errors weren’t checked for these functions.</p> <p>Their justification for looking at these functions is given as:</p> <blockquote> <p>As discussed in Section 3.1, a function could return more than one error code at the same time, and checking only one of them suffices. However, if we know that a certain function only returns a single error code and yet the caller does not save the return value properly, then we would know that such call is really a flaw. To find real flaws in the file system code, we examined three important functions that we know only return single error codes: sync_blockdev, filemap_fdatawrite, and filemap_fdatawait. A file system that does not check the returned error codes from these functions would obviously let failures go unnoticed in the upper layers.</p> </blockquote> <p>Ignoring errors from these functions appears to have fairly serious consequences. The documentation for <code>filemap_fdatawait</code> says:</p> <blockquote> <p>filemap_fdatawait — wait for all under-writeback pages to complete ... Walk the list of under-writeback pages of the given address space and wait for all of them. Check error status of the address space and return it.
Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.</p> </blockquote> <p>The comment next to the code for <code>sync_blockdev</code> reads:</p> <blockquote> <p>Write out and wait upon all the dirty data associated with a block device via its mapping. Does not take the superblock lock.</p> </blockquote> <p>In both of these cases, it appears that ignoring the error code could mean that data would fail to get written to disk without notifying the writer that the data wasn’t actually written.</p> <p>Let’s look at how often calls to these functions didn’t completely ignore the error code:</p> <table> <thead> <tr> <th>fn</th> <th>2008</th> <th>'08 %</th> <th>2017</th> <th>'17 %</th> </tr> </thead> <tbody> <tr> <td>filemap_fdatawait</td> <td>7 / 29</td> <td>24</td> <td>12 / 17</td> <td>71</td> </tr> <tr> <td>filemap_fdatawrite</td> <td>17 / 47</td> <td>36</td> <td>13 / 22</td> <td>59</td> </tr> <tr> <td>sync_blockdev</td> <td>6 / 21</td> <td>29</td> <td>7 / 23</td> <td>30</td> </tr> </tbody> </table> <p>This table is for all code in linux under <code>fs</code>. Each row shows data for calls of one function. For each year, the leftmost cell shows the number of calls that do something with the return value over the total number of calls. The cell to the right shows the percentage of calls that do something with the return value. “Do something” is used very loosely here -- branching on the return value and then failing to handle the error in either branch, returning the return value and having the caller fail to handle the return value, as well as saving the return value and then ignoring it are all considered doing something for the purposes of this table.</p> <p>For example, Gunawi et al. noted that <code>cifs/transport.c</code> had</p> <pre><code>int SendReceive () {
    int rc;
    rc = cifs_sign_smb();
    // ...
    rc = smb_send();
}
</code></pre> <p>Although <code>cifs_sign_smb</code> returned an error code, it was never checked before being overwritten by <code>smb_send</code>, which counted as being used for our purposes even though the error wasn’t handled.</p> <p>Overall, the table appears to show that many more errors are handled now than were handled in 2008 when Gunawi et al. did their analysis, but it’s hard to say what this means from looking at the raw numbers because it might be ok for some errors not to be handled and different lines of code are executed with different probabilities.</p> <h3 id="conclusion">Conclusion</h3> <p>Filesystem error handling seems to have improved. Reporting an error on a <code>pwrite</code> if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, if there are no complicating factors, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs.</p> <p>Most filesystems don’t have checksums for data and leave error detection and correction up to userspace software. When I talk to server-side devs at big companies, their answer is usually something like “who cares? All of our file accesses go through a library that checksums things anyway and redundancy across machines and datacenters takes care of failures, so we only need error detection and not correction”.
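</p> <p>The kind of wrapper those devs are describing doesn’t have to be fancy. Here’s a minimal sketch -- not any particular company’s library; the record format and function names are made up, and it leans on zlib’s <code>crc32</code> -- that appends a checksum to each record on write and verifies it on read. As they say, this gives you detection, not correction: recovering the data is left to redundancy elsewhere.</p> <pre><code>/* Hypothetical userspace checksumming wrapper (illustration only).
   Record layout: [u32 length][payload][u32 crc32 of payload]. Link with -lz. */
#include &lt;stdint.h&gt;
#include &lt;unistd.h&gt;
#include &lt;zlib.h&gt;

int checked_write(int fd, const void *buf, uint32_t len)
{
    uint32_t crc = crc32(0L, buf, len);
    if (write(fd, &amp;len, sizeof len) != sizeof len) return -1;
    if (write(fd, buf, len) != (ssize_t)len) return -1;
    if (write(fd, &amp;crc, sizeof crc) != sizeof crc) return -1;
    return 0;
}

/* Returns the payload length, or -1 on a short read or checksum mismatch.
   A mismatch only tells you the data is bad; fixing it is up to a replica
   on some other disk or machine. */
ssize_t checked_read(int fd, void *buf, uint32_t max)
{
    uint32_t len, crc;
    if (read(fd, &amp;len, sizeof len) != sizeof len || len &gt; max) return -1;
    if (read(fd, buf, len) != (ssize_t)len) return -1;
    if (read(fd, &amp;crc, sizeof crc) != sizeof crc) return -1;
    return crc32(0L, buf, len) == crc ? (ssize_t)len : -1;
}
</code></pre> <p>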
While that’s true for developers at certain big companies, there’s a lot of software out there that isn’t written robustly and just assumes that filesystems and disks don’t have errors.</p> <p><i>This was a joint project with Wesley Aptekar-Cassels; the vast majority of the work for the project was done while pair programming at <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>. We also got </i>a lot<i> of help from Kate Murphy. Both Wesley (w.aptekar@gmail.com) and Kate (hello@kate.io) are looking for work. They’re great and I highly recommend talking to them if you’re hiring!</i></p> <h3 id="appendix-error-handling-in-c">Appendix: error handling in C</h3> <p>A fair amount of effort has been applied to get error handling right. But C makes it very easy to get things wrong, even when you apply a fair amount of effort and even apply extra tooling. One example of this in the code is the <code>submit_one_bio</code> function. If you look at the definition, you can see that it’s annotated with <code>__must_check</code>, which will cause a compiler warning when the result is ignored. But if you look at calls of <code>submit_one_bio</code>, you’ll see that its callers aren’t annotated and can ignore errors. If you dig around enough you’ll find one path of error propagation that looks like:</p> <pre><code>submit_one_bio
submit_extent_page
__extent_writepage
extent_write_full_page
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
__filemap_fdatawrite
filemap_fdatawrite
</code></pre> <p>Nine levels removed from <code>submit_one_bio</code>, we see our old friend, <code>filemap_fdatawrite</code>, which we know often doesn’t get checked for errors.</p> <p>There's a very old debate over how to prevent things like this from accidentally happening. One school of thought, which I'll call the Uncle Bob (UB) school, believes that <a href="https://twitter.com/danluu/status/916315156199636992">we can't fix these kinds of issues with tools or processes and simply need to be better programmers in order to avoid bugs</a>. You'll often hear people of the UB school say things like, &quot;you can't get rid of all bugs with better tools (or processes)&quot;. In his famous and well-regarded talk, <a href="https://github.com/matthiasn/talk-transcripts/blob/master/Hickey_Rich/SimpleMadeEasy.md">Simple Made Easy</a>, Rich Hickey says</p> <blockquote> <p>What's true of every bug found in the field?</p> <p>[Audience reply: Someone wrote it?] [Audience reply: It got written.]</p> <p>It got written. Yes. What's a more interesting fact about it? It passed the type checker.</p> <p>[Audience laughter]</p> <p>What else did it do?</p> <p>[Audience reply: (Indiscernible)]</p> <p>It passed all the tests. Okay. So now what do you do? Right? I think we're in this world I'd like to call guardrail programming. Right? It's really sad. We're like: I can make change because I have tests. Who does that? Who drives their car around banging against the guardrail saying, &quot;Whoa! I'm glad I've got these guardrails because I'd never make it to the show on time.&quot;</p> <p>[Audience laughter]</p> </blockquote> <p>If you watch the talk, Rich uses &quot;simplicity&quot; the way Uncle Bob uses &quot;discipline&quot;. The way these statements are used, they're roughly equivalent to Ken Thompson saying &quot;<a href="https://twitter.com/danluu/status/885214004649615360">Bugs are bugs. You write code with bugs because you do</a>&quot;.
The UB school throws tools and processes under the bus, saying that it's unsafe to rely solely on tools or processes.</p> <p>Rich's rhetorical trick is brilliant -- I've heard that line quoted tens of times since the talk to argue against tests or tools or types. But, like guardrails, most tools and processes aren't about eliminating all bugs; they're about reducing the severity or probability of bugs. If we look at this particular function call, we can see that a static analysis tool failed to find this bug. Does that mean that we should give up on static analysis tools? A static analysis tool could look for all calls of <code>submit_one_bio</code> and show you the cases where the error is propagated up N levels only to be dropped. Gunawi et al. did exactly that and found <em>a lot</em> of bugs. A person basically can't do the same thing without tooling. They could try, but people are lucky if they get 95% accuracy when manually digging through things like this. The sheer volume of code guarantees that a human doing this by hand would make mistakes.</p> <p>Even better than a static analysis tool would be a language that makes it harder to accidentally forget about checking for an error. One of the issues here is that it's sometimes valid to drop an error. There are a number of places where there's no interface that allows an error to get propagated out of the filesystem, making it correct to drop the error, modulo changing the interface. In the current situation, as an outsider reading the code, if you look at a bunch of calls that drop errors, it's very hard to say, for all of them, which of those is a bug and which of those is correct. If the default is that we have a kind of guardrail that says &quot;this error must be checked&quot;, people can still incorrectly ignore errors, but you at least get an annotation that the omission was on purpose. For example, if you're forced to specifically write code that indicates that you're ignoring an error, then in code that's intended to be robust, like filesystem code, code that drops an error on purpose is relatively likely to be accompanied by a comment explaining why the error was dropped.</p> <h3 id="appendix-why-wasn-t-this-done-earlier">Appendix: why wasn't this done earlier?</h3> <p>After all, it would be nice if we knew if modern filesystems could do basic tasks correctly. Filesystem developers probably know this stuff, but since I don't <a href="http://lkml.iu.edu/hypermail/linux/kernel/0908.3/01481.html">follow LKML</a>, I had no idea whether or not things had improved since 2005 until we ran the experiment.</p> <p>The papers we looked at here came out of Andrea and Remzi Arpaci-Dusseau's research lab. Remzi has a talk where he mentioned that grad students don't want to reproduce and update old work. That's entirely reasonable, given the incentives they face. And I don't mean to pick on academia here -- this work came out of academia, not industry. It's possible this kind of work simply wouldn't have happened if not for the academic incentive system.</p> <p>In general, it seems to be quite difficult to fund work on correctness. There are a fair number of papers on new ways to find bugs, but there's relatively little work on applying existing techniques to existing code. In academia, that seems to be hard to get a good publication out of; in the open source world, that seems to be less interesting to people than writing new code.
That's also entirely reasonable -- people should work on what they want, and even if they enjoy working on correctness, that's probably not a great career decision in general. I was at the <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> career fair the other night and my badge said I was interested in testing. The first person who chatted me up opened with &quot;do you work in QA?&quot;. Back when I worked in hardware, that wouldn't have been a red flag, but in software, &quot;QA&quot; is code for a low-skill, tedious, and poorly paid job. Much of industry considers testing and QA to be an afterthought. As a result, open source projects that companies rely on are often <a href="https://en.wikipedia.org/wiki/OpenSSL">woefully underfunded</a>. Google funds some great work (like afl-fuzz), but that's the exception and not the rule, even within Google, and most companies don't fund any open source work. The work in this post was done by a few people who are intentionally temporarily unemployed, which isn't really a scalable model.</p> <p>Occasionally, you'll see someone spend a lot of effort on improving correctness, but that's usually done as a massive amount of free labor. Kyle Kingsbury might be the canonical example of this -- my understanding is that he worked on the <a href="http://jepsen.io/">Jepsen distributed systems testing tool</a> on nights and weekends for years before turning that into a consulting business. It's great that he did that -- he showed that almost every open source distributed system had serious data loss or corruption bugs. I think that's great, but stories about heroic effort like that always worry me because heroism doesn't scale. If Kyle hadn't come along, would most of the bugs that he and his tool found still plague open source distributed systems today? That's a scary thought.</p> <p>If I knew how to fund more work on correctness, I'd try to convince you that we should switch to this new model, but I don't know of a funding model that works. I've set up a <a href="https://www.patreon.com/danluu">patreon (donation account)</a>, but it would be quite extraordinary if that was sufficient to actually fund a significant amount of work. If you look at how much programmers make off of donations, if I made two orders of magnitude less than I could if I took a job in industry, that would already put me in the top 1% of programmers on patreon. If I made one order of magnitude less than I'd make in industry, that would be extraordinary. Off the top of my head, the only programmers who make more than that off of patreon either make something with much broader appeal (like games) or are Evan You, who makes one of the most widely used front-end libraries in existence. And if I actually made as much as I can make in industry, I suspect that would make me the highest-grossing programmer on patreon, even though, by industry standards, my compensation hasn't been anything special.</p> <p>If I had to guess, I'd say that part of the reason it's hard to fund this kind of work is that consumers don't incentivize companies to fund this sort of work. If you look at &quot;big&quot; tech companies, two of them are substantially more serious about correctness than their competitors. This results in many fewer horror stories about lost emails and documents as well as lost entire accounts. If you look at the impact on consumers, it might be something like the difference between 1% of people seeing lost/corrupt emails vs. 0.001%.
I think that's pretty significant if you multiply that cost across all consumers, but the vast majority of consumers aren't going to make decisions based on that kind of difference. If you look at an area where correctness problems are much more apparent, like databases or backups, you'll find that even the worst solutions have defenders who will pop into any discussions and say &quot;works for me&quot;. A backup solution that works 90% of the time is quite bad, but if you have one that works 90% of the time, it will still have staunch defenders who drop into discussions to say things like &quot;I've restored from backup three times and it's never failed! You must be making stuff up!&quot;. I don't blame companies for rationally responding to consumers, but I do think that the result is unfortunate for consumers.</p> <p>Just as an aside, one of the great wonders of doing open work for free is that the more free work you do, the more people complain that you didn't do enough free work. As David MacIver has said, <a href="https://www.youtube.com/watch?v=2XCDm3MoXUY">doing open source work is like doing normal paid work, except that you get paid in complaints instead of cash</a>. It's basically guaranteed that the most common comment on this post, for all time, will be that we didn't test someone's pet filesystem because we're <code>btrfs</code> shills or just plain lazy, even though we include a link to a repo that lets anyone add tests as they please. Pretty much every time I've done any kind of free experimental work, people who obviously haven't read the experimental setup or the source code complain that the experiment couldn't possibly be right because of [thing that isn't true that anyone could see by looking at the setup] and that it's absolutely inexcusable that I didn't run the experiment on the exact pet thing they wanted to see. Having played video games competitively in the distant past, I'm used to much more intense internet trash talk, but in general, this incentive system seems to be backwards.</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>For the error injection tests, a high-level view of the experimental setup is that <a href="https://stackoverflow.com/q/1870696/334816"><code>dmsetup</code></a> was used to simulate bad blocks on the disk.</p> <p>A list of the commands run looks something like:</p> <pre><code>cp images/btrfs.img.gz /tmp/tmpeas9efr6.gz
gunzip -f /tmp/tmpeas9efr6.gz
losetup -f
losetup /dev/loop19 /tmp/tmpeas9efr6
blockdev --getsize /dev/loop19
# dmsetup table: everything maps straight through to the loop device,
# except sector 74078, which returns I/O errors
# 0 74078 linear /dev/loop19 0
# 74078 1 error
# 74079 160296 linear /dev/loop19 74079
dmsetup create fserror_test_1508727591.4736078
mount /dev/mapper/fserror_test_1508727591.4736078 /mnt/fserror_test_1508727591.4736078/
mount -t overlay -o lowerdir=/mnt/fserror_test_1508727591.4736078/,upperdir=/tmp/tmp4qpgdn7f,workdir=/tmp/tmp0jn83rlr overlay /tmp/tmpeuot7zgu/
./mmap_read /tmp/tmpeuot7zgu/test.txt
umount /tmp/tmpeuot7zgu/
rm -rf /tmp/tmp4qpgdn7f
rm -rf /tmp/tmp0jn83rlr
umount /mnt/fserror_test_1508727591.4736078/
dmsetup remove fserror_test_1508727591.4736078
losetup -d /dev/loop19
rm /tmp/tmpeas9efr6
</code></pre> <p><a href="https://github.com/danluu/fs-errors/">See this github repo for the exact set of commands run to execute tests</a>.</p> <p>Note that all of these tests were done on linux, so <code>fat</code> means the linux <code>fat</code> implementation, not the windows <code>fat</code> implementation.
<code>zfs</code> and <code>reiserfs</code> weren’t tested because they couldn’t be trivially tested in the exact same way that we tested other filesystems (one of us spent an hour or two trying to get <code>zfs</code> to work, but its configuration interface is inconsistent with all of the filesystems tested; <code>reiserfs</code> appears to have a consistent interface but testing it requires doing extra work for a filesystem that appears to be dead). <code>ext3</code> support is now provided by the <code>ext4</code> code, so what <code>ext3</code> means now is different from what it meant in 2005.</p> <p>All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines. All filesystems were configured with default settings. For <code>btrfs</code>, this meant duplicated metadata without duplicated data and, as far as we know, the settings wouldn't have made a difference for other filesystems.</p> <p>The second part of this doesn’t have much experimental setup to speak of. The setup was to grep the linux source code for the relevant functions.</p> <p><em>Thanks to Leah Hanson, David Wragg, Ben Kuhn, Wesley Aptekar-Cassels, Joel Borggrén-Franck, Yuri Vishnevsky, and Dan Puttick for comments/corrections on this post.</em></p> Keyboard latency keyboard-latency/ Mon, 16 Oct 2017 00:00:00 +0000 keyboard-latency/ <p>If you look at “gaming&quot; keyboards, a lot of them sell for $100 or more on the promise that they’re fast. Ad copy that you’ll see includes:</p> <ul> <li>a custom designed keycap that has been made shorter to reduce the time it takes for your actions to register</li> <li>8x FASTER - Polling Rate of 1000Hz: Response time 0.1 milliseconds</li> <li>Wield the ultimate performance advantage over your opponents with light operation 45g key switches and an actuation 40% faster than standard Cherry MX Red switches</li> <li>World's Fastest Ultra Polling 1000Hz</li> <li>World's Fastest Gaming Keyboard, 1000Hz Polling Rate, 0.001 Second Response Time</li> </ul> <p>Despite all of these claims, I can only find <a href="http://forums.blurbusters.com/viewtopic.php?f=10&amp;t=1836">one person who’s publicly benchmarked keyboard latency</a> and they only tested two keyboards. In general, my belief is that if someone makes performance claims without benchmarks, the claims probably aren’t true, just like how code that isn’t tested (or otherwise verified) should be assumed broken.</p> <p>The situation with gaming keyboards reminds me a lot of talking to car salesmen:</p> <blockquote> <p>Salesman: this car is super safe! It has 12 airbags! Me: that’s nice, but how does it fare in crash tests? 
Salesman: 12 airbags!</p> </blockquote> <p>Sure, gaming keyboards have 1000Hz polling, but so what?</p> <p>Two obvious questions are:</p> <ol> <li>Does keyboard latency matter?</li> <li>Are gaming keyboards actually quicker than other keyboards?</li> </ol> <h3 id="does-keyboard-latency-matter">Does keyboard latency matter?</h3> <p>A year ago, if you’d asked me if I was going to build a custom setup to measure keyboard latency, I would have said that’s silly, and yet here I am, measuring keyboard latency with a <a href="https://www.amazon.com/gp/product/B074TVSLN1/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B074TVSLN1&amp;linkId=9377ad585ac27e528ee655cc01b48cef">logic analyzer</a>.</p> <p>It all started because <a href="//danluu.com/input-lag/">I had this feeling that some old computers feel much more responsive than modern machines</a>. For example, an iMac G4 running macOS 9 or an Apple 2 both feel quicker than my <a href="https://www.amazon.com/gp/product/B01MXSI216/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01MXSI216&amp;linkId=ce7b5129c29c9c012b53e5402ae33a18">4.2 GHz Kaby Lake</a> system. I never trust feelings like this because there’s decades of research showing that users often have feelings that are the literal opposite of reality, so I got a high-speed camera and started measuring actual keypress-to-screen-update latency as well as mouse-move-to-screen-update latency. It turns out the machines that feel quick are actually quick, much quicker than my modern computer -- computers from the 70s and 80s commonly have keypress-to-screen-update latencies in the 30ms to 50ms range out of the box, whereas modern computers are often in the 100ms to 200ms range when you press a key in a terminal. It’s possible to get down to the 50ms range in well optimized games with a fancy gaming setup, and there’s one extraordinary consumer device that can easily get below 50ms, but the default experience is much slower. Modern computers have much better <a href="https://stackoverflow.com/a/39187441/334816">throughput, but their latency</a> isn’t so great.</p> <p>Anyway, at the time I did these measurements, my 4.2 GHz kaby lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency. But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has <a href="https://www.embeddedrelated.com/showthread/comp.arch.embedded/14362-1.php">80k transistors</a> running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s!
And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline as a fast machine from the 70s.</p> <p>Let’s look at the measured keypress-to-USB latency on some keyboards:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <div style="overflow-x:auto;"></p> <table> <thead> <tr> <th>keyboard</th> <th>latency<br>(ms)</th> <th>connection</th> <th>gaming</th> </tr> </thead> <tbody> <tr> <td><a href="https://www.amazon.com/gp/product/B016QO64FI/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B016QO64FI&amp;linkId=37ca409628c850d85d4b8649699cae9b">apple magic</a> (usb)</td> <td>15</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B0000U1DJ2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B0000U1DJ2&amp;linkId=20d58842cee7c22d03e405c59cb1d2b9">hhkb lite 2</a></td> <td>20</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B004XMAL00/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B004XMAL00&amp;linkId=82939c8d37d45d1a0a8a77d22eb3ac77">MS natural 4000</a></td> <td>20</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B008PFABI8/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B008PFABI8&amp;linkId=3d1945a4fa3cfb25d6f7de0278fbbdfc">das</a> 3</td> <td>25</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B0086G2YE0/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B0086G2YE0&amp;linkId=0772fce67ce00f03760bbd53b179e355">logitech k120</a></td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01M29PYF4/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01M29PYF4&amp;linkId=d885df4dca893dd7deaf36c9d0dd3213">unicomp model M</a></td> <td>30</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00OFM51L2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00OFM51L2&amp;linkId=96bf16aaba80391dd38bfbb763d02bcc">pok3r vortex</a></td> <td>30</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B004WOF7S0/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B004WOF7S0&amp;linkId=3dc1bc2fd4fa06be77f90a73f4992911">filco majestouch</a></td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td>dell OEM</td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td>powerspec OEM</td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00CMALD3E/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00CMALD3E&amp;linkId=8873c99f91b35e17f61d8510fbb50a39">kinesis freestyle 2</a></td> <td>30</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a 
href="https://www.amazon.com/gp/product/B01D2TSGL0/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01D2TSGL0&amp;linkId=966405c7866aff4b1e24e1dbfacd5035">chinfai silicone</a></td> <td>35</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01LVTI3TO/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01LVTI3TO&amp;linkId=b1f13ca07d80d302103bb0c313a994d9">razer ornata chroma</a></td> <td>35</td> <td>USB FS</td> <td>Yes</td> </tr> <tr> <td><a href="https://olkb.com/planck/">olkb planck rev 4</a></td> <td>40</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01CJ4LJ68/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01CJ4LJ68&amp;linkId=7ce482c03631b402385c5b0bb5fca221">ergodox</a></td> <td>40</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B014W20C90/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B014W20C90&amp;linkId=a72b89b476d857d4df4389bb7995bc8a">MS comfort 5000</a></td> <td>40</td> <td>wireless</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01DBJTZU2/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01DBJTZU2&amp;linkId=fb0a73e976666bebd9c1e6a5657e83b9">easterntimes i500</a></td> <td>50</td> <td>USB FS</td> <td>Yes</td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01K32MU2A/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01K32MU2A&amp;linkId=99b7e5c9a826b61a0f30f43210e21d38">kinesis advantage</a></td> <td>50</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B003YGVDDU/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B003YGVDDU&amp;linkId=c8c51873d132176213b864de76f31dbe">genius luxemate i200</a></td> <td>55</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00DGJALYW/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00DGJALYW&amp;linkId=96a1e65f5e4dee9f5d20306048c1dd0c">topre type heaven</a></td> <td>55</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00L1Y11D4/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00L1Y11D4&amp;linkId=8a84bfe004ab161c183e26c35f5966c3">logitech k360</a></td> <td>60</td> <td>&quot;unifying&quot;</td> <td></td> </tr> </tbody> </table> <p>The latency measurements are the time from when the key starts moving to the time when the <a href="https://docs.mbed.com/docs/ble-hid/en/latest/api/md_doc_HID.html">USB packet associated with the key</a> makes it out onto the USB bus. Numbers are rounded to the nearest <code>5 ms</code> in order to avoid giving a false sense of precision. The <code>easterntimes i500</code> is also sold as the <code>tomoko MMC023</code>.</p> <p>The connection column indicates the connection used. <code>USB FS</code> stands for the <code>usb full speed</code> protocol, which allows up to 1000Hz polling, a feature commonly advertised by high-end keyboards. 
<code>USB</code> is the <code>usb low speed</code> protocol, which is the protocol most keyboards use. The ‘gaming’ column indicates whether or not the keyboard is branded as a gaming keyboard. <code>wireless</code> indicates some kind of keyboard-specific dongle and <code>unifying</code> is logitech's wireless device standard.</p> <p>We can see that, even with the limited set of keyboards tested, there can be as much as a 45ms difference in latency between keyboards. Moreover, a modern computer with one of the slower keyboards attached can’t possibly be as responsive as a quick machine from the 70s or 80s because the keyboard alone is slower than the entire response pipeline of some older computers.</p> <p>That establishes the fact that modern keyboards contribute to the latency bloat we’ve seen over the past forty years. The other half of the question is, does the latency added by a modern keyboard actually make a difference to users? From looking at the table, we can see that among the keyboards tested, we can get up to a 45ms difference in average latency. Is 45ms of latency noticeable? Let’s take a look at the empirical research on how much latency users notice.</p> <p>There’s a fair amount of empirical evidence on this and we can see that, for very simple tasks, <a href="https://pdfs.semanticscholar.org/386a/15fd85c162b8e4ebb6023acdce9df2bd43ee.pdf">people can perceive latencies down to 2ms or less</a>. Moreover, increasing latency is not only noticeable to users, <a href="http://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf">it causes users to execute simple tasks less accurately</a>. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, <a href="https://www.youtube.com/watch?v=vOvQCPLkPt4">check out this MSR demo on touchscreen latency</a>.</p> <h3 id="are-gaming-keyboards-faster-than-other-keyboards">Are gaming keyboards faster than other keyboards?</h3> <p>I’d really like to test more keyboards before making a strong claim, but from the preliminary tests here, it appears that gaming keyboards aren’t generally faster than non-gaming keyboards.</p> <p>Gaming keyboards often claim to have features that reduce latency, like connecting over USB FS and using 1000Hz polling. The USB low speed spec states that the minimum time between packets is <code>10ms</code>, or 100 Hz. However, it’s common to see USB devices round this down to the nearest power of two and run at <code>8ms</code>, or 125Hz. With <code>8ms</code> polling, the average latency added from having to wait until the next polling interval is <code>4ms</code>. With <code>1ms</code> polling, the average latency from USB polling is <code>0.5ms</code>, giving us a <code>3.5ms</code> delta. While that might be a significant contribution to latency for a quick keyboard like the Apple magic keyboard, it’s clear that other factors dominate keyboard latency for most keyboards and that the gaming keyboards tested here are so slow that shaving off <code>3.5ms</code> won’t save them.</p> <p>Another thing to note about gaming keyboards is that they often advertise &quot;n-key rollover&quot; (the ability to have n simultaneous keys pressed at once — for many key combinations, typical keyboards will often only let you press two keys at once, excluding modifier keys). 
Although n-key rollover wasn't generally tested here, I tried a &quot;Razer DeathStalker Expert Gaming Keyboard&quot; that advertises &quot;Anti-ghosting capability for up to 10 simultaneous key presses&quot;. The Razer gaming keyboard did not have this capability in a useful manner and many combinations of three keys didn't work. Their advertising claim could, I suppose, be technically true in that 3 in some cases could be &quot;up to 10&quot;, but like gaming keyboards claiming to have lower latency due to 1000 Hz polling, the claim is highly misleading at best.</p> <h3 id="conclusion">Conclusion</h3> <p>Most keyboards add enough latency to make the user experience noticeably worse, and keyboards that advertise speed aren’t necessarily faster. The two gaming keyboards we measured weren’t faster than non-gaming keyboards, and the fastest keyboard measured was a minimalist keyboard from Apple that’s marketed more on design than speed.</p> <p>Previously, we've seen that <a href="//danluu.com/term-latency/">terminals can add significant latency, up to 100ms in mildly pessimistic conditions if you choose the &quot;right&quot; terminal</a>. In a future post, we'll look at the entire end-to-end pipeline to see other places latency has crept in and we'll also look at how some modern devices keep latency down.</p> <h3 id="appendix-where-is-the-latency-coming-from">Appendix: where is the latency coming from?</h3> <p>A major source of latency is key travel time. It’s not a coincidence that the quickest keyboard measured also has the shortest key travel distance by a large margin. The video setup I’m using to measure end-to-end latency is a 240 fps camera, which means that frames are 4ms apart. When videoing &quot;normal&quot; keypresses and typing, it takes 4-8 frames for a key to become fully depressed. Most switches will start firing before the key is fully depressed, but the key travel time is still significant and can easily add <code>10ms</code> of delay (or more, depending on the switch mechanism). Contrast this to the Apple &quot;magic&quot; keyboard measured, where the key travel is so short that it can’t be captured with a 240 fps camera, indicating that the key travel time is &lt; 4ms.</p> <p>Note that, unlike the other measurement I was able to find online, this measurement was from the start of the keypress instead of the switch activation. This is because, as a human, you don't activate the switch, you press the key. A measurement that starts from switch activation time misses this large component of latency. If, for example, you're playing a game and you switch from moving forward to moving backwards when you see something happen, you have to pay the cost of the key movement, which is different for different keyboards. A common response to this is that &quot;real&quot; gamers will preload keys so that they don't have to pay the key travel cost, but if you go around with a high speed camera and look at how people actually use their keyboards, the fraction of keypresses that are significantly preloaded is basically zero even when you look at gamers. It's possible you'd see something different if you look at high-level competitive gamers, but even then, just for example, people who use a standard wasd or esdf layout will typically not preload a key when going from back to forward. Also, the idea that it's fine that keys have a bunch of useless travel because you can pre-depress the key before really pressing the key is just absurd. 
That's like saying latency on modern computers is fine because some people build gaming boxes that, when run with unusually well optimized software, get 50ms response time. Normal, non-hardcore-gaming users simply aren't going to do this. Since that's the vast majority of the market, even if all &quot;serious&quot; gamers did this, that would still be a rounding error.</p> <p>The other large sources of latency are scanning the <a href="http://blog.komar.be/how-to-make-a-keyboard-the-matrix/">keyboard matrix</a> and debouncing. Neither of these delays is inherent -- keyboards use a matrix that has to be scanned instead of having a wire per-key because it saves a few bucks, and most keyboards scan the matrix at such a slow rate that it induces human-noticeable delays because that saves a few bucks, but a manufacturer willing to spend a bit more on manufacturing a keyboard could make the delay from that far below the threshold of human perception. See below for debouncing delay.</p> <p>Although we didn't discuss throughput in this post, when I measure my typing speed, I find that I can type faster with the <a href="https://www.amazon.com/gp/product/B01NABDNPH/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01NABDNPH&amp;linkId=ccde207372bd2239204e1ebc3a36a9cd">low-travel Apple keyboard</a> than with any of the other keyboards. There's no way to do a blinded experiment for this, but Gary Bernhardt and others have also observed the same thing. Some people claim that key travel doesn't matter for typing speed because they use the minimum amount of travel necessary and that this therefore can't matter, but as with the above claims on keypresses, if you walk around with a high speed camera and observe what actually happens when people type, it's very hard to find someone who actually does this.</p> <h3 id="2022-update">2022 update</h3> <p>When I ran these experiments, it didn't seem that anyone was testing latency across multiple keyboards. I found the results I got so unintuitive that I tried to find anyone else's keyboard latency measurements and all I could find was a forum post from someone who tried to measure their keyboard (just one) and got results in the same range, but using a setup that wasn't fast enough to really measure the latency properly. I also video'd my test as well as non-test keypresses with a high-speed camera to see how much time it took to depress keys, and the results weren't obviously inconsistent with the results reported here.</p> <p>Starting a year or two after I wrote the post, I witnessed some discussions from some gaming mouse and keyboard makers on how to make lower latency devices and they started releasing devices that actually have lower latency, as opposed to the devices they had, which basically had gaming skins and would often light up.</p> <p>If you want a low-latency keyboard that isn't the Apple keyboard (quite a few people I've talked to report finger pain after using the Apple keyboard for an extended period of time), the <a href="https://amzn.to/3AYD7FG">SteelSeries Apex Pro</a> is fairly low latency; for a mouse, the <a href="https://amzn.to/3gPLoVl">Corsair Sabre</a> is also pretty quick.</p> <p>Another change since then is that more people understand that debouncing doesn't have to add noticeable latency. When I wrote the original post, I had multiple keyboard makers explain to me that the post is wrong and it's impossible to not add latency when debouncing. 
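(For reference, here's a minimal sketch of how debouncing can avoid adding latency -- an illustrative, firmware-style loop written for this post, not code from any real keyboard: report the state change as soon as an edge is seen, and only use the debounce window to ignore the bounce that follows.)</p> <pre><code>DEBOUNCE_MS = 5  # hypothetical debounce window

def scan_loop(read_raw_key, send_event, now_ms):
    # read_raw_key, send_event, and now_ms are hypothetical hooks for the raw
    # switch state, the USB report path, and a millisecond clock.
    last_reported = read_raw_key()  # debounced state the host knows about
    ignore_until = 0                # end of the current debounce window
    while True:
        t = now_ms()
        raw = read_raw_key()
        if raw != last_reported and t &gt;= ignore_until:
            send_event(raw)                 # report immediately, on the edge
            last_reported = raw
            ignore_until = t + DEBOUNCE_MS  # then ignore bounce for a few ms
</code></pre> <p>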
I found that very odd since I'd expect a freshman EE or, for that matter, a high school kid who plays with electronics, to understand why that's not the case but, for whatever reason, multiple people who made keyboards for a living didn't understand this. Now, how to debounce without adding latency has become common knowledge and, when I see discussions where someone says debouncing must add a lot of latency, they usually get corrected. This knowledge has spread to most keyboard makers and reduced keyboard latency for some new keyboards, although I know there's still at least one keyboard maker that doesn't believe that you can debounce with low latency and they still add quite a bit of latency to their new keyboards as a result.</p> <h3 id="appendix-counter-arguments-to-common-arguments-that-latency-doesn-t-matter">Appendix: counter-arguments to common arguments that latency doesn’t matter</h3> <p>Before writing this up, I read what I could find about latency and it was hard to find non-specialist articles or comment sections that didn’t have at least one of the arguments listed below:</p> <h4 id="computers-and-devices-are-fast">Computers and devices are fast</h4> <p>The most common response to questions about latency is that input latency is basically zero, or so close to zero that it’s a rounding error. For example, two of the top comments on <a href="https://hardware.slashdot.org/story/13/07/13/1331235/ask-slashdot-low-latency-ps2usb-gaming-keyboards" rel="nofollow">this slashdot post asking about keyboard latency</a> are that keyboards are so fast that keyboard speed doesn’t matter. One person even says</p> <blockquote> <p>There is not a single modern keyboard that has 50ms latency. You (humans) have that sort of latency.</p> <p>As far as response times, all you need to do is increase the poll time on the USB stack</p> </blockquote> <p>As we’ve seen, some devices do have latencies in the 50ms range. This quote as well as other comments in the thread illustrate another common fallacy -- that input devices are limited by the speed of the USB polling. While that’s technically possible, most devices are nowhere near being fast enough to be limited by USB polling latency.</p> <p>Unfortunately, most online explanations of input latency <a href="https://byuu.org/articles/latency/" rel="nofollow">assume that the USB bus is the limiting factor</a>.</p> <h4 id="humans-can-t-notice-100ms-or-200ms-latency">Humans can’t notice 100ms or 200ms latency</h4> <p><a href="https://stackoverflow.com/a/39985088/334816" rel="nofollow">Here’s a &quot;cognitive neuroscientist who studies visual perception and cognition&quot;</a> who refers to the fact that human reaction time is roughly 200ms, and then throws in a bunch more scientific mumbo jumbo to say that no one could really notice latencies below 100ms. This is a little unusual in that the commenter claims some kind of special authority and uses a lot of terminology, but it’s common to hear people claim that you can’t notice 50ms or 100ms of latency because human reaction time is 200ms. This doesn’t actually make sense because these are independent quantities. 
This line of argument is like saying that you wouldn’t notice a flight being delayed by an hour because the duration of the flight is six hours.</p> <p>Another problem with this line of reasoning is that the full pipeline from keypress to screen update is quite long and if you say that it’s always fine to add 10ms here and 10ms there, you end up with a much larger amount of bloat through the entire pipeline, which is how we got where we are today, where you can buy a system with the CPU that gives you the fastest single-threaded performance money can buy and get 6x the latency of a machine from the 70s.</p> <h4 id="it-doesn-t-matter-because-the-game-loop-runs-at-60-hz">It doesn’t matter because the game loop runs at 60 Hz</h4> <p>This is fundamentally the same fallacy as above. If you have a delay that’s half the duration of a clock period, there’s a 50% chance the delay will push the event into the next processing step. That’s better than a 100% chance, but it’s not clear to me why people think that you’d need a delay as long as the clock period for the delay to matter. And for reference, the <code>45ms</code> delta between the slowest and fastest keyboard measured here corresponds to 2.7 frames at 60fps.</p> <h4 id="keyboards-can-t-possibly-response-faster-more-quickly-than-5ms-10ms-20ms-due-to-debouncing-http-www-eng-utah-edu-cs5780-debouncing-pdf">Keyboards can’t possibly respond more quickly than 5ms/10ms/20ms due to <a href="http://www.eng.utah.edu/~cs5780/debouncing.pdf">debouncing</a></h4> <p>Even without going through contortions to optimize the switch mechanism, if you’re willing to put hysteresis into the system, there’s no reason that the keyboard can’t assume a keypress (or release) is happening the moment it sees an edge. This is commonly done for other types of systems and AFAICT there’s no reason keyboards couldn’t do the same thing (and perhaps some do). The debounce time might limit the repeat rate of the key, but there’s no inherent reason that it has to affect the latency. And if we're looking at the repeat rate, imagine we have a 5ms limit on the rate of change of the key state due to introducing hysteresis. That gives us one full keypress cycle (press and release) every 10ms, or 100 keypresses per second per key, which is well beyond the capacity of any human. You might argue that this introduces a kind of imprecision, which might matter in some applications (music, rhythm games), but that's limited by the switch mechanism. Using a debouncing mechanism with hysteresis doesn't make us any worse off than we were before.</p> <p>An additional problem with debounce delay is that most keyboard manufacturers seem to have confounded scan rate and debounce delay. It's common to see keyboards with scan rates in the 100 Hz to 200 Hz range. This is justified by statements like &quot;there's no point in scanning faster because the debounce delay is 5ms&quot;, which combines two fallacies mentioned above. If you pull out the schematics for the Apple 2e, you can see that the scan rate is roughly 50 kHz. Its debounce time is roughly 6ms, which corresponds to a frequency of 167 Hz. Why scan so quickly? 
The fast scan allows the keyboard controller to start the clock on the debounce time almost immediately (after at most 20 microseconds), as opposed to a modern keyboard that scans at 167 Hz, which might not start the clock on debouncing for 6ms, or after 300x as much time.</p> <p>Apologies for not explaining terminology here, but I think that anyone making this objection should understand the explanation :-).</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>The USB measurement setup was a <a href="https://www.amazon.com/gp/product/B000E5CYW8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B000E5CYW8&amp;linkId=4e14a96314e989f8166fe183a0a9b0ee">USB cable</a> that was cut open to expose the wires, connected to a logic analyzer. The exact model of logic analyzer doesn’t really matter, but if you’re curious about the details, this set of experiments used a <a href="https://www.amazon.com/gp/product/B074TWM5WX/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B074TWM5WX&amp;linkId=562b98ee01041c876c27311abdff418c">Saleae Pro</a>. While the exact model of USB cable doesn't matter, if you want to reproduce the experiment, I recommend using a <a href="https://www.amazon.com/gp/product/B000E5CYW8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B000E5CYW8&amp;linkId=4e14a96314e989f8166fe183a0a9b0ee">relatively short USB cable</a>. Cutting open the cable damages the signal integrity and I found that, with a very long cable, some keyboards that weakly drive the data lines didn't drive them strongly enough to get a good signal with the cheap logic analyzer I used.</p> <p>The start-of-input was measured by pressing two keys at once -- one key on the keyboard and a button that was also connected to the logic analyzer. This introduces some jitter as the two buttons won’t be pressed at exactly the same time. To calibrate the setup, we used two identical buttons connected to the logic analyzer. The median jitter was &lt; 1ms and the 90%-ile jitter was roughly 5ms. This is enough that tail latency measurements for quick keyboards aren’t really possible with this setup, but average latency measurements like the ones done here seem like they should be ok. The input jitter could probably be reduced to a negligible level by building a device to both trigger the logic analyzer and press a key on the keyboard under test at the same time. Average latency measurements would also get better with such a setup (because it would be easier to run a large number of measurements).</p> <p>If you want to know the exact setup, a <a href="https://www.digikey.com/product-detail/en/e-switch/LL1105AF065Q/LL1105AF065Q-ND/3777946">E-switch LL1105AF065Q</a> switch was used. Power and ground were supplied by <a href="https://www.amazon.com/gp/product/B008GRTSV6/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B008GRTSV6&amp;linkId=51afaa1d76d1e6ef7c1e6bc0eeebdf9d">an arduino board</a>. There’s no particular reason to use this setup. In fact, it’s a bit absurd to use an entire arduino to provide power, but this was done with spare parts that were lying around and this stuff just happened to be stuff that <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> had in their lab, with the exception of the switches. 
There weren’t two identical copies of any switch, so we bought a few switches so we could do calibration measurements with two identical switches. The exact type of switch isn’t important here; any low-resistance switch would do.</p> <p>Tests were done by pressing the <code>z</code> key and then looking for byte 29 on the USB bus and then marking the end of the first packet containing the appropriate information. But, as above, any key would do.</p> <p>I don't actually trust this setup and I'd like to build a completely automated setup before testing more keyboards. While the measurements are in line with the one other keyboard measurement I could find online, this setup has an inherent imprecision that's probably in the 1ms to 10ms range. While averaging across multiple measurements reduces that imprecision, since the measurements are done by a human, it's not guaranteed and perhaps not even likely that the errors are independent and will average out.</p> <p>This project was done with help from Wesley Aptekar-Cassels, Leah Hanson, and Kate Murphy.</p> <p>Thanks to <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, Ahmad Jarara, Raph Levien, Peter Bhat Harkins, Brennan Chesley, Dan Bentley, Kate Murphy, Christian Ternus, Sophie Haskins, and Dan Puttick for letting us use their keyboards for testing.</p> <p>Thanks to Leah Hanson, Mark Feeney, Greg Kennedy, and Zach Allaun for comments/corrections/discussion on this post. </i></p> Branch prediction branch-prediction/ Wed, 23 Aug 2017 00:00:00 +0000 branch-prediction/ <p><em>This is a pseudo-transcript for a talk on branch prediction given at Two Sigma on 8/22/2017 to kick off &quot;localhost&quot;, a talk series organized by <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>.</em></p> <p>How many of you use branches in your code? Could you please raise your hand if you use if statements or pattern matching?</p> <p><code>Most of the audience raises their hands</code></p> <p>I won’t ask you to raise your hands for this next part, but my guess is that if I asked, how many of you feel like you have a good understanding of what your CPU does when it executes a branch and what the performance implications are, and how many of you feel like you could understand a modern paper on branch prediction, fewer people would raise their hands.</p> <p>The purpose of this talk is to explain how and why CPUs do “branch prediction” and then explain enough about classic branch prediction algorithms that you could read a modern paper on branch prediction and basically know what’s going on.</p> <p>Before we talk about branch prediction, let’s talk about why CPUs do branch prediction. To do that, we’ll need to know a bit about how CPUs work.</p> <p>For the purposes of this talk, you can think of your computer as a CPU plus some memory. The instructions live in memory and the CPU executes a sequence of instructions from memory, where instructions are things like “add two numbers”, “move a chunk of data from memory to the processor”. Normally, after executing one instruction, the CPU will execute the instruction that’s at the next sequential address. However, there are instructions called “branches” that let you change the address the next instruction comes from.</p> <p>Here’s an abstract diagram of a CPU executing some instructions. 
The x-axis is time and the y-axis distinguishes different instructions.</p> <p><img src="images/branch-prediction/1-pipeline.png" alt="Instructions executing sequentially" height="130" width="1262"></p> <p>Here, we execute instruction <code>A</code>, followed by instruction <code>B</code>, followed by instruction <code>C</code>, followed by instruction <code>D</code>.</p> <p>One way you might design a CPU is to have the CPU do all of the work for one instruction, then move on to the next instruction, do all of the work for the next instruction, and so on. There’s nothing wrong with this; a lot of older CPUs did this, and some modern very low-cost CPUs still do this. But if you want to make a faster CPU, you might make a CPU that works like an assembly line. That is, you break the CPU up into two parts, so that half the CPU can do the “front half” of the work for an instruction while half the CPU works on the “back half” of the work for an instruction, like an assembly line. This is typically called a pipelined CPU.</p> <p><img src="images/branch-prediction/2-pipeline.png" alt="Instructions with overlapping execution" height="275" width="1263"></p> <p>If you do this, the execution might look something like the above. After the first half of instruction A is complete, the CPU can work on the second half of instruction A while the first half of instruction B runs. And when the second half of A finishes, the CPU can start on both the second half of B and the first half of C. In this diagram, you can see that the pipelined CPU can execute twice as many instructions per unit time as the unpipelined CPU above.</p> <p>There’s no reason that a CPU has to be broken up into only two parts. We could break the CPU into three parts, and get a 3x speedup, or four parts and get a 4x speedup. This isn’t strictly true, and we generally get less than a 3x speedup for a three-stage pipeline or 4x speedup for a 4-stage pipeline because there’s overhead in breaking the CPU up into more parts and having a deeper pipeline.</p> <p>One source of overhead is how branches are handled. One of the first things the CPU has to do for an instruction is to get the instruction; to do that, it has to know where the instruction is. For example, consider the following code:</p> <pre><code>if (x == 0) {
    // Do stuff
} else {
    // Do other stuff (things)
}
// Whatever happens later
</code></pre> <p>This might turn into assembly that looks something like</p> <pre><code>branch_if_not_equal x, 0, else_label
// Do stuff
goto end_label
else_label:
// Do things
end_label:
// whatever happens later
</code></pre> <p>In this example, we compare <code>x</code> to 0. If they’re not equal (<code>if_not_equal</code>), we branch to <code>else_label</code> and execute the code in the else block. If that comparison fails (i.e., if <code>x</code> is 0), we fall through, execute the code in the <code>if</code> block, and then jump to <code>end_label</code> in order to avoid executing the code in the <code>else</code> block.</p> <p>The particular sequence of instructions that’s problematic for pipelining is</p> <pre><code>branch_if_not_equal x, 0, else_label
???
</code></pre> <p>The CPU doesn’t know if this is going to be</p> <pre><code>branch_if_not_equal x, 0, else_label
// Do stuff
</code></pre> <p>or</p> <pre><code>branch_if_not_equal x, 0, else_label
// Do things
</code></pre> <p>until the branch has finished (or nearly finished) executing. 
Since one of the first things the CPU needs to do for an instruction is to get the instruction from memory, and we don’t know which instruction <code>???</code> is going to be, we can’t even start on <code>???</code> until the previous instruction is nearly finished.</p> <p>Earlier, when we said that we’d get a 3x speedup for a 3-stage pipeline or a 20x speedup for a 20-stage pipeline, that assumed that you could start a new instruction every cycle, but in this case the two instructions are nearly serialized.</p> <p><img src="images/branch-prediction/3-branch-stall.png" alt="non-overlapping execution due to branch stall" height="72" width="640"></p> <p>One way around this problem is to use branch prediction. When a branch shows up, the CPU will guess if the branch was taken or not taken.</p> <p><img src="images/branch-prediction/4-branch-speculation.png" alt="speculating about a branch result" height="69" width="326"></p> <p>In this case, the CPU predicts that the branch won’t be taken and starts executing the first half of <code>stuff</code> while it’s executing the second half of the branch. If the prediction is correct, the CPU will execute the second half of <code>stuff</code> and can start another instruction while it’s executing the second half of <code>stuff</code>, like we saw in the first pipeline diagram.</p> <p><img src="images/branch-prediction/5-branch-correct-prediction.png" alt="overlapped execution after a correct prediction" height="72" width="482"></p> <p>If the prediction is wrong, when the branch finishes executing, the CPU will throw away the result from <code>stuff.1</code> and start executing the correct instructions instead of the wrong instructions. Since we would’ve stalled the processor and not executed any instructions if we didn’t have branch prediction, we’re no worse off than we would’ve been had we not made a prediction (at least at the level of detail we’re looking at).</p> <p><img src="images/branch-prediction/6-branch-misprediction.png" alt="aborted prediction" height="101" width="640"></p> <p>What’s the performance impact of doing this? To make an estimate, we’ll need a performance model and a workload. For the purposes of this talk, our cartoon model of a CPU will be a pipelined CPU where non-branches take an average of one cycle, unpredicted or mispredicted branches take 20 cycles, and correctly predicted branches take one cycle.</p> <p>If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 20% branches, and 80% other operations. Without branch prediction, we then expect the “average” instruction to take <code>branch_pct * 20 + non_branch_pct * 1 = 0.2 * 20 + 0.8 * 1 = 4 + 0.8 = 4.8 cycles</code>. With perfect, 100% accurate, branch prediction, we’d expect the average instruction to take 0.8 * 1 + 0.2 * 1 = 1 cycle, a 4.8x speedup! Another way to look at it is that if we have a pipeline with a 20-cycle branch misprediction penalty, we have nearly a 5x overhead from our ideal pipelining speedup just from branches alone.</p> <p>Let’s see what we can do about this. We’ll start with the most naive things someone might do and work our way up to something better.</p> <h3 id="predict-taken">Predict taken</h3> <p>Instead of predicting randomly, we could look at all branches in the execution of all programs. If we do this, we’ll see that taken and not-taken branches aren’t exactly balanced -- there are substantially more taken branches than not-taken branches. 
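(As an aside, the cycles-per-instruction numbers quoted for each scheme in the rest of this talk come from the cartoon cost model above; here's a tiny helper, using the 20% branch fraction and 20-cycle misprediction penalty we just assumed, in case you want to check the numbers as we go.)</p> <pre><code># Cartoon cost model from above: non-branches and correctly predicted branches
# cost 1 cycle, mispredicted (or unpredicted) branches cost 20 cycles.
BRANCH_FRACTION = 0.2
MISPREDICT_PENALTY = 20

def cycles_per_instruction(prediction_accuracy):
    non_branches = (1 - BRANCH_FRACTION) * 1
    predicted = BRANCH_FRACTION * prediction_accuracy * 1
    mispredicted = BRANCH_FRACTION * (1 - prediction_accuracy) * MISPREDICT_PENALTY
    return non_branches + predicted + mispredicted

# cycles_per_instruction(0.0) gives 4.8, cycles_per_instruction(1.0) gives 1.0,
# and cycles_per_instruction(0.7) gives 2.14 (the predict-taken scheme below).
</code></pre> <p>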
One reason there are more taken than not-taken branches is that loop branches are often taken.</p> <p>If we predict that every branch is taken, we might get 70% accuracy, which means we’ll pay the misprediction cost for 30% of branches, making the cost of an average instruction <code>(0.8 + 0.7 * 0.2) * 1 + 0.3 * 0.2 * 20 = 0.94 + 1.2 = 2.14</code>. If we compare always predicting taken to no prediction and perfect prediction, always predicting taken gets a large fraction of the benefit of perfect prediction despite being a very simple algorithm.</p> <p><img src="images/branch-prediction/7-always-taken-cpi.png" alt="2.14 cycles per instruction" height="156" width="1147"></p> <h3 id="backwards-taken-forwards-not-taken-btfnt">Backwards taken forwards not taken (BTFNT)</h3> <p>Predicting branches as taken works well for loops, but not so great for all branches. If we look at whether or not branches are taken based on whether or not the branch is forward (skips over code) or backwards (goes back to previous code), we can see that backwards branches are taken more often than forward branches, so we could try a predictor which predicts that backward branches are taken and forward branches aren’t taken (BTFNT). If we implement this scheme in hardware, compiler writers will conspire with us to arrange code such that branches the compiler thinks will be taken will be backwards branches and branches the compiler thinks won’t be taken will be forward branches.</p> <p>If we do this, we might get something like 80% prediction accuracy, making our cost function <code>(0.8 + 0.8 * 0.2) * 1 + 0.2 * 0.2 * 20 = 0.96 + 0.8 = 1.76</code> cycles per instruction.</p> <p><img src="images/branch-prediction/8-btfnt-cpi.png" alt="1.76 cycles per instruction" height="157" width="1147"></p> <h4 id="used-by">Used by</h4> <ul> <li>PPC 601 (1993): also uses compiler generated branch hints</li> <li>PPC 603</li> </ul> <h3 id="one-bit">One-bit</h3> <p>So far, we’ve looked at schemes that don’t store any state, i.e., schemes where the prediction ignores the program’s execution history. These are called <em>static</em> branch prediction schemes in the literature. These schemes have the advantage of being simple but they have the disadvantage of being bad at predicting branches whose behavior changes over time. If you want an example of a branch whose behavior changes over time, you might imagine some code like</p> <pre><code>if (flag) {
    // things
}
</code></pre> <p>Over the course of the program, we might have one phase of the program where the flag is set and the branch is taken and another phase of the program where flag isn’t set and the branch isn’t taken. There’s no way for a static scheme to make good predictions for a branch like that, so let’s consider <em>dynamic</em> branch prediction schemes, where the prediction can change based on the program history.</p> <p>One of the simplest things we might do is to make a prediction based on the last result of the branch, i.e., we predict taken if the branch was taken last time and we predict not taken if the branch wasn’t taken last time.</p> <p>Since having one bit for every possible branch is too many bits to feasibly store, we’ll keep a table of some number of branches we’ve seen and their last results. 
For this talk, let’s store <code>not taken</code> as <code>0</code> and <code>taken</code> as <code>1</code>.</p> <p><img src="images/branch-prediction/9-1bit.png" alt="prediction table with 1-bit entries indexed by low bits of branch address" height="420" width="1358"></p> <p>In this case, just to make things fit on a diagram, we have a 64-entry table, which means that we can index into the table with 6 bits, so we index into the table with the low 6 bits of the branch address. After we execute a branch, we update the entry in the prediction table (highlighted below) and the next time the branch is executed again, we index into the same entry and use the updated value for the prediction.</p> <p><img src="images/branch-prediction/10-1bit-update.png" alt="indexed entry changes on update" height="420" width="1358"></p> <p>It’s possible that we’ll observe aliasing and two branches at two different locations will map to the same table entry. This isn’t ideal, but there’s a tradeoff between table speed &amp; cost vs. size that effectively limits the size of the table.</p> <p>If we use a one-bit scheme, we might get 85% accuracy, a cost of <code>(0.8 + 0.85 * 0.2) * 1 + 0.15 * 0.2 * 20 = 0.97 + 0.6 = 1.57</code> cycles per instruction.</p> <p><img src="images/branch-prediction/11-1bit-cpi.png" alt="1.57 cycles per instruction" height="201" width="1147"></p> <h4 id="used-by-1">Used by</h4> <ul> <li>DEC EV4 (1992)</li> <li>MIPS R8000 (1994)</li> </ul> <h3 id="two-bit-https-courses-cs-washington-edu-courses-cse590g-04sp-smith-1981-a-study-of-branch-prediction-strategies-pdf"><a href="https://courses.cs.washington.edu/courses/cse590g/04sp/Smith-1981-A-Study-of-Branch-Prediction-Strategies.pdf">Two-bit</a></h3> <p>A one-bit scheme works fine for patterns like <code>TTTTTTTT…</code> or <code>NNNNNNN…</code> but will have a misprediction for a stream of branches that’s mostly taken but has one branch that’s not taken, <code>...TTTNTTT...</code> This can be fixed by adding a second bit for each address and implementing a saturating counter. Let’s arbitrarily say that we count down when a branch is not taken and count up when it’s taken. If we look at the binary values, we’ll then end up with:</p> <pre><code>00: predict Not
01: predict Not
10: predict Taken
11: predict Taken
</code></pre> <p>The “saturating” part of saturating counter means that if we count down from <code>00</code>, instead of underflowing, we stay at <code>00</code>, and similar for counting up from <code>11</code> staying at <code>11</code>. 
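(In code, a sketch of this scheme might look something like the following -- the table size and the use of the low bits of the branch address as the index mirror the diagrams in this section and aren't anything fundamental.)</p> <pre><code># Sketch of a two-bit saturating-counter predictor (illustrative, not any real
# CPU's design). Each entry is a counter in 0..3; 0-1 predict not taken,
# 2-3 predict taken.
TABLE_SIZE = 64              # 64 entries, indexed by 6 bits of branch address
table = [1] * TABLE_SIZE     # arbitrary initial state: weakly not taken

def index(branch_address):
    return branch_address % TABLE_SIZE        # low bits of the branch address

def predict(branch_address):
    return table[index(branch_address)] &gt;= 2  # True means predict taken

def update(branch_address, taken):
    i = index(branch_address)
    if taken:
        table[i] = min(table[i] + 1, 3)       # count up, saturating at 11
    else:
        table[i] = max(table[i] - 1, 0)       # count down, saturating at 00
</code></pre> <p>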
This scheme is identical to the one-bit scheme, except that each entry in the prediction table is two bits instead of one bit.</p> <p><img src="images/branch-prediction/12-2bit.png" alt="same as 1-bit, except that the table has 2 bits" height="481" width="1371"></p> <p>Compared to a one-bit scheme, a two-bit scheme can have half as many entries at the same size/cost (if we only consider the cost of storage and ignore the cost of the logic for the saturating counter), but even so, for most reasonable table sizes a two-bit scheme provides better accuracy.</p> <p>Despite being simple, this works quite well, and we might expect to see something like 90% accuracy for a two bit predictor, which gives us a cost of 1.38 cycles per instruction.</p> <p><img src="images/branch-prediction/13-2bit-cpi.png" alt="1.38 cycles per instruction" height="234" width="1147"></p> <p>One natural thing to do would be to generalize the scheme to an n-bit saturating counter, but it turns out that adding more bits has a relatively small effect on accuracy. We haven’t really discussed the cost of the branch predictor, but going from 2 bits to 3 bits per branch increases the table size by 1.5x for little gain, which makes it not worth the cost in most cases. The simplest and most common things that we won’t predict well with a two-bit scheme are patterns like <code>NTNTNTNTNT...</code> or <code>NNTNNTNNT…</code>, but going to n-bits won’t let us predict those patterns well either!</p> <h4 id="used-by-2">Used by</h4> <ul> <li>LLNL S-1 (1977)</li> <li>CDC Cyber? (early 80s)</li> <li>Burroughs B4900 (1982): state stored in instruction stream; hardware would over-write instruction to update branch state</li> <li>Intel Pentium (1993)</li> <li>PPC 604 (1994)</li> <li>DEC EV45 (1993)</li> <li>DEC EV5 (1995)</li> <li>PA 8000 (1996): actually a 3-bit shift register with majority vote</li> </ul> <h3 id="two-level-adaptive-global-http-www-seas-upenn-edu-cis501-papers-two-level-branch-pred-pdf-1991"><a href="http://www.seas.upenn.edu/~cis501/papers/two-level-branch-pred.pdf">Two-level adaptive, global</a> (1991)</h3> <p>If we think about code like</p> <pre><code>for (int i = 0; i &lt; 3; ++i) { // code here. } </code></pre> <p>That code will produce a pattern of branches like <code>TTTNTTTNTTTN...</code>.</p> <p>If we know the last three executions of the branch, we should be able to predict the next execution of the branch:</p> <pre><code>TTT:N TTN:T TNT:T NTT:T </code></pre> <p>The previous schemes we’ve considered use the branch address to index into a table that tells us if the branch is, according to recent history, more likely to be taken or not taken. That tells us which direction the branch is biased towards, but it can’t tell us that we’re in the middle of a repetitive pattern. To fix that, we’ll store the history of the most recent branches as well as a table of predictions.</p> <p><img src="images/branch-prediction/14-global.png" alt="Use global branch history and branch address to index into prediction table" height="481" width="1371"></p> <p>In this example, we concatenate 4 bits of branch history together with 2 bits of branch address to index into the prediction table. As before, the prediction comes from a 2-bit saturating counter. We don’t want to only use the branch history to index into our prediction table since, if we did that, any two branches with the same history would alias to the same table entry. 
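(As a sketch, the only changes relative to the two-bit predictor sketched above are a global history register and how the table index is formed -- here with 4 bits of history and 2 bits of branch address, as in the diagram.)</p> <pre><code># Sketch of the index calculation for the two-level adaptive (global) scheme:
# concatenate the global branch history with low bits of the branch address.
HISTORY_BITS = 4
ADDRESS_BITS = 2
history = 0   # last 4 branch outcomes; the newest outcome shifts in on the right

def index(branch_address):
    low_bits = branch_address % (1 &lt;&lt; ADDRESS_BITS)
    return (history &lt;&lt; ADDRESS_BITS) | low_bits   # 6-bit index into the table

def update_history(taken):
    global history
    history = ((history &lt;&lt; 1) | int(taken)) % (1 &lt;&lt; HISTORY_BITS)

# predict() and update() work exactly as in the saturating-counter sketch above,
# just using this index, and update_history() is called after every branch.
</code></pre> <p>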
In a real predictor, we’d probably have a larger table and use more bits of branch address, but in order to fit the table on a slide, we have an index that’s only 6 bits long.</p> <p>Below, we’ll see what gets updated when we execute a branch.</p> <p><img src="images/branch-prediction/15-global-update.png" alt="Update changes index because index uses bits from branch history" height="481" width="1371"></p> <p>The bolded parts are the parts that were updated. In this diagram, we shift new bits of branch history in from right to left, updating the branch history. Because the branch history is updated, the low bits of the index into the prediction table are updated, so the next time we take the same branch again, we’ll use a different entry in the table to make the prediction, unlike in previous schemes where the index is fixed by the branch address. The old entry’s value is updated so that the next time we take the same branch again with the same branch history, we’ll have the updated prediction.</p> <p>Since the history in this scheme is global, this will correctly predict patterns like <code>NTNTNTNT…</code> in inner loops, but may not always make correct predictions for higher-level branches because the history is global and will be contaminated with information from other branches. However, the tradeoff here is that keeping a global history is cheaper than keeping a table of local histories. Additionally, using a global history lets us correctly predict correlated branches. For example, we might have something like:</p> <pre><code>if x &gt; 0:
    x -= 1
if y &gt; 0:
    y -= 1
if x * y &gt; 0:
    foo()
</code></pre> <p>If either the first branch or the next branch isn’t taken, then the third branch definitely will not be taken (assuming <code>x</code> and <code>y</code> aren’t negative).</p> <p>With this scheme, we might get 93% accuracy, giving us 1.27 cycles per instruction.</p> <p><img src="images/branch-prediction/16-global-cpi.png" alt="1.27 cycles per instruction" height="290" width="1147"></p> <h4 id="used-by-3">Used by</h4> <ul> <li>Pentium MMX (1996): 4-bit global branch history</li> </ul> <h3 id="two-level-adaptive-local-http-www-cse-iitd-ac-in-srsarangi-col-718-2017-papers-branchpred-alternative-impl-pdf-1992"><a href="http://www.cse.iitd.ac.in/~srsarangi/col_718_2017/papers/branchpred/alternative-impl.pdf">Two-level adaptive, local</a> [1992]</h3> <p>As mentioned above, an issue with the global history scheme is that the branch history for local branches that could be predicted cleanly gets contaminated by other branches.</p> <p>One way to get good local predictions is to keep separate branch histories for separate branches.</p> <p><img src="images/branch-prediction/17-local.png" alt="keep a table of per-branch histories instead of a global history" height="505" width="1369"></p> <p>Instead of keeping a single global history, we keep a table of local histories, indexed by the branch address. This scheme is identical to the global scheme we just looked at, except that we keep multiple branch histories. 
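(Sketching just the delta from the global scheme: the single history register becomes a small table of per-branch histories, indexed by the branch address, and everything else stays the same; the table sizes here are made up for illustration.)</p> <pre><code># Sketch of the local-history variant: the history used to form the index is
# looked up per branch instead of being shared by all branches.
HISTORY_TABLE_SIZE = 16
HISTORY_BITS = 4
histories = [0] * HISTORY_TABLE_SIZE

def local_history(branch_address):
    return histories[branch_address % HISTORY_TABLE_SIZE]

def update_local_history(branch_address, taken):
    i = branch_address % HISTORY_TABLE_SIZE
    histories[i] = ((histories[i] &lt;&lt; 1) | int(taken)) % (1 &lt;&lt; HISTORY_BITS)
</code></pre> <p>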
One way to think about this is that having global history is a special case of local history, where the number of histories we keep track of is <code>1</code>.</p> <p>With this scheme, we might get something like 94% accuracy, which gives us a cost of 1.23 cycles per instruction.</p> <p><img src="images/branch-prediction/18-local-cpi.png" alt="1.23 cycles per instruction" height="362" width="1147"></p> <h4 id="used-by-4">Used by</h4> <ul> <li>Pentium Pro (1996): <a href="http://www.ece.uah.edu/~milenka/docs/milenkovic_WDDD02.pdf">4 bit local branch history, low bits of PC used for index</a>. Note that this is under some dispute and Agner Fog claims that the PPro and follow-on processors use 4-bit global history</li> <li>Pentium II (1997): same as PPro</li> <li>Pentium III (1999): same as PPro</li> </ul> <h3 id="gshare">gshare</h3> <p>One tradeoff a global two-level scheme has to make is that, for a prediction table of a fixed size, bits must be dedicated to either the branch history or the branch address. We’d like to give more bits to the branch history because that allows correlations across greater “distance” as well as tracking more complicated patterns and we’d like to give more bits to the branch address to avoid interference between unrelated branches.</p> <p>We can try to get the best of both worlds by hashing both the branch history and the branch address instead of concatenating them. One of the simplest reasonable things one might do, and the first proposed mechanism, was to <a href="https://en.wikipedia.org/wiki/XOR_gate"><code>xor</code></a> them together. This two-level adaptive scheme, where we <code>xor</code> the bits together, is called <code>gshare</code>.</p> <p><img src="images/branch-prediction/19-gshare.png" alt="hash branch address and branch history instead of appending" height="488" width="1371"></p> <p>With this scheme, we might see something like 94% accuracy. That’s the accuracy we got from the local scheme we just looked at, but gshare avoids having to keep a large table of local histories; getting the same accuracy while having to track less state is a significant improvement.</p> <h4 id="used-by-5">Used by</h4> <ul> <li>MIPS R12000 (1998): <a href="https://courses.cs.washington.edu/courses/csep548/06au/lectures/branchPred.pdf">2K entries, 11 bits of PC, 8 bits of history</a></li> <li>UltraSPARC-3 (2001): <a href="https://courses.cs.washington.edu/courses/csep548/06au/lectures/branchPred.pdf">16K entries, 14 bits of PC, 12 bits of history</a></li> </ul> <h3 id="agree-http-meseec-ce-rit-edu-eecc722-fall2006-papers-branch-prediction-5-agree-isca24-pdf-1997"><a href="http://meseec.ce.rit.edu/eecc722-fall2006/papers/branch-prediction/5/agree_isca24.pdf">agree</a> (1997)</h3> <p>One reason for branch mispredictions is interference between different branches that alias to the same location. There are many ways to reduce interference between branches that alias to the same predictor table entry. In fact, the reason this talk doesn’t get past schemes invented in the 90s is that a wide variety of interference-reducing schemes were proposed and there are too many to cover in half an hour.</p> <p>We’ll look at one scheme which might give you an idea of what an interference-reducing scheme could look like, the “agree” predictor. When two branch-history pairs collide, the predictions either match or they don’t. If they match, we’ll call that neutral interference and if they don’t, we’ll call that negative interference. 
The idea is that most branches tend to be strongly biased (that is, if we use two-bit entries in the predictor table, we expect that, without interference, most entries will be <code>00</code> or <code>11</code> most of the time, not <code>01</code> or <code>10</code>). For each branch in the program, we’ll store one bit, which we call the “bias”. The table of predictions will, instead of storing the absolute branch predictions, store whether or not the prediction matches the bias.</p> <p><img src="images/branch-prediction/20-agree.png" alt="predict whether or not a branch agrees with its bias as opposed to whether or not it's taken" height="782" width="1371"></p> <p>If we look at how this works, the predictor is identical to a gshare predictor, except that we make the changes mentioned above -- the prediction is agree/disagree instead of taken/not-taken and we have a bias bit that’s indexed by the branch address, which gives us something to agree or disagree with. In the original paper, they propose using the first thing you see as the bias and other people have proposed using profile-guided optimization (basically running the program and feeding the data back to the compiler) to determine the bias.</p> <p>Note that, when we execute a branch and then later come back around to the same branch, we’ll use the same bias bit because the bias is indexed by the branch address, but we’ll use a different predictor table entry because that’s indexed by both the branch address and the branch history.</p> <p><img src="images/branch-prediction/21-agree-update.png" alt="updating uses the same bias but a different meta-prediction table entry" height="782" width="1371"></p> <p>If it seems weird that this would do anything, let’s look at a concrete example. Say we have two branches, branch A which is taken with 90% probability and branch B which is taken with 10% probability. If those two branches alias and we assume the probabilities that each branch is taken are independent, the probability that they disagree and negatively interfere is <code>P(A taken) * P(B not taken) + P(A not taken) * P(B taken) = (0.9 * 0.9) + (0.1 * 0.1) = 0.82</code>.</p> <p>If we use the agree scheme, we can re-do the calculation above, but the probability that the two branches disagree and negatively interfere is <code>P(A agree) * P(B disagree) + P(A disagree) * P(B agree) = P(A taken) * P(B taken) + P(A not taken) * P(B not taken) = (0.9 * 0.1) + (0.1 * 0.9) = 0.18</code>. Another way to look at it is, to have destructive interference, one of the branches must disagree with its bias. By definition, if we’ve correctly determined the bias, this is unlikely to happen.</p> <p>With this scheme, we might get something like 95% accuracy, giving us 1.19 cycles per instruction.</p> <p><img src="images/branch-prediction/22-agree-cpi.png" alt="1.19 cycles per instruction" height="432" width="1147"></p> <h4 id="used-by-6">Used by</h4> <ul> <li>PA-RISC 8700 (2001)</li> </ul> <h3 id="hybrid-http-www-hpl-hp-com-techreports-compaq-dec-wrl-tn-36-pdf-1993"><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-36.pdf">Hybrid</a> (1993)</h3> <p>As we’ve seen, local predictors can predict some kinds of branches well (e.g., inner loops) and global predictors can predict some kinds of branches well (e.g., some correlated branches). One way to try to get the best of both worlds is to have both predictors, then have a meta predictor that predicts if the local or the global predictor should be used. 
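(One detail worth spelling out is how the chooser gets trained. A common approach, sketched below with the two component predictors passed in as plain functions, is to nudge the chooser toward whichever component was right on branches where the two disagree; as with the other sketches above, this is illustrative rather than any specific CPU's design.)</p> <pre><code># Sketch of the chooser (&quot;meta-predictor&quot;) in a hybrid predictor. Each entry is
# a two-bit counter: 0-1 means trust predictor A, 2-3 means trust predictor B.
CHOOSER_SIZE = 64
chooser = [2] * CHOOSER_SIZE

def choose(branch_address, predict_a, predict_b):
    entry = chooser[branch_address % CHOOSER_SIZE]
    return predict_b(branch_address) if entry &gt;= 2 else predict_a(branch_address)

def train_chooser(branch_address, taken, predicted_a, predicted_b):
    # Only adjust the chooser on branches where the components disagreed,
    # moving it toward whichever one turned out to be right.
    if predicted_a == predicted_b:
        return
    i = branch_address % CHOOSER_SIZE
    if predicted_b == taken:
        chooser[i] = min(chooser[i] + 1, 3)
    else:
        chooser[i] = max(chooser[i] - 1, 0)
</code></pre> <p>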
A simple way to do this is to have the meta-predictor use the same scheme as the two-bit predictor above, except that instead of predicting <code>taken</code> or <code>not taken</code> it predicts <code>local predictor</code> or <code>global predictor</code>.</p> <p><img src="images/branch-prediction/23-hybrid.png" alt="predict which of two predictors is correct instead of predicting if the branch is taken" height="752" width="1358"></p> <p>Just as there are many possible interference-reducing schemes, of which the <code>agree</code> predictor above is one, there are many possible hybrid schemes. We could use any two predictors, not just a local and global predictor, and we could even use more than two predictors.</p> <p>If we use a local and global predictor, we might get something like 96% accuracy, giving us 1.15 cycles per instruction.</p> <p><img src="images/branch-prediction/24-hybrid-cpi.png" alt="1.15 cycles per instruction" height="496" width="1147"></p> <h4 id="used-by-7">Used by</h4> <ul> <li>DEC EV6 (1998): combination of local (1k entries, 10 history bits, 3 bit counter) &amp; global (4k entries, 12 history bits, 2 bit counter) predictors</li> <li>IBM POWER4 (2001): local (16k entries) &amp; gshare (16k entries, 11 history bits, xor with branch address, 16k selector table)</li> <li>IBM POWER5 (2004): combination of bimodal (not covered) and two-level adaptive</li> <li>IBM POWER7 (2010)</li> </ul> <h3 id="not-covered">Not covered</h3> <p>There are a lot of things we didn’t cover in this talk! As you might expect, the set of material that we didn’t cover is much larger than what we did cover. I’ll briefly describe a few things we didn’t cover, with references, so you can look them up if you’re interested in learning more.</p> <p>One major thing we didn’t talk about is <a href="https://en.wikipedia.org/wiki/Branch_target_predictor">how to predict the branch target</a>. Note that this needs to be done even for some unconditional branches (that is, branches that don’t need directional prediction because they’re always taken), since <a href="http://www.engr.uconn.edu/~zshi/course/cse5302/ref/bray91btb_tr.pdf">(some) unconditional branches have unknown branch targets</a>.</p> <p>Branch target prediction is expensive enough that some early CPUs had a branch prediction policy of “always predict not taken” because a branch target isn’t necessary when you predict the branch won’t be taken! Always predicting not taken has poor accuracy, but it’s still better than making no prediction at all.</p> <p>Among the interference reducing predictors we didn’t discuss are <a href="https://people.eecs.berkeley.edu/~kubitron/courses/cs152-S04/handouts/papers/p4-lee.pdf">bi-mode</a>, <a href="http://meseec.ce.rit.edu/eecc722-fall2002/papers/branch-prediction/7/michaud97trading.pdf">gskew</a>, and <a href="http://web.eecs.umich.edu/~tnm/papers/yags.pdf">YAGS</a>. Very briefly, bi-mode is somewhat like agree in that it tries to separate out branches based on direction, but the mechanism used in bi-mode is that we keep multiple predictor tables and a third predictor based on the branch address is used to predict which predictor table gets used for the particular combination of branch and branch history. Bi-mode appears to be more successful than agree in that it's seen wider use. With gskew, we keep at least three predictor tables and use a different hash to index into each table. 
The idea is that, even if two branches alias, those two branches will only alias in one of the tables, so we can use a vote and the result from the other two tables will override the potentially bad result from the aliasing table. I don't know how to describe YAGS very briefly :-).</p> <p>Because we didn't talk about speed (as in latency), a prediction strategy we didn't talk about is to have a small/fast predictor that can be overridden by a slower and more accurate predictor when the slower predictor computes its result.</p> <p>Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use <a href="https://www.cs.utexas.edu/~lin/papers/hpca01.pdf">perceptron-based branch predictors</a>. Perceptrons are single-layer neural nets.</p> <p><a href="https://hal.inria.fr/hal-01100647/document">It’s been argued that</a> Intel Haswell (2013) uses a variant of a <a href="http://www.irisa.fr/caps/people/seznec/JILP-COTTAGE.pdf">TAGE predictor</a>. TAGE stands for TAgged GEometric history length predictor. If we look at the predictors we’ve covered and look at actual executions of programs to see which branches we’re not predicting correctly, one major class of branches is branches that need a lot of history -- a significant number of branches need tens or hundreds of bits of history and some even need more than a thousand bits of branch history. If we have a single predictor or even a hybrid predictor that combines a few different predictors, it’s counterproductive to keep a thousand bits of history because that will make predictions worse for the branches which need a relatively small amount of history (especially relative to the cost), which is most branches. One of the ideas in the TAGE predictor is that, by keeping a geometric series of history lengths, each branch can use the appropriate history. That explains the GE. The TA part is that branches are tagged, which is a mechanism we don’t discuss that the predictor uses to track which branches should use which set of history.</p> <p>Modern CPUs often have specialized predictors, e.g., a loop predictor can accurately predict loop branches in cases where a generalized branch predictor couldn’t reasonably store enough history to make perfect predictions for every iteration of the loop.</p> <p>We didn’t talk at all about the tradeoff between using up more space and getting better predictions. Not only does changing the size of the table change the performance of a predictor, it also changes which predictors are better relative to each other.</p> <p>We also didn’t talk at all about how different workloads affect different branch predictors. Predictor performance varies not only based on table size but also based on which particular program is run.</p> <p>We’ve also talked about branch misprediction cost as if it’s a fixed thing, <a href="http://users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf">but it is not</a>, and for that matter, the cost of non-branch instructions also varies widely between different workloads.</p> <p>I tried to avoid introducing non-self-explanatory terminology when possible, so if you read the literature, terminology will be somewhat different.</p> <h3 id="conclusion">Conclusion</h3> <p>We’ve looked at a variety of classic branch predictors and very briefly discussed a couple of newer predictors. 
Some of the classic predictors we discussed are still used in CPUs today, and if this were an hour long talk instead of a half-hour long talk, we could have discussed state-of-the-art predictors. I think that a lot of people have an idea that CPUs are mysterious and hard to understand, but I think that CPUs are actually easier to understand than software. I might be biased because I used to work on CPUs, but I think that this is not a result of my bias but something fundamental.</p> <p>If you think about the complexity of software, the main limiting factor on complexity is your imagination. If you can imagine something in enough detail that you can write it down, you can make it. Of course there are cases where that’s not the limiting factor and there’s something more practical (e.g., the performance of large scale applications), but I think that most of us spend most of our time writing software where the limiting factor is the ability to create and manage complexity.</p> <p>Hardware is quite different from this in that there are forces that push back against complexity. Every chunk of hardware you implement costs money, so you want to implement as little hardware as possible. Additionally, performance matters for most hardware (whether that’s absolute performance or performance per dollar or per watt or per other cost), and adding complexity makes hardware slower, which limits performance. Today, you can buy an off-the-shelf CPU for $300 which can be overclocked to 5 GHz. At 5 GHz, one unit of work is one-fifth of one nanosecond. For reference, light travels roughly one foot in one nanosecond. Another limiting factor is that people get pretty upset when CPUs don’t work perfectly all of the time. Although <a href="cpu-bugs/">CPUs do have bugs</a>, the rate of bugs is much lower than in almost all software, i.e., the standard to which they’re verified/tested is much higher. Adding complexity makes things harder to test and verify. Because CPUs are held to a higher correctness standard than <a href="everything-is-broken/">most software</a>, adding complexity creates a much higher test/verification burden on CPUs, which makes adding a similar amount of complexity much more expensive in hardware than in software, even ignoring the other factors we discussed.</p> <p>A side effect of these factors that push back against chip complexity is that, for any particular “high-level” general purpose CPU feature, it is generally conceptually simple enough that it can be described in a half-hour or hour-long talk. CPUs are simpler than many programmers think! 
BTW, I say “high-level” to rule out things like how transistors and circuit design, which can require a fair amount of low-level (physics or solid-state) background to understand.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><em>Thanks to Leah Hanson, Hari Angepat, and Nick Bergson-Shilcock for reviewing practice versions of the talk and to Fred Clausen Jr for finding a typo in this post. Apologies for the somewhat slapdash state of this post -- I wrote it quickly so that people who attended the talk could refer to the “transcript ” soon afterwards and look up references, but this means that there are probably more than the usual number of errors and that the organization isn’t as nice as it would be for a normal blog post. In particular, things that were explained using a series of animations in the talk are not explained in the same level of detail and on skimming this, I notice that there’s less explanation of what sorts of branches each predictor doesn’t handle well, and hence less motivation for each predictor. I may try to go back and add more motivation, but I’m unlikely to restructure the post completely and generate a new set of graphics that better convey concepts when there are a couple of still graphics next to text. Thanks to Julien Vivenot, Ralph Corderoy, Vaibhav Sagar, Mindy Preston, Stefan Kanthak, and Uri Shaked for catching typos in this hastily written post.</em></p> Sattolo's algorithm sattolo/ Wed, 09 Aug 2017 00:00:00 +0000 sattolo/ <p>I recently had a problem where part of the solution was to do a series of pointer accesses that would walk around a chunk of memory in pseudo-random order. Sattolo's algorithm provides a solution to this because it produces a permutation of a list with exactly one cycle, which guarantees that we will reach every element of the list even though we're traversing it in random order.</p> <p>However, the explanations of why the algorithm worked that I could find online either used some kind of mathematical machinery (Stirling numbers, assuming familiarity with cycle notation, etc.), or used logic that was hard for me to follow. I find that this is common for explanations of concepts that could, but don't have to, use a lot of mathematical machinery. I don't think there's anything wrong with using existing mathematical methods per se -- it's a nice mental shortcut if you're familiar with the concepts. If you're taking a combinatorics class, it makes sense to cover Stirling numbers and then rattle off a series of results whose proofs are trivial if you're familiar with Stirling numbers, but for people who are only interested in a single result, I think it's unfortunate that it's hard to find a relatively simple explanation that <a href="https://twitter.com/danluu/status/1147984717238562816">doesn't require any background</a>. 
When I was looking for a simple explanation, I also found a lot of people who were using Sattolo's algorithm in places where it wasn't appropriate and also people who didn't know that Sattolo's algorithm is what they were looking for, so here's an attempt at an explanation of why the algorithm works that doesn't assume an undergraduate combinatorics background.</p> <p>Before we look at Sattolo's algorithm, let's look at Fisher-Yates, which is an <a href="https://en.wikipedia.org/wiki/In-place_algorithm">in-place</a> algorithm that produces a random permutation of an array/vector, where every possible permutation occurs with uniform probability.</p> <p>We'll look at the code for Fisher-Yates and then how to prove that the algorithm produces the intended result.</p> <pre><code>import random

def shuffle(a):
    n = len(a)
    for i in range(n - 1):           # i from 0 to n-2, inclusive.
        j = random.randrange(i, n)   # j from i to n-1, inclusive.
        a[i], a[j] = a[j], a[i]      # swap a[i] and a[j].
</code></pre> <p><code>shuffle</code> takes an array and produces a permutation of the array, i.e., it shuffles the array. We can think of this loop as placing each element of the array, <code>a</code>, in turn, from <code>a[0]</code> to <code>a[n-2]</code>. On some iteration, <code>i</code>, we choose one of <code>n-i</code> elements to swap with and swap element <code>i</code> with some random element. The last element in the array, <code>a[n-1]</code>, is skipped because it would always be swapped with itself. One way to see that this produces every possible permutation with uniform probability is to write down the probability that each element will end up in any particular location<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup>. Another way to do it is to observe two facts about this algorithm:</p> <ol> <li>Every output that Fisher-Yates produces is produced with uniform probability</li> <li>Fisher-Yates produces as many outputs as there are permutations (and each output is a permutation)</li> </ol> <p>(1) For each random choice we make in the algorithm, if we make a different choice, we get a different output. For example, if we look at the resultant <code>a[0]</code>, the only way to place the element that was originally in <code>a[k]</code> (for some <code>k</code>) in the resultant <code>a[0]</code> is to swap <code>a[0]</code> with <code>a[k]</code> in iteration <code>0</code>. If we choose a different element to swap with, we'll end up with a different resultant <code>a[0]</code>. Once we place <code>a[0]</code> and look at the resultant <code>a[1]</code>, the same thing is true of <code>a[1]</code> and so on for each <code>a[i]</code>. Additionally, each choice reduces the range by the same amount -- there's a kind of symmetry, in that although we place <code>a[0]</code> first, we could have placed any other element first; every choice has the same effect. This is vaguely analogous to the reason that you can pick an integer uniformly at random by picking digits uniformly at random, one at a time.</p> <p>(2) How many different outputs does Fisher-Yates produce? On the first iteration, we fix one of <code>n</code> possible choices for <code>a[0]</code>, then given that choice, we fix one of <code>n-1</code> choices for <code>a[1]</code>, then one of <code>n-2</code> for <code>a[2]</code>, and so on, so there are <code>n * (n-1) * (n-2) * ... 2 * 1 = n!</code> possible different outputs.</p> <p>This is exactly the same number of possible permutations of <code>n</code> elements, by pretty much the same reasoning. If we want to count the number of possible permutations of <code>n</code> elements, we first pick one of <code>n</code> possible elements for the first position, <code>n-1</code> for the second position, and so on resulting in <code>n!</code> possible permutations.</p> <p>Since Fisher-Yates only produces unique permutations and there are exactly as many outputs as there are permutations, Fisher-Yates produces every possible permutation. Since Fisher-Yates produces each output with uniform probability, it produces all possible permutations with uniform probability.</p> <p>Now, let's look at Sattolo's algorithm, which is almost identical to Fisher-Yates and also produces a shuffled version of the input, but produces something quite different:</p> <pre><code>def sattolo(a):
    n = len(a)
    for i in range(n - 1):
        j = random.randrange(i+1, n)   # i+1 instead of i
        a[i], a[j] = a[j], a[i]
</code></pre> <p>Instead of picking an element at random to swap with, like we did in Fisher-Yates, we pick an element at random that is not the element being placed, i.e., we do not allow an element to be swapped with itself. One side effect of this is that no element ends up where it originally started.</p> <p>Before we talk about why this produces the intended result, let's make sure we're on the same page regarding terminology. One way to look at an array is to view it as a description of a graph where the index indicates the node and the value indicates where the edge points to. For example, if we have the list <code>0 2 3 1</code>, this can be thought of as a directed graph from its indices to its values, which is a graph with the following edges:</p> <pre><code>0 -&gt; 0
1 -&gt; 2
2 -&gt; 3
3 -&gt; 1
</code></pre> <p>Node 0 points to itself (because the value at index 0 is 0), node 1 points to node 2 (because the value at index 1 is 2), and so on. If we traverse this graph, we see that there are two cycles: <code>0 -&gt; 0 -&gt; 0 ...</code> and <code>1 -&gt; 2 -&gt; 3 -&gt; 1...</code>.</p> <p>Let's say we swap the element in position 0 with some other element. It could be any element, but let's say that we swap it with the element in position 2. Then we'll have the list <code>3 2 0 1</code>, which can be thought of as the following graph:</p> <pre><code>0 -&gt; 3
1 -&gt; 2
2 -&gt; 0
3 -&gt; 1
</code></pre> <p>If we traverse this graph, we see the cycle <code>0 -&gt; 3 -&gt; 1 -&gt; 2 -&gt; 0...</code>. This is an example of a permutation with exactly one cycle.</p> <p>If we swap two elements that belong to different cycles, we'll merge the two cycles into a single cycle. One way to see this is that, when we swap two elements in the list, we're essentially picking up the arrow-heads pointing to each element and swapping where they point (rather than the arrow-tails, which stay put). Tracing the result of this is like tracing a figure-8. For example, say we swap <code>0</code> with an arbitrary element of the other cycle, let's say element 2; we'll end up with <code>3 2 0 1</code>, whose only cycle is <code>0 -&gt; 3 -&gt; 1 -&gt; 2 -&gt; 0...</code>. Note that this operation is reversible -- if we do the same swap again, we end up with two cycles again. In general, if we swap two elements from the same cycle, we break the cycle into two separate cycles.</p> <p>If we feed a list consisting of <code>0 1 2 ...
n-1</code> to Sattolo's algorithm we'll get a permutation with exactly one cycle. Furthermore, we have the same probability of generating any permutation that has exactly one cycle. Let's look at why Sattolo's generates exactly one cycle. Afterwards, we'll figure out why it produces all possible cycles with uniform probability.</p> <p>For Sattolo's algorithm, let's say we start with the list <code>0 1 2 3 ... n-1</code>, i.e., a list with <code>n</code> cycles of length <code>1</code>. On each iteration, we do one swap. If we swap elements from two separate cycles, we'll merge the two cycles, reducing the number of cycles by 1. We'll then do <code>n-1</code> iterations, reducing the number of cycles from <code>n</code> to <code>n - (n-1) = 1</code>.</p> <p>Now let's see why it's safe to assume we always swap elements from different cycles. In each iteration of the algorithm, we swap some element with index &gt; <code>i</code> with the element at index <code>i</code> and then increment <code>i</code>. Since <code>i</code> gets incremented, the element that gets placed into index <code>i</code> can never be swapped again, i.e., each swap puts one of the two elements that was swapped into its final position, i.e., for each swap, we take two elements that were potentially swappable and render one of them unswappable.</p> <p>When we start, we have <code>n</code> cycles of length <code>1</code>, each with <code>1</code> element that's swappable. When we swap the initial element with some random element, we'll take one of the swappable elements and render it unswappable, creating a cycle of length <code>2</code> with <code>1</code> swappable element and leaving us with <code>n-2</code> other cycles, each with <code>1</code> swappable element.</p> <p>The key invariant that's maintained is that each cycle has exactly <code>1</code> swappable element. The invariant holds in the beginning when we have <code>n</code> cycles of length <code>1</code>. And as long as this is true, every time we merge two cycles of any length, we'll take the swappable element from one cycle and swap it with the swappable element from the other cycle, rendering one of the two elements unswappable and creating a longer cycle that still only has one swappable element, maintaining the invariant.</p> <p>Since we cannot swap two elements from the same cycle, we merge two cycles with every swap, reducing the number of cycles by 1 with each iteration until we've run <code>n-1</code> iterations and have exactly one cycle remaining.</p> <p>To see that we generate each cycle with equal probability, note that there's only one way to produce each output, i.e., changing any particular random choice results in a different output. In the first iteration, we randomly choose one of <code>n-1</code> placements, then <code>n-2</code>, then <code>n-3</code>, and so on, so there are <code>(n-1) * (n-2) * (n-3) * ... * 2 * 1 = (n-1)!</code> equally likely sequences of random choices, and we produce any particular output with probability <code>1 / (n-1)!</code>.</p>
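<p>As a quick empirical sanity check on both claims (this snippet is just a sketch added here, reusing the <code>sattolo</code> function above), we can run the algorithm many times on a short list and count the distinct outputs and their frequencies; for a list of length 4 we expect <code>(4-1)! = 6</code> distinct outputs, each appearing about a sixth of the time:</p> <pre><code>from collections import Counter

counts = Counter()
for _ in range(600000):
    a = list(range(4))
    sattolo(a)
    counts[tuple(a)] += 1

print(len(counts))              # 6 distinct outputs, all one-cycle permutations
for perm, c in sorted(counts.items()):
    print(perm, c / 600000.0)   # each frequency is close to 1/6
</code></pre> <p>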
If we can show that there are <code>(n-1)!</code> permutations with exactly one cycle, then we'll know that we generate every permutation with exactly one cycle with uniform probability.</p> <p>Let's say we have an arbitrary list of length <code>n</code> that has exactly one cycle and we add a single element, there are <code>n</code> ways to extend that to become a cycle of length <code>n+1</code> because there are <code>n</code> places we could add in the new element and keep the cycle, which means that the number of cycles of length <code>n+1</code>, <code>cycles(n+1)</code>, is <code>n * cycles(n)</code>.</p> <p>For example, say we have a cycle that produces the path <code>0 -&gt; 1 -&gt; 2 -&gt; 0 ...</code> and we want to add a new element, <code>3</code>. We can substitute <code>-&gt; 3 -&gt;</code> for any <code>-&gt;</code> and get a cycle of length 4 instead of length 3.</p> <p>In the base case, there's one cycle of length 2, the permutation <code>1 0</code> (the other permutation of length two, <code>0 1</code>, has two cycles of length one instead of having a cycle of length 2), so we know that <code>cycles(2) = 1</code>. If we apply <a href="https://en.wikipedia.org/wiki/Recurrence_relation">the recurrence above</a>, we get that <code>cycles(n) = (n-1)!</code>, which is exactly the number of different permutations that Sattolo's algorithm generates, which means that we generate all possible permutations with one cycle. Since we know that we generate each cycle with uniform probability, we now know that we generate all possible one-cycle permutations with uniform probability.</p> <p>An alternate way to see that there are <code>(n-1)!</code> permutations with exactly one cycle, is that we rotate each cycle around so that <code>0</code> is at the start and write it down as <code>0 -&gt; i -&gt; j -&gt; k -&gt; ...</code>. The number of these is the same as the number of permutations of elements to the right of the <code>0 -&gt;</code>, which is <code>(n-1)!</code>.</p> <h3 id="conclusion">Conclusion</h3> <p>We've looked at two algorithms that are identical, except for a two character change. These algorithms produce quite different results -- one algorithm produces a random permutation and the other produces a random permutation with exactly one cycle. I think these algorithms are neat because they're so simple, just a double for loop with a swap.</p> <p>In practice, you probably don't &quot;need&quot; to know how these algorithms work because the standard library for most modern languages will have some way of producing a random shuffle. And if you have a function that will give you a shuffle, you can produce a permutation with exactly one cycle if you don't mind a non-in-place algorithm that takes an extra pass. I'll leave that as an exercise for the reader, but if you want a hint, one way to do it parallels the &quot;alternate&quot; way to see that there are <code>(n-1)!</code> permutations with exactly one cycle.</p> <p>Although I said that you probably don't need to know this stuff, you do actually need to know it if you're going to implement a custom shuffling algorithm! That may sound obvious, but there's a long history of people implementing incorrect shuffling algorithms. 
This was common in games and on <a href="http://www.datamation.com/entdev/article.php/616221/How-We-Learned-to-Cheat-at-Online-Poker-A-Study-in-Software-Security.htm">online gambling sites in the 90s and even the early 2000s</a> and you still see the occasional mis-implemented shuffle, e.g., when <a href="http://www.robweir.com/blog/2010/02/microsoft-random-browser-ballot.html">Microsoft implemented a bogus shuffle and failed to properly randomize a browser choice poll</a>. At the time, the top Google hit for <code>javascript random array sort</code> was <a href="https://web.archive.org/web/20100102004604/http://www.javascriptkit.com/javatutors/arraysort.shtml">the incorrect algorithm that Microsoft ended up using</a>. That site has been fixed, but you can still find incorrect tutorials floating around online.</p> <h4 id="appendix-generating-a-random-derangement">Appendix: generating a random derangement</h4> <p>A permutation where no element ends up in its original position is called a derangement. When I searched for uses of Sattolo's algorithm, I found many people using Sattolo's algorithm to generate random derangements. While Sattolo's algorithm generates derangements, it only generates derangements with exactly one cycle, and there are derangements with more than one cycle (e.g., <code>3 2 1 0</code>), so it can't possibly generate random derangements with uniform probability.</p> <p>One way to generate random derangements is to generate random shuffles using Fisher-Yates and then retry until we get a derangement:</p> <pre><code>def is_derangement(a):
    # True if no element is in its original position.
    return all(x != i for i, x in enumerate(a))

def derangement(n):
    assert n != 1, &quot;can't have a derangement of length 1&quot;
    a = list(range(n))
    while not is_derangement(a):
        shuffle(a)
    return a
</code></pre> <p>This algorithm is simple, and is overwhelmingly likely to eventually return a derangement (for n != 1), but it's not immediately obvious how long we should expect this to run before it returns a result. Maybe we'll get a derangement on the first try and run <code>shuffle</code> once, or maybe it will take 100 tries and we'll have to do 100 shuffles before getting a derangement.</p> <p>To figure this out, we'll want to know the probability that a random permutation (shuffle) is a derangement. To get that, we'll want to know, given a list of length <code>n</code>, how many permutations there are and how many derangements there are.</p> <p>Since we're deep in the appendix, I'll assume that you know <a href="https://en.wikipedia.org/wiki/Permutation">the number of permutations of <code>n</code> elements is <code>n!</code></a>, know what <a href="https://en.wikipedia.org/wiki/Binomial_coefficient">binomial coefficients</a> are, and are comfortable with <a href="https://en.wikipedia.org/wiki/Taylor_series">Taylor series</a>.</p> <p>To count the number of derangements, we can start with the number of permutations, <code>n!</code>, and subtract off the permutations where an element remains in its starting position, <code>(n choose 1) * (n - 1)!</code>. That isn't quite right because this double subtracts permutations where two elements remain in their starting positions, so we'll have to add back <code>(n choose 2) * (n - 2)!</code>. That isn't quite right either because we've now overcounted permutations where three elements remain in their starting positions, so we'll have to subtract those off, <a href="https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle">and so on and so forth</a>, resulting in <code>∑ (−1)ᵏ (n choose k)(n−k)!</code>. If we expand this out and divide by <code>n!</code> and cancel things out, we get <code>∑ (−1)ᵏ (1 / k!)</code>.</p>
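<p>To see numerically where this sum is heading (a quick sketch added here, not part of the derivation), we can evaluate the partial sums for a few values of <code>n</code>:</p> <pre><code>import math

# Partial sums of the alternating series: sum of (-1)^k / k! for k = 0..n
for n in (1, 2, 3, 5, 10):
    s = sum((-1.0) ** k / math.factorial(k) for k in range(n + 1))
    print(n, s)   # approaches 0.36787944... as n grows
</code></pre> <p>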
If we look at the limit as the number of elements goes to infinity, this looks just like <a href="https://en.wikipedia.org/wiki/Taylor_series">the Taylor series</a> for <code>e^x</code> where <code>x = -1</code>, i.e., <code>1/e</code>, i.e., in the limit, we expect that the fraction of permutations that are derangements is <code>1/e</code>, i.e., we expect to have to do <code>e</code> times as many swaps to generate a derangement as we do to generate a random permutation. Like many alternating series, this series converges quickly. It gets within 7 significant figures of <code>e</code> when <code>k = 10</code>!</p> <p>One silly thing about our algorithm is that, if we place the first element in the first location, we already know that we don't have a derangement, but we continue placing elements until we've created an entire permutation. If we reject illegal placements, we can do even better than a factor of <code>e</code> overhead. It's also possible to come up with <a href="http://mathforum.org/library/drmath/view/61957.html">a non-rejection based algorithm</a>, but I really enjoy the naive rejection based algorithm because I find it delightful when <a href="https://www.cs.princeton.edu/courses/archive/fall13/cos521/lecnotes/lec2final.pdf">basic randomized algorithms that consist of &quot;keep trying again&quot; work well</a>.</p> <h4 id="appendix-wikipedia-s-explanation-of-sattolo-s-algorithm">Appendix: wikipedia's explanation of Sattolo's algorithm</h4> <p>I wrote this explanation because I found <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">the explanation in Wikipedia</a> relatively hard to follow, but if you find the explanation above difficult to understand, maybe you'll prefer wikipedia's version:</p> <blockquote> <p>The fact that Sattolo's algorithm always produces a cycle of length n can be shown by induction. Assume by induction that after the initial iteration of the loop, the remaining iterations permute the first n - 1 elements according to a cycle of length n - 1 (those remaining iterations are just Sattolo's algorithm applied to those first n - 1 elements). This means that tracing the initial element to its new position p, then the element originally at position p to its new position, and so forth, one only gets back to the initial position after having visited all other positions. Suppose the initial iteration swapped the final element with the one at (non-final) position k, and that the subsequent permutation of first n - 1 elements then moved it to position l; we compare the permutation π of all n elements with that remaining permutation σ of the first n - 1 elements. Tracing successive positions as just mentioned, there is no difference between σ and π until arriving at position k. But then, under π the element originally at position k is moved to the final position rather than to position l, and the element originally at the final position is moved to position l. From there on, the sequence of positions for π again follows the sequence for σ, and all positions will have been visited before getting back to the initial position, as required.</p> <p>As for the equal probability of the permutations, it suffices to observe that the modified algorithm involves (n-1)! distinct possible sequences of random numbers produced, each of which clearly produces a different permutation, and each of which occurs--assuming the random number source is unbiased--with equal probability. The (n-1)! 
different permutations so produced precisely exhaust the set of cycles of length n: each such cycle has a unique cycle notation with the value n in the final position, which allows for (n-1)! permutations of the remaining values to fill the other positions of the cycle notation</p> </blockquote> <p><em>Thanks to Mathieu Guay-Paquet, Leah Hanson, Rudi Chen, Kamal Marhubi, Michael Robert Arntzenius, Heath Borders, Shreevatsa R, @chozu@fedi.absturztau.be, and David Turner for comments/corrections/discussion.</em></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:P"><p><code>a[0]</code> is placed on the first iteration of the loop. Assuming <code>randrange</code> generates integers with uniform probability in the appropriate range, the original <code>a[0]</code> has <code>1/n</code> probability of being swapped with any element (including itself), so the resultant <code>a[0]</code> has a 1/n chance of being any element from the original <code>a</code>, which is what we want.</p> <p><code>a[1]</code> is placed on the second iteration of the loop. At this point, <code>a[0]</code> is some element from the array before it was mutated. Let's call the unmutated array <code>original</code>. <code>a[0]</code> is <code>original[k]</code>, for some <code>k</code>. For any particular value of <code>k</code>, it contains <code>original[k]</code> with probability <code>1/n</code>. We then swap <code>a[1]</code> with some element from the range <code>[1, n-1]</code>.</p> <p>If we want to figure out the probability that <code>a[1]</code> is some particular element from <code>original</code>, we might think of this as follows: <code>a[0]</code> is <code>original[k_0]</code> for some <code>k_0</code>. <code>a[1]</code> then becomes <code>original[k_1]</code> for some <code>k_1</code> where <code>k_1 != k_0</code>. Since <code>k_0</code> was chosen uniformly at random, if we integrate over all <code>k_0</code>, <code>k_1</code> is also uniformly random.</p> <p>Another way to look at this is that it's arbitrary that we place <code>a[0]</code> and choose <code>k_0</code> before we place <code>a[1]</code> and choose <code>k_1</code>. We could just have easily placed <code>a[1]</code> and chosen <code>k_1</code> first so, over all possible choices, the choice of <code>k_0</code> cannot bias the choice of <code>k_1</code>.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> Terminal latency term-latency/ Tue, 18 Jul 2017 00:00:00 +0000 term-latency/ <p>There’s <a href="https://www.youtube.com/watch?v=vOvQCPLkPt4">a great MSR demo from 2012 that shows the effect of latency on the experience of using a tablet</a>. If you don’t want to watch the three minute video, they basically created a device which could simulate arbitrary latencies down to a fraction of a millisecond. At 100ms (1/10th of a second), which is typical of consumer tablets, the experience is terrible. At 10ms (1/100th of a second), the latency is noticeable, but the experience is ok, and at &lt; 1ms the experience is great, as good as pen and paper. If you want to see a mini version of this for yourself, you can try a random Android tablet with a stylus vs. the current generation iPad Pro with the Apple stylus. 
The Apple device has well above 10ms end-to-end latency, but the difference is still quite dramatic -- it’s enough that I’ll actually use the new iPad Pro to take notes or draw diagrams, whereas I find Android tablets unbearable as a pen-and-paper replacement.</p> <p>You can also see something similar if you try VR headsets with different latencies. <a href="http://oculusrift-blog.com/john-carmacks-message-of-latency/682/">20ms feels fine, 50ms feels laggy, and 150ms feels unbearable</a>.</p> <p>Curiously, I rarely hear complaints about keyboard and mouse input being slow. One reason might be that keyboard and mouse input are quick and that inputs are reflected nearly instantaneously, but I don’t think that’s true. People often tell me that’s true, but I think it’s just the opposite. The idea that computers respond quickly to input, so quickly that humans can’t notice the latency, is the most common performance-related fallacy I hear from professional programmers.</p> <p>When people measure actual end-to-end latency for games on normal computer setups, they usually find latencies in the <a href="http://renderingpipeline.com/2013/09/measuring-input-latency/">100ms</a> <a href="https://www.youtube.com/watch?v=GxaEJY-zd_4&amp;index=5&amp;list=PLfOoCUS0PSkXVGjhB63KMDTOT5sJ0vWy8&amp;t=187s">range</a>.</p> <p>If we look at <a href="http://renderingpipeline.com/2013/09/measuring-input-latency/">Robert Menzel’s breakdown of the end-to-end pipeline for a game</a>, it’s not hard to see why we expect to see 100+ ms of latency:</p> <ul> <li>~2 msec (mouse)</li> <li>8 msec (average time we wait for the input to be processed by the game)</li> <li>16.6 msec (game simulation)</li> <li>16.6 msec (rendering code)</li> <li>16.6 msec (GPU is rendering the previous frame, current frame is cached)</li> <li>16.6 msec (GPU rendering)</li> <li>8 msec (average for missing the vsync)</li> <li>16.6 msec (frame caching inside of the display)</li> <li>16.6 msec (redrawing the frame)</li> <li>5 msec (pixel switching)</li> </ul> <p>Note that this assumes a gaming mouse and a pretty decent LCD; it’s common to see substantially slower latency for the mouse and for pixel switching.</p> <p>It’s possible to tune things to get into the 40ms range, but the vast majority of users don’t do that kind of tuning, and even if they do, that’s still quite far from the 10ms to 20ms range, where tablets and VR start to feel really “right”.</p> <p>Keypress-to-display measurements are mostly done in games because gamers care more about latency than most people, but I don’t think that most applications are all that different from games in terms of latency. While games often do much more work per frame than “typical” applications, they’re also much better optimized than “typical” applications. Menzel budgets 33ms to the game, half for game logic and half for rendering. How much time do non-game applications take? Pavel Fatin measured this for text editors and found latencies ranging from <a href="https://pavelfatin.com/typing-with-pleasure/">a few milliseconds to hundreds of milliseconds</a>. He did this <a href="https://github.com/pavelfatin/typometer">with an app he wrote that we can use to measure the latency of other applications</a>, which uses <a href="https://docs.oracle.com/javase/7/docs/api/java/awt/Robot.html">java.awt.Robot</a> to generate keypresses and do screen captures.</p> <p>Personally, I’d like to see the latency of different terminals and shells for a couple of reasons.
First, I spend most of my time in a terminal and usually do editing in a terminal, so the latency I see is at least the latency of the terminal. Second, the most common terminal benchmark I see cited (by at least two orders of magnitude) is the rate at which a terminal can display output, often measured by running <code>cat</code> on a large file. This is pretty much as useless a benchmark as I can think of. I can’t recall the last task I did which was limited by the speed at which I can <code>cat</code> a file to <code>stdout</code> on my terminal (well, unless I’m using eshell in emacs), nor can I think of any task for which that sub-measurement is useful. The closest thing that I care about is the speed at which I can <code>^C</code> a command when I’ve accidentally output too much to <code>stdout</code>, but as we’ll see when we look at actual measurements, a terminal’s ability to absorb a lot of input to <code>stdout</code> is only weakly related to its responsiveness to <code>^C</code>. The speed at which I can scroll up or down an entire page sounds related, but in actual measurements the two are not highly correlated (e.g., emacs-eshell is quick at scrolling but extremely slow at sinking <code>stdout</code>). Another thing I care about is latency, but knowing that a particular terminal has high <code>stdout</code> throughput tells me little to nothing about its latency.</p> <p>Let’s look at some different terminals to see if any terminals add enough latency that we’d expect the difference to be noticeable. If we measure the latency from keypress to internal screen capture on my laptop, we see the following latencies for different terminals:</p> <p><img src="images/term-latency/idle-terminal-latency.svg" alt="Plot of terminal tail latency"> <img src="images/term-latency/loaded-terminal-latency.svg" alt="Plot of terminal tail latency"></p> <p>These graphs show the distribution of latencies for each terminal. The y-axis has the latency in milliseconds. The x-axis is the percentile (e.g., 50 represents the 50%-ile keypress, i.e., the median keypress). Measurements are on macOS unless otherwise stated. The graph on the left is when the machine is idle, and the graph on the right is under load. If we just look at median latencies, some setups don’t look too bad -- terminal.app and emacs-eshell are at roughly 5ms unloaded, small enough that many people wouldn’t notice. But most terminals (st, alacritty, hyper, and iterm2) are in the range <a href="https://pdfs.semanticscholar.org/386a/15fd85c162b8e4ebb6023acdce9df2bd43ee.pdf">where you might expect people</a> to <a href="http://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf">notice the additional latency</a> even when the machine is idle. If we look at the tail when the machine is idle, say the 99.9%-ile latency, every terminal gets into the range where the additional latency ought to be perceptible, according to studies on user interaction. For reference, the internally generated keypress to GPU memory trip for some terminals is slower than <a href="http://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html">the time it takes to send a packet from Boston to Seattle <em>and back</em></a>, about 70ms.</p> <p>All measurements were done with input only happening on one terminal at a time, with full battery and running off of A/C power.</p>
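<p>For readers who haven't worked with percentile plots before, here's a small sketch (added here, not the code used for these measurements) of how raw per-keypress latency samples turn into percentile numbers like the ones plotted above; the sample values are made up purely for illustration:</p> <pre><code>def percentile(samples, p):
    # Nearest-rank percentile; p is between 0 and 100.
    s = sorted(samples)
    k = int(round(p / 100.0 * (len(s) - 1)))
    return s[k]

latencies_ms = [5.2, 4.9, 6.1, 5.5, 48.0]   # made-up per-keypress latencies
for p in (50, 90, 99.9):
    print(p, percentile(latencies_ms, p))
</code></pre> <p>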
The loaded measurements were done while compiling Rust (as before, with full battery and running off of A/C power, and in order to make the measurements reproducible, each measurement started 15s after a clean build of Rust after downloading all dependencies, with enough time between runs to avoid thermal throttling interference across runs).</p> <p>If we look at median loaded latencies, other than emacs-term, most terminals don’t do much worse than at idle. But as we look at tail measurements, like 90%-ile or 99.9%-ile measurements, every terminal gets much slower. Switching between macOS and Linux makes some difference, but the difference is different for different terminals.</p> <p>These measurements aren't anywhere near the worst case (if we run off of battery when the battery is low, and wait 10 minutes into the compile in order to exacerbate thermal throttling, it’s easy to see latencies that are multiple hundreds of ms) but even so, every terminal has tail latency that should be observable. Also, recall that this is only a fraction of the total end-to-end latency.</p> <p>Why don’t people complain about keyboard-to-display latency the way they complain about stylus-to-display latency or VR latency? My theory is that, for both VR and tablets, people have a lot of experience with a much lower latency application. For tablets, the “application” is pen-and-paper, and for VR, the “application” is turning your head without a VR headset on. But input-to-display latency is so bad for every application that most people just expect terrible latency.</p> <p>An alternate theory might be that keyboard and mouse input are fundamentally different from tablet input in a way that makes latency less noticeable. Even without any data, I’d find that implausible because, when I access a remote terminal in a way that adds tens of milliseconds of extra latency, I find typing to be noticeably laggy.
And it turns out that when extra latency is A/B tested, <a href="http://forums.blurbusters.com/viewtopic.php?f=10&amp;t=1134">people can and do notice latency in the range we’re discussing here</a>.</p> <p>Just so we can compare the most commonly used benchmark (throughput of stdout) to latency, let’s measure how quickly different terminals can sink input on stdout: <style>table {border-collapse: collapse;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <div style="overflow-x:auto;"></p> <table> <thead> <tr> <th>terminal</th> <th>stdout<br>(MB/s)</th> <th>idle50<br>(ms)</th> <th>load50<br>(ms)</th> <th>idle99.9<br>(ms)</th> <th>load99.9<br>(ms)</th> <th>mem<br>(MB)</th> <th>^C</th> </tr> </thead> <tbody> <tr> <td>alacritty</td> <td>39</td> <td>31</td> <td>28</td> <td>36</td> <td>56</td> <td>18</td> <td>ok</td> </tr> <tr> <td>terminal.app</td> <td>20</td> <td>6</td> <td>13</td> <td>25</td> <td>30</td> <td>45</td> <td>ok</td> </tr> <tr> <td>st</td> <td>14</td> <td>25</td> <td>27</td> <td>63</td> <td>111</td> <td>2</td> <td>ok</td> </tr> <tr> <td>alacritty tmux</td> <td>14</td> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>terminal.app tmux</td> <td>13</td> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>iterm2</td> <td>11</td> <td>44</td> <td>45</td> <td>60</td> <td>81</td> <td>24</td> <td>ok</td> </tr> <tr> <td>hyper</td> <td>11</td> <td>32</td> <td>31</td> <td>49</td> <td>53</td> <td>178</td> <td>fail</td> </tr> <tr> <td>emacs-eshell</td> <td>0.05</td> <td>5</td> <td>13</td> <td>17</td> <td>32</td> <td>30</td> <td>fail</td> </tr> <tr> <td>emacs-term</td> <td>0.03</td> <td>13</td> <td>30</td> <td>28</td> <td>49</td> <td>30</td> <td>ok</td> </tr> </tbody> </table> <p></div> The relationship between the rate that a terminal can sink <code>stdout</code> and its latency is non-obvious. For the matter, the relationship between the rate at which a terminal can sink <code>stdout</code> and how fast it looks is non-obvious. During this test, terminal.app looked very slow. The text that scrolls by jumps a lot, as if the screen is rarely updating. Also, hyper and emacs-term both had problems with this test. Emacs-term can’t really keep up with the output and it takes a few seconds for the display to finish updating after the test is complete (the status bar that shows how many lines have been output appears to be up to date, so it finishes incrementing before the test finishes). Hyper falls further behind and pretty much doesn’t update the screen after a flickering a couple of times. The <code>Hyper Helper</code> process gets pegged at 100% CPU for about two minutes and the terminal is totally unresponsive for that entire time.</p> <p>Alacritty was tested with tmux because alacritty doesn’t support scrolling back up, and the docs indicate that you should use tmux if you want to be able to scroll up. Just to have another reference, terminal.app was also tested with tmux. For most terminals, tmux doesn’t appear to reduce <code>stdout</code> speed, but alacritty and terminal.app are fast enough that they’re actually limited by the speed of tmux.</p> <p>Emacs-eshell is technically not a terminal, but I also tested eshell because it can be used as a terminal alternative for some use cases. Emacs, with both eshell and term, is actually slow enough that I care about the speed at which it can sink <code>stdout</code>. 
When I’ve used eshell or term in the past, I find that I sometimes have to wait for a few thousand lines of text to scroll by if I run a command with verbose logging to <code>stdout</code> or <code>stderr</code>. Since that happens very rarely, it’s not really a big deal to me unless it’s so slow that I end up waiting half a second or a second when it happens, and no other terminal is slow enough for that to matter.</p> <p>Conversely, I type individual characters often enough that I’ll notice tail latency. Say I type at 120wpm and that results in 600 characters per minute, or 10 characters per second of input. Then I’d expect to see the 99.9% tail (1 in 1000) every 100 seconds!</p> <p>Anyway, the <code>cat</code> “benchmark” that I care about more is whether or not I can <code>^C</code> a process when I’ve accidentally run a command that outputs millions of lines to the screen instead of thousands of lines. For that benchmark, every terminal is fine except for hyper and emacs-eshell, both of which hung for at least ten minutes (I killed each process after ten minutes, rather than waiting for the terminal to catch up).</p> <p>Memory usage at startup is also included in the table for reference because that's the other measurement I see people benchmark terminals with. While I think that it's a bit absurd that terminals can use 40MB at startup, even the three year old hand-me-down laptop I'm using has 16GB of RAM, so squeezing that 40MB down to 2MB doesn't have any appreciable effect on user experience. Heck, even the $300 chromebook we recently got has 16GB of RAM.</p> <h3 id="conclusion">Conclusion</h3> <p>Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, <a href="https://github.com/jwilm/alacritty/issues/289">benchmark the speed of</a> <a href="https://github.com/jwilm/alacritty/issues/205">sinking stdout</a> or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink <code>stdout</code> many orders of magnitude faster than humans can keep up with, so further optimizing <code>stdout</code> throughput has a relatively small impact on actual user experience for most users. Likewise for reducing memory usage when an idle terminal uses 0.01% of the memory on my old and now quite low-end laptop.</p> <p>If you work on a terminal, perhaps consider spending relatively more effort optimizing latency and interactivity (e.g., responsiveness to <code>^C</code>) and relatively less optimizing throughput and idle memory usage.</p> <p><em>Update: In response to this post, <a href="https://github.com/jwilm/alacritty/issues/673">the author of alacritty explains where alacritty's latency comes from and describes how alacritty could reduce its latency</a>.</em></p> <h3 id="appendix-negative-results">Appendix: negative results</h3> <p>Tmux and latency: I tried tmux and various terminals and found that the differences were within the range of measurement noise.</p> <p>Shells and latency: I tried a number of shells and found that, even in the quickest terminal, the difference between shells was within the range of measurement noise.
Powershell was somewhat problematic to test with the setup I was using because it doesn’t handle colors correctly (the first character typed shows up with the color specified by the terminal, but other characters are yellow regardless of setting, <a href="https://github.com/lzybkr/PSReadLine/issues/472">which appears to be an open issue</a>), which confused the image recognition setup I used. <a href="https://www.youtube.com/watch?v=cz5Hczlzvio">Powershell also doesn’t consistently put the cursor where it should be</a> -- it jumps around randomly within a line, which also confused the image recognition setup I used. However, despite its other problems, powershell had comparable performance to other shells.</p> <p>Shells and stdout throughput: As above, the speed difference between different shells was within the range of measurement noise.</p> <p>Single-line vs. multiline text and throughput: Although some text editors bog down with extremely long lines, throughput was similar when I shoved a large file into a terminal whether the file was all one line or was line broken every 80 characters.</p> <p>Head of line blocking / coordinated omission: I ran these tests with input at a rate of 10.3 characters per second. But it turns out this doesn't matter much at input rates that humans are capable of, and the latencies are quite similar to doing input once every 10.3 seconds. It's possible to overwhelm a terminal, and hyper is the first to start falling over at high input rates, but the speed necessary to make the tail latency worse is beyond the rate at which any human I know of can type.</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>All tests were done on a dual core 2.6GHz 13” Mid-2014 Macbook pro. The machine has 16GB of RAM and a 2560x1600 screen. The OS X version was 10.12.5. Some tests were done in Linux (Lubuntu 16.04) to get a comparison between macOS and Linux. 10k keypresses were used for each latency measurement.</p> <p>Latency measurements were done with the <code>.</code> key and throughput was done with default <code>base32</code> output, which is all plain ASCII text. George King notes that different kinds of text can change output speed:</p> <blockquote> <p>I’ve noticed that Terminal.app slows dramatically when outputting non-latin unicode ranges. I’m aware of three things that might cause this: having to load different font pages, and having to parse code points outside of the BMP, and wide characters.</p> <p>The first probably boils down to a very complicated mix of lazy loading of font glyphs, font fallback calculations, and caching of the glyph pages or however that works.</p> <p>The second is a bit speculative, but I would bet that Terminal.app uses Cocoa’s UTF16-based NSString, which almost certainly hits a slow path when code points are above the BMP due to surrogate pairs.</p> </blockquote> <p>Terminals were fullscreened before running tests. This affects test results, and resizing the terminal windows can and does significantly change performance (e.g., it’s possible to get hyper to be slower than iterm2 by changing the window size while holding everything else constant). st on macOS was running as an X client under XQuartz. To see if XQuartz is inherently slow, I tried <a href="https://github.com/doy/runes/">runes</a>, another &quot;native&quot; Linux terminal that uses XQuartz; runes had much better tail latency than st and iterm2.</p> <p>The “idle” latency tests were done on a freshly rebooted machine.
All terminals were running, but input was only fed to one terminal at a time.</p> <p>The “loaded” latency tests were done with rust compiling in the background, 15s after the compilation started.</p> <p>Terminal bandwidth tests were done by creating a large, pseudo-random text file with</p> <pre><code>timeout 64 sh -c 'cat /dev/urandom | base32 &gt; junk.txt' </code></pre> <p>and then running</p> <pre><code>timeout 8 sh -c 'cat junk.txt | tee junk.term_name' </code></pre> <p>Terminator and urxvt weren’t tested because they weren’t completely trivial to install on mac and I didn’t want to futz around to make them work. Terminator was easy to build from source, but it hung on startup and didn’t get to a shell prompt. Urxvt installed through brew, but one of its dependencies (also installed through brew) was the wrong version, which prevented it from starting.</p> <p><small> Thanks to Kamal Marhubi, Leah Hanson, Wesley Aptekar-Cassels, David Albert, Vaibhav Sagar, Indradhanush Gupta, Rudi Chen, Laura Lindzey, Ahmad Jarara, George King, Tim Dierks, Nikith Naide, Veit Heller, and Nick Bergson-Shilcock for comments/corrections/discussion. </small></p> The widely cited studies on mouse vs. keyboard efficiency are completely bogus keyboard-v-mouse/ Tue, 13 Jun 2017 00:00:00 +0000 keyboard-v-mouse/ <p>Which is faster, keyboard or mouse? A large number of programmers believe that the keyboard is faster for all (programming-related) tasks. However, there are a few widely cited webpages on AskTog which claim that Apple studies show that using the mouse is faster than using the keyboard for everything and that people who think that using the keyboard is faster are just deluding themselves. This might sound extreme, but, just for example, one page says that the author has “never seen [the keyboard] outperform the mouse”.</p> <p>But it can’t be the case that the mouse is faster for everything — almost no one is faster at clicking on an on-screen keyboard with a mouse than typing at a physical keyboard. Conversely, there are tasks for which mice are much better suited than keyboards (e.g., aiming in FPS games). For someone without an agenda, the question shouldn’t be, which is faster at all tasks, but which tasks are faster with a keyboard, which are faster with a mouse, and which are faster when both are used?</p> <p>You might ask if any of this matters. It depends! One of the best programmers I know is a hunt-and-peck typist, so it's clearly possible to be a great programmer without having particularly quick input speed. But I'm in the middle of an easy data munging task where I'm limited by the speed at which I can type in a large amount of boring code. If I were quicker, this task would be quicker, and there are tasks that I don't do that I might do. I can type at &gt; 100 wpm, which isn't bad, but I can talk at &gt; 400 wpm and I can think much faster than I can talk. I'm often rate limited even when talking; typing is much worse and the half-a-second here and one-second there I spend on navigation certainly doesn't help. When I first got started in tech, I had a mundane test/verification/QA role where my primary job was to triage test failures. Even before I started automating tasks, I could triage nearly twice as many bugs per day as other folks in the same role because I took being efficient at basic navigation tasks seriously.
Nowadays, my jobs aren't 90% rote anymore, but my guess is that about a third of the time I spend in front of a computer is spent on mindless tasks that are rate-limited by my input and navigation speed. If I could get faster at those mundane tasks and have to spend less time on them and more time doing things that are fun, that would be great.</p> <p>Anyway, to start, let’s look at the cited studies to see where the mouse is really faster. Most references on the web, when followed all the way back, point to AskTog, a site by <a href="https://en.wikipedia.org/wiki/Bruce_Tognazzini">Bruce Tognazzini</a>, who describes himself as a &quot;recognized leader in human/computer interaction design&quot;.</p> <p><a href="http://www.asktog.com/TOI/toi06KeyboardVMouse1.html">The most cited AskTog page on the topic claims that they've spent $50M on R&amp;D and done all kinds of studies</a>; the page claims that, among other things, the $50M in R&amp;D showed “Test subjects consistently report that keyboarding is faster than mousing” and “The stopwatch consistently proves mousing is faster than keyboarding.” The claim is that this both proves that the mouse is faster than the keyboard, and explains why programmers think the keyboard is faster than the mouse even though it’s slower. However, the result is unreproducible because “Tog” not only doesn’t cite the details of the experiments, Tog doesn’t even describe the experiments and just makes a blanket claim.</p> <p>The <a href="http://www.asktog.com/TOI/toi22KeyboardVMouse2.html">second widely cited AskTog page</a> is in response to a response to the previous page, and it simply repeats that the first page showed that keyboard shortcuts are slower. While there’s a lot of sarcasm, like “Perhaps we have all been misled these years. Perhaps the independent studies that show over and over again that Macintosh users are more productive, can learn quicker, buy more software packages, etc., etc., etc., are somehow all flawed. Perhaps....”, no actual results are cited, as before. There is, however, a pseudo-scientific explanation of why the mouse is faster than the keyboard:</p> <blockquote> <p>Command Keys Aren’t Faster. As you know from my August column, it takes just as long to decide upon a command key as it does to access the mouse. The difference is that the command-key decision is a high-level cognitive function of which there is no long-term memory generated. Therefore, subjectively, keys seem faster when in fact they usually take just as long to use.</p> <p>Since mouse acquisition is a low-level cognitive function, the user need not abandon cognitive process on the primary task during the acquisition period. Therefore, the mouse acquirer achieves greater productivity.</p> </blockquote> <p>One question this raises is, why should typing on the keyboard be any different from using command keys? There certainly are people who aren’t fluent at touch typing who have to think about which key they’re going to press when they type. Those people are very slow typists, perhaps even slower than someone who’s quick at using the mouse to type via an on-screen keyboard. But there are also people who are fluent with the keyboard and can type without consciously thinking about which keys they’re going to press. The implicit claim here is that it’s not possible to be fluent with command keys in the same way it’s possible to be fluent with the keyboard for typing.
It’s possible that’s true, but I find the claim to be highly implausible, both in principle, and from having observed people who certainly seem to be fluent with command keys, and the claim has no supporting evidence.</p> <p><a href="http://www.asktog.com/SunWorldColumns/S02KeyboardVMouse3.html">The third widely cited AskTog page cites a single experiment</a>, where the author typed a paragraph and then had to replace every “e” with a “|”, either using cursor keys or the mouse. The author found that the average time for using cursor keys was 99.43 seconds and the average time for the mouse was 50.22 seconds. No information about the length of the paragraph or the number of “e”s was given. The third page was in response to a user who cited specific editing examples where they found that they were faster with a keyboard than with a mouse.</p> <p>My experience with benchmarking is that the vast majority of microbenchmarks have wrong or misleading results because they’re difficult to set up properly, and even when set up properly, understanding how the microbenchmark results relate to real-world results requires a deep understanding of the domain. As a result, I’m deeply skeptical of broad claims that come from microbenchmarks unless the author has a demonstrated, deep understanding of benchmarking their particular domain, and even then I’ll ask why they believe their result generalizes. The opinion that microbenchmarks are very difficult to interpret properly is <a href="https://twitter.com/shipilev/status/709982588673388544">widely shared among people who understand benchmarking</a>.</p> <p>The <code>e -&gt; |</code> replacement task described is not only a microbenchmark, it's a bizarrely artificial microbenchmark.</p> <p>Based on the times given in the result, the task was either for very naive users, or disallowed any kind of search and replace functionality. This particular AskTog column is in response to a programmer who mentioned editing tasks, so the microbenchmark is meaningless unless that programmer is trapped in an experiment where they’re not allowed to use their editor’s basic functionality. Moreover, the replacement task itself is unrealistic — how often do people replace <code>e</code> with <code>|</code>?</p> <p>I timed this task with the bizarre no-search-and-replace restriction removed and got the following results:</p> <ul> <li>Keyboard shortcut: 1.26s</li> <li>M-x, “replace-string” (instead of using mapped keyboard shortcut): 2.8s</li> <li>Navigate to search and replace with mouse: 5.39s</li> </ul> <p>The first result was from using a keyboard shortcut. The second result is something I might do if I were in someone else’s emacs setup, which has different keyboard shortcuts mapped; emacs lets you run a command by hitting “M-x” and typing the entire name of the command. That’s much slower than using a keyboard shortcut directly, but still faster than using the mouse (at least for me, here). Does this mean that keyboards are great and mice are terrible? No, the result is nearly totally meaningless because I spend almost none of my time doing single-character search-and-replace, making the speed of single-character search-and-replace irrelevant.</p> <p>Also, since I’m used to using the keyboard, the mouse speed here is probably unusually slow. That’s doubly true here because my normal editor setup (<code>emacs -nw</code>) doesn’t allow for mouse usage, so I ended up using an unfamiliar editor, <code>TextEdit</code>, for the mouse test.
I did each task once in order to avoid “practicing” the exact task, which could unrealistically make the keyboard-shortcut version nearly instantaneous because it’s easy to hit a practiced sequence of keys very quickly. However, this meant that I was using an unfamiliar mouse in an unfamiliar set of menus for the mouse. Furthermore, like many people who’ve played video games in the distant past, I’m used to having “<a href="https://www.google.com/search?q=mouse+acceleration">mouse acceleration</a>” turned off, but the Mac has this on by default and I didn’t go through the rigmarole necessary to disable mouse acceleration. Additionally, the recording program I used (QuickTime) made the entire machine laggy, which probably affects mousing speed more than keyboard speed, and the menu setup for the program I happened to use forced me to navigate through two levels of menus.</p> <p>That being said, despite not being used to the mouse, if I want to find a microbenchmark where I’m faster with the mouse than with the keyboard, that’s easy: let me try selecting a block of text that’s on the screen but not near my cursor:</p> <ul> <li>Keyboard: 1.8s</li> <li>Mouse: 0.7s</li> </ul> <p>I tend to do selection of blocks in emacs by searching for something at the start of the block, setting a mark, and then searching for something at the end of the block. I typically type three characters to make sure that I get a unique chunk of text (and I’ll type more if it’s text where I don’t think three characters will cut it). This makes the selection task somewhat slower than the replacement task because the replacement task used single characters and this task used multiple characters.</p> <p>The mouse is so much better suited for selecting a block of text that even with an unfamiliar mouse setup where I end up having to make a correction instead of being able to do the selection in one motion, the mouse is over twice as fast. But, if I wanted to select something that was off screen and the selection was so large that it wouldn’t fit on one screen, the keyboard time wouldn’t change and the mouse time would get much slower, making the keyboard faster.</p> <p>In addition to doing the measurements, I also (informally) polled people to ask if they thought the keyboard or the mouse would be faster for specific tasks. Both search-and-replace and select-text are tasks where the result was obvious to most people. But not all tasks are obvious; scrolling was one where people didn’t have strong opinions one way or another. Let’s look at scrolling, which is a task both the keyboard and the mouse are well suited for. To have something concrete, let’s look at scrolling down 4 pages:</p> <ul> <li>Keyboard: 0.49s</li> <li>Mouse: 0.57s</li> </ul> <p>There’s some difference, and I suspect that if I repeated the experiment enough times I could get a statistically significant result, but the difference is small enough that it isn’t of practical significance.</p> <p>Contra Tog’s result, which was that everyone believes the keyboard was faster even though the mouse is faster, I find that people are pretty good at estimating which device is faster for which tasks and also at estimating when both devices will give a similar result. One possible reason is that I’m polling programmers, and in particular, programmers at <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, who are probably a different population than whoever Tog might’ve studied in his studies.
He was in a group that was looking at how to design the UI for a general purpose computer in the 80s, when it actually would have been unreasonable to focus on studying people like the ones I polled, many of whom grew up using computers and then chose a career where they use computers all day. The equivalent population would’ve had to start using computers in the 60s or even earlier, but even if they had, input devices were quite different (the ball mouse wasn’t invented until 1972, and it certainly wasn’t in wide use the moment it was invented). There’s nothing wrong with studying populations who aren’t relatively expert at using computer input devices, but there is something wrong with generalizing those results to people who are relatively expert.</p> <p>Unlike claims by either keyboard or mouse advocates, when I do experiments myself, the results are mixed. Some tasks are substantially faster if I use the keyboard and some are substantially faster if I use the mouse. Moreover, most of the results are easily predictable (when the results are similar, the prediction is that it’s hard to say which is faster). If we look at the most widely cited, authoritative, results on the web, we find that they make very strong claims that the mouse is much faster than the keyboard but back up the claim with nothing but a single, bogus, experiment. It’s possible that some of the vaunted $50M in R&amp;D went into valid experiments, but those experiments, if they exist, aren’t cited.</p> <p>I spent some time reviewing the literature on the subject, but couldn’t find anything conclusive. Rather than do a point-by-point summary of each study (<a href="empirical-pl/">like I did here for another controversial topic</a>), I’ll mention the high-level issues that make the studies irrelevant to me. All studies I could find had at least one of the issues listed below; if you have a link to a study that isn’t irrelevant for one of the following reasons, I’d love to hear about it!</p> <ol> <li>Age of study: it’s unclear how a study on interacting with computers from the mid-80s transfers to how people interact with computers today. Even ignoring differences in editing programs, there are large differences in the interface. Mice are more precise and a decent modern optical mouse can be moved as fast as a human can move it without the tracking becoming erratic, something that isn’t true of any mouse I’ve tried from the 80s and was only true of high quality mice from the 90s when the balls were recently cleaned and the mouse was on a decent quality mousepad. Keyboards haven’t improved as much, but even so, I can type substantially faster on a modern, low-travel, keyboard than on any keyboard I’ve tried from the 80s.</li> <li>Narrow microbenchmarking: not all of these are as irrelevant as the <code>e -&gt; |</code> without search and replace task, but even in the case of tasks that aren’t obviously irrelevant, it’s not clear what the impact of the result is on actual work I might do.</li> <li>Not keyboard vs. mouse: a tiny fraction of published studies are on keyboard vs. mouse interaction.
When a study is on device interaction, it’s often about some new kind of device or a new interaction model.</li> <li>Vague description: a lot of studies will say something like they found a 7.8% improvement, with results being significant with p &lt; 0.005, without providing enough information to tell if the results are actually significant or merely statistically significant (recall that the practically insignificant scrolling result was a 0.08s difference, which could also be reported as a 16.3% improvement).</li> <li>Unskilled users: in one, typical, paper, they note that it can take users as long as two seconds to move the mouse from one side of the screen to a scrollbar on the other side of the screen. While there’s something to be said for doing studies on unskilled users in order to figure out what sorts of interfaces are easiest for users who have the hardest time, a study on users who take 2 seconds to get their mouse onto the scrollbar doesn’t appear to be relevant to my user experience. When I timed this for myself, it took 0.21s to get to the scrollbar from the other side of the screen and scroll a short distance, despite using an unfamiliar mouse with different sensitivity than I’m used to and running a recording program which made mousing more difficult than usual.</li> <li>Seemingly unreasonable results: some studies claim to show large improvements in overall productivity when switching from one type of device to another (e.g., a 20% total productivity gain from switching types of mice).</li> </ol> <h4 id="conclusion">Conclusion</h4> <p>It’s entirely possible that the mysterious studies Tog’s org spent $50M on prove that the mouse is faster than the keyboard for all tasks other than raw text input, but there doesn’t appear to be enough information to tell what the actual studies were. There are many public studies on user input, but I couldn’t find any that are relevant to whether or not I should use the mouse more or less at the margin.</p> <p>When I look at various tasks myself, the results are mixed, and they’re mixed in the way that most programmers I polled predicted. This result is so boring that it would barely be worth mentioning if not for the large groups of people who believe that either the keyboard is always faster than the mouse or vice versa.</p> <p>Please let me know if there are relevant studies on this topic that I should read! I’m not familiar with the relevant fields, so it’s possible that I’m searching with the wrong keywords and reading the wrong papers.</p> <h4 id="appendix-note-to-self">Appendix: note to self</h4> <p>I didn't realize that scrolling was so fast relative to searching (not explicitly mentioned in the blog post, but 1/2 of the text selection task). I tend to use search to scroll to things that are offscreen, but it appears that I should consider scrolling instead when I don't want to drop my cursor in a specific position.</p> <p><small>Thanks to Leah Hanson, Quentin Pradet, Alex Wilson, and Gaxun for comments/corrections on this post and to Annie Cherkaev, Chris Ball, Stefan Lesser, and David Isaac Lee for related discussion.</small></p> Startup options v. cash startup-options/ Wed, 07 Jun 2017 00:00:00 +0000 startup-options/ <p>I often talk to startups that claim that their compensation package has a higher expected value than the equivalent package at a place like Facebook, Google, Twitter, or Snapchat.
One thing I don’t understand about this claim is, if the claim is true, why shouldn’t the startup go to an investor, sell the options for what they claim the options are worth, and then pay me in cash? The non-obvious value of options combined with their volatility is a barrier to recruiting.</p> <p>Additionally, given my risk function and the risk function of VCs, this appears to be a better deal for everyone. Like most people, I get diminishing utility from extra income, but VCs arguably have nearly linear utility in income. Moreover, even if VCs shared my risk function, because VCs hold a diversified portfolio of investments, the same options would be worth more to them than they are to me because they can diversify away downside risk much more effectively than I can. If these startups are making a true claim about the value of their options, there should be a trade here that makes all parties better off.</p> <p>In a classic series of essays written a decade ago, seemingly aimed at convincing people to either found or join startups, Paul Graham stated &quot;If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years&quot; and &quot;Risk and reward are always proportionate.&quot; This risk-reward assertion is used to back the claim that people can make more money, in expectation, by joining startups and taking risky equity packages than they can by taking jobs that pay cash or cash plus public equity. However, the premise — that risk and reward are <em>always</em> proportionate — isn’t true in the general case. It's basic finance 101 that only assets whose risk cannot be diversified away carry a risk premium (on average). Since VCs can and do diversify risk away, there’s no reason to believe that an individual employee who “invests” in startup options by working at a startup is getting a deal because of the risk involved. And by the way, when you look at historical returns, <a href="https://web-beta.archive.org/web/20121118012653/http://www.kauffman.org/uploadedfiles/vc-enemy-is-us-report.pdf">VC funds don’t appear to outperform other investment classes</a> even though they get to buy a kind of startup equity that has less downside risk than the options you get as a normal employee.</p> <p>So how come startups can’t or won’t take on more investment and pay their employees in cash? Let’s start by looking at some cynical reasons, followed by some less cynical reasons.</p> <h3 id="cynical-reasons">Cynical reasons</h3> <p>One possible answer, perhaps the simplest possible answer, is that options aren’t worth what startups claim they’re worth and startups prefer options because their lack of value is less obvious than it would be with cash. A simplistic argument that this might be the case is that, if you look at the amount investors pay for a fraction of an early-stage or mid-stage startup and look at the extra cash the company would have been able to raise if they gave their employee option pool to investors, it usually isn’t enough to pay employees competitive compensation packages. Given that VCs don’t, on average, have outsized returns, this seems to imply that employee options aren’t worth as much as startups often claim.
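</p> <p>To make that concrete, here's a rough back-of-the-envelope sketch in Python. All of the numbers (valuation, pool size, headcount, vesting period) are hypothetical and chosen only to illustrate the shape of the argument, not taken from any particular company:</p> <pre><code># If a startup sold its employee option pool to investors at the same price
# investors pay, how much cash per employee-year would that buy?
# All numbers below are hypothetical.

post_money_valuation = 10_000_000   # a typical seed-stage post-money valuation
option_pool_fraction = 0.10         # 10% of the company reserved for employees
early_employees = 8                 # people the pool is meant to cover
vesting_years = 4

# Cash the company could raise by selling the pool at the investors' price.
# In practice it would be less, since employees get common, not preferred.
pool_value = post_money_valuation * option_pool_fraction
cash_per_employee_year = pool_value / (early_employees * vesting_years)

print(f"pool is 'worth' ${pool_value:,.0f}")
print(f"which is ${cash_per_employee_year:,.0f} per employee per year")
# prints $1,000,000 and $31,250 per employee per year, which is nowhere near
# enough to close the cash gap with a large-company compensation package.
</code></pre> <p>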
Compensation is much cheaper if you can convince people to take an arbitrary number of lottery tickets in a lottery of unknown value instead of cash.</p> <p>Some common ways that employee options are misrepresented are:</p> <h4 id="strike-price-as-value">Strike price as value</h4> <p>A company that gives you 1M options with a strike price of $10 might claim that those are “worth” $10M. However, if the share price stays at $10 for the lifetime of the option, the options will end up being worth $0 because an option with a $10 strike price is an option to buy the stock at $10, which is not the same as a grant of actual shares worth $10 apiece.</p> <h4 id="public-valuation-as-value">Public valuation as value</h4> <p>Let’s say a company raised $300M by selling 30% of the company, giving the company an implied valuation of $1B. The most common misrepresentation I see is that the company will claim that because they’re giving an option for, say, 0.1% of the company, your option is worth $1B * 0.001 = $1M. A related, common, misrepresentation is that the company raised money last year and has increased in value since then, e.g., the company has since doubled in value, so your option is worth $2M. Even if you assume the strike price was $0 and go with the last valuation at which the company raised money, the implied value of your option isn’t $1M because investors buy a different class of stock than you get as an employee.</p> <p>There are a lot of differences between the preferred stock that VCs get and the common stock that employees get; let’s look at a couple of concrete scenarios.</p> <p>Let’s say those investors that paid $300M for 30% of the company have a straight (1x) liquidation preference, and the company sells for $500M. The 1x liquidation preference means that the investors will get 1x of their investment back before lowly common stock holders get anything, so the investors will get $300M for their 30% of the company. The other 70% of equity will split $200M: your 0.1% common stock option with a $0 strike price is worth $285k (instead of the $500k you might expect it to be worth if you multiply $500M by 0.001).</p> <p>The preferred stock VCs get usually has <em>at least</em> a 1x liquidation preference. Let’s say the investors had a 2x liquidation preference in the above scenario. They would get 2x their investment back before the common stockholders split the rest of the company. Since 2 * $300M is greater than $500M, the investors would get everything and the remaining equity holders would get $0.</p> <p>Another difference between your common stock and preferred stock is that preferred stock sometimes comes with an anti-dilution clause, which you have no chance of getting as a normal engineering hire. Let’s look at <a href="https://news.ycombinator.com/item?id=11200296">an actual example of dilution at a real company</a>. Mayhar got 0.4% of a company when it was valued at $5M. By the time the company was worth $1B, Mayhar’s share of the company was diluted by 8x, which made his share of the company worth less than $500k (minus the cost of exercising his options) instead of $4M (minus the cost of exercising his options).</p> <p>This story has a few additional complications which illustrate other reasons options are often worth less than they seem.
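</p> <p>Before getting to those complications, here's a minimal sketch of the preferred-vs-common arithmetic from the scenarios above. It assumes a single round of non-participating preferred stock and ignores option pools, strike prices, and later rounds, so it's only meant to show the mechanics, not to value any real grant:</p> <pre><code># Payout to a common-stock holder when there is one round of preferred stock
# with a non-participating liquidation preference. Simplified on purpose:
# no option pool, no multiple rounds, strike price treated as $0.

def common_payout(sale_price, invested, investor_fraction,
                  preference_multiple, your_fraction):
    """What a holder of `your_fraction` of the company (in common stock)
    receives when the company sells for `sale_price`."""
    preference = preference_multiple * invested
    # Non-participating preferred: investors take the larger of their
    # liquidation preference or their pro-rata share (they would convert
    # to common if that were worth more), capped at the sale price.
    investor_take = min(sale_price, max(preference, investor_fraction * sale_price))
    remaining = sale_price - investor_take
    # Common holders split whatever is left, pro rata.
    return remaining * your_fraction / (1 - investor_fraction)

# $300M invested for 30% with a 1x preference, $500M sale, 0.1% in common:
print(common_payout(500e6, 300e6, 0.30, 1, 0.001))  # about 285,714, not 500,000

# Same deal with a 2x preference: the preference exceeds the sale price,
# so common gets nothing.
print(common_payout(500e6, 300e6, 0.30, 2, 0.001))  # 0.0
</code></pre> <p>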
Mayhar couldn’t afford to exercise his options (by paying the strike price times the number of shares he had an option for) when he joined, which is common for people who take startup jobs out of college who don’t come from wealthy families. When he left four years later, he could afford to pay the cost of exercising the options, but due to a quirk of U.S. tax law, he either couldn’t afford the tax bill or didn’t want to pay that cost for what was still a lottery ticket — when you exercise your options, you’re effectively taxed on the difference between the fair market value at the time of exercise and the strike price. Even if the company has a successful IPO for 10x as much in a few years, you’re still liable for the tax bill the year you exercise (and if the company stays private indefinitely or fails, you get nothing but a future tax deduction). Because, like most options, Mayhar’s options had a 90-day exercise window, he didn’t get anything from his options.</p> <p>While that’s more than the average amount of dilution, there are much worse cases, for example, cases where investors and senior management basically get to keep their equity and everyone else gets <a href="http://avc.com/2010/11/employee-equity-how-much/#comment-100654148">diluted to the point where their equity is worthless</a>.</p> <p>Those are just a few of the many ways in which the differences between preferred and common stock can cause the value of options to be wildly different from a value naively calculated from a public valuation. I often see both companies and employees use public preferred stock valuations as a benchmark in order to precisely value common stock options, but this isn’t possible, even in principle, without access to a company’s cap table (which shows how much of the company different investors own) as well as access to the specific details of each investment. Even if you can get that (which you usually can’t), determining the appropriate numbers to plug into a model that will give you the expected value is non-trivial because it requires answering questions like “what’s the probability that, in an acquisition, upper management will collude with investors to keep everything and leave the employees with nothing?”</p> <h4 id="black-scholes-valuation-as-value">Black-Scholes valuation as value</h4> <p>Because of the issues listed above, people will sometimes try to use a model to estimate the value of options. Black-Scholes is commonly used because it’s well known and has an easy-to-use closed-form solution. Unfortunately, most of the major assumptions for Black-Scholes are false for startup options, making the relationship between the Black-Scholes output and the actual value of your options non-obvious.</p> <h4 id="options-are-often-free-to-the-company">Options are often free to the company</h4> <p>A large fraction of options get returned to the employee option pool when employees leave, either voluntarily or involuntarily. I haven’t been able to find comprehensive numbers on this, but anecdotally, I hear that more than 50% of options end up getting taken back from employees and returned to the general pool. Dan McKinley points out <a href="https://medium.com/eshares-blog/broken-cap-tables-bbf84574a76a">an (unvetted) analysis that shows that only 5% of employee grants are exercised</a>. Even with a conservative estimate, a 50% discount on options granted sounds pretty good.
A 20x discount sounds amazing, and would explain why companies like options so much.</p> <h4 id="present-value-of-a-future-sum-of-money">Present value of a future sum of money</h4> <p>When someone says that a startup’s compensation package is worth as much as Facebook’s, they often mean that the total value paid out over N years is similar. But a fixed nominal amount of money is worth more the sooner you get it because you can (at a minimum) invest it in a low-risk asset, like Treasury bonds, and get some return on the money.</p> <p>That’s an abstract argument you’ll hear in an econ 101 class, but in practice, if you live somewhere with a relatively high cost of living, like SF or NYC, there’s an even greater value to getting paid sooner rather than later because it lets you live in a relatively nice place (however you define nice) without having to cram into a space with more roommates than would be considered reasonable elsewhere in the U.S. Many startups from the last two generations seem to be putting off their IPOs; for folks in those companies with contracts that prevent them from selling options on a secondary market, that could easily mean that the majority of their potential wealth is locked up for the first decade of their working life. Even if the startup’s compensation package is worth more when adjusting for inflation and interest, it’s not clear if that’s a great choice for most people who aren’t already moderately well off.</p> <h3 id="non-cynical-reasons">Non-cynical reasons</h3> <p>We’ve looked at some cynical reasons companies might want to offer options instead of cash, namely that they can claim that their options are worth more than they’re actually worth. Now, let’s look at some non-cynical reasons companies might want to give out stock options.</p> <p>From an employee standpoint, one non-cynical reason might have been stock option backdating, at least until that loophole was mostly closed. Up until the early 2000s, many companies backdated the dates of option grants. Let’s <a href="http://www.law.harvard.edu/faculty/jfried/option_backdating_and_its_implications.pdf">look at this example</a>, explained by Jesse M. Fried:</p> <blockquote> <p>Options covering 1.2 million shares were given to Reyes. The reported grant date was October 1, 2001, when the firm's stock was trading at around $13 per share, the lowest closing price for the year. A week later, the stock was trading at $20 per share, and a month later the stock closed at almost $26 per share.</p> <p>Brocade disclosed this grant to investors in its 2002 proxy statement in a table titled &quot;Option Grants in the Last Fiscal Year,&quot; prepared in the format specified by SEC rules. Among other things, the table describes the details of this and other grants to executives, including the number of shares covered by the option grants, the exercise price, and the options' expiration date. The information in this table is used by analysts, including those assembling Standard &amp; Poor's well-known ExecuComp database, to calculate the Black Scholes value for each option grant on the date of grant. In calculating the value, the analysts assumed, based on the firm's representations about its procedure for setting exercise prices, that the options were granted at-the-money. The calculated value was then widely used by shareholders, researchers, and the media to estimate the CEO's total pay.
The Black Scholes value calculated for Reyes' 1.2 million stock option grant, which analysts assumed was at-the-money, was $13.2 million.</p> <p>However, the SEC has concluded that the option grant to Reyes was backdated, and the market price on the actual date of grant may have been around $26 per share. Let us assume that the stock was in fact trading at $26 per share when the options were actually granted. Thus, if Brocade had adhered to its policy of giving only at-the-money options, it should have given Reyes options with a strike price of $26 per share. Instead, it gave Reyes options with a strike price of $13 per share, so that the options were $13 in the money. And it reported the grant as if it had given Reyes at-the-money options when the stock price was $13 per share.</p> <p>Had Brocade given Reyes at-the-money options at a strike price of $26 per share, the Black Scholes value of the option grant would have been approximately $26 million. But because the options were $13 in the money, they were even more valuable. According to one estimate, they were worth $28 million. Thus, if analysts had been told that Reyes received options with a strike price of $13 when the stock was trading for $26, they would have reported their value as $28 million rather than $13.2 million. In short, backdating this particular option grant, in the scenario just described, would have enabled Brocade to give Reyes $2 million more in options (Black Scholes value) while reporting an amount that was $15 million less.</p> </blockquote> <p>While stock option backdating isn’t (easily) possible anymore, there might be other loopholes or consequences of tax law that make options a better deal than cash. I could only think of one reason off the top of my head, so I spent a couple weeks asking folks (including multiple founders) for their non-cynical reasons why startups might prefer options to an equivalent amount of cash.</p> <h4 id="tax-benefit-of-isos">Tax benefit of ISOs</h4> <p>In the U.S., <a href="https://en.wikipedia.org/wiki/Incentive_stock_option">Incentive stock options</a> (ISOs) have the property that, if held for one year after the exercise date and two years after the grant date, the owner of the option pays long-term capital gains tax instead of ordinary income tax on the difference between the sale price and the strike price. In general, capital gains are taxed at a lower rate than ordinary income.</p> <p>This isn’t quite as good as it sounds because the difference between the fair market value at exercise and the strike price is subject to the Alternative Minimum Tax (AMT). I don’t find this personally relevant since I prefer to sell employer stock as quickly as possible in order to be as diversified as possible, but if you’re interested in figuring out how the AMT affects your tax bill when you exercise ISOs, see <a href="https://www.nceo.org/articles/stock-options-alternative-minimum-tax-amt">this explanation</a> for more details. For people in California, the state also taxes capital gains as ordinary income, which makes this difference smaller than you might expect from looking at capital gains vs. ordinary income tax rates.</p> <h4 id="tax-benefit-of-qsbs">Tax benefit of QSBS</h4> <p>There’s a certain class of stock that is exempt from <a href="https://blog.wealthfront.com/qualified-small-business-stock-2016/">federal capital gains tax</a> and state tax in many states (though not in CA).
This is interesting, but it seems like people rarely take advantage of this when eligible, and many startups aren’t eligible.</p> <h4 id="tax-benefit-of-other-options">Tax benefit of other options</h4> <p><a href="https://www.irs.gov/taxtopics/tc427.html">The IRS says</a>:</p> <blockquote> <p>Most nonstatutory options don't have a readily determinable fair market value. For nonstatutory options without a readily determinable fair market value, there's no taxable event when the option is granted but you must include in income the fair market value of the stock received on exercise, less the amount paid, when you exercise the option. You have taxable income or deductible loss when you sell the stock you received by exercising the option. You generally treat this amount as a capital gain or loss.</p> </blockquote> <h4 id="valuations-are-bogus">Valuations are bogus</h4> <p>One quirk of stock options is that, to qualify as ISOs, the strike price must be at least the fair market value. That’s easy to determine for public companies, but the fair market value of a share in a private company is somewhat arbitrary. For ISOs, my reading of the requirement is that companies must make “<a href="https://www.law.cornell.edu/uscode/text/26/422">an attempt, made in good faith</a>” to determine the fair market value. For other types of options, there’s <a href="http://avc.com/2010/11/employee-equity-the-option-strike-price/">other regulation which determines the definition of fair market value</a>. Either way, startups usually go to an outside firm between 1 and N times a year to get an estimate of the fair market value for their common stock. This results in at least two possible gaps between a hypothetical “real” valuation and the fair market value for options purposes.</p> <p>First, the valuation is updated relatively infrequently. A common pitch I’ve heard is that the company hasn’t had its valuation updated for ages, and the company is worth twice as much now, so you’re basically getting a 2x discount.</p> <p>Second, the firms doing the valuations are poorly incentivized to produce “correct” valuations. The firms are paid by startups, which gain something when the legal valuation is as low as possible.</p> <p>I don’t really believe that these things make options amazing, because I hear these exact things from startups and founders, which means that their offers take these into account and are priced accordingly. However, if there’s a large gap between the legal valuation and the “true” valuation and this allows companies to effectively give out higher compensation, the way stock option backdating did, I could see how this would tilt companies towards favoring options.</p> <h4 id="control">Control</h4> <p>Even if employees got the same class of stock that VCs get, founders would retain less control if they transferred the equity from employees to VCs because employee-owned equity is spread between a relatively large number of people.</p> <h4 id="retention">Retention</h4> <p>This answer was commonly given to me as a non-cynical reason. The idea is that, if you offer employees options and have a clause that prevents them from selling options on a secondary market, many employees won’t be able to leave without walking away from the majority of their compensation. Personally, this strikes me as a cynical reason, but that’s not how everyone sees it.
For example, Andreessen Horowitz managing partner <a href="http://www.benkuhn.net/clawback">Scott Kupor recently proposed a scheme under which employees would lose their options under all circumstances if they leave before a liquidity event</a>, supposedly in order to help employees.</p> <p>Whether or not you view employers being able to lock in employees for indeterminate lengths of time as good or bad, options lock-in appears to be a poor retention mechanism — companies that pay cash seem to have better retention. Just for example, Netflix pays salaries that are comparable to the total compensation in the senior band at places like Google and, anecdotally, they seem to have less attrition than trendy Bay Area startups. In fact, even though Netflix makes a lot of noise about showing people the door if they’re not a good fit, they don’t appear to have a higher involuntary attrition rate than trendy Bay Area startups — they just seem more honest about it, something which they can do because their recruiting pitch doesn’t involve you walking away with below-market compensation if you leave. If you think this comparison is unfair because Netflix hasn’t been a startup in recent memory, you can compare to finance startups, e.g. Headlands, which was founded in the same era as Uber, Airbnb, and Stripe. They (and some other finance startups) pay out hefty sums of cash and this does not appear to result in higher attrition than similarly aged startups which give out illiquid option grants.</p> <p>In the cases where this results in the employee staying longer than they otherwise would, options lock-in is often a bad deal for all parties involved. The situation is obviously bad for employees and, on average, companies don’t want unhappy people who are just waiting for a vesting cliff or liquidity event.</p> <h4 id="incentive-alignment">Incentive alignment</h4> <p>Another commonly stated reason is that, if you give people options, they’ll work harder because they’ll do well when the company does well. This was the reason that was given most vehemently (“you shouldn’t trust someone who’s only interested in a paycheck”, etc.).</p> <p>However, as far as I can tell, paying people in options almost totally decouples job performance and compensation. If you look at companies that have made a lot of people rich, like Microsoft, Google, Apple, and Facebook, almost none of the employees who became rich had an instrumental role in the company’s success. Google and Microsoft each made thousands of people rich, but the vast majority of those folks just happened to be in the right place at the right time and could have just as easily taken a different job where they didn't get rich. Conversely, the vast majority of startup option packages end up being worth little to nothing, but nearly none of the employees whose options end up being worthless were instrumental in causing their options to become worthless.</p> <p>If options are a large fraction of compensation, choosing a company that’s going to be successful is much more important than working hard. For reference, <a href="http://www.nytimes.com/1992/06/28/business/microsoft-s-unlikely-millionaires.html?pagewanted=all">Microsoft is estimated to have created roughly <code>10^3</code> millionaires by 1992</a> (adjusted for inflation, a 1992 millionaire would have about $1.75M today). The stock then went up by more than 20x. Microsoft was legendary for making people who didn't particularly do much rich; all told, it's been estimated that they made <code>10^4</code> people rich by the late 90s.
The vast majority of those people were no different from people in similar roles at Microsoft's competitors. They just happened to pick a winning lottery ticket. This is the opposite of what founders claim they get out of giving options. As above, companies that pay cash, like Netflix, don’t seem to have a problem with employee productivity.</p> <p>By the way, a large fraction of the people who were made rich by working at Microsoft joined after their IPO, which was in 1986. The same is true of Google, and while Facebook is too young for us to have a good idea what the long-term post-IPO story is, the folks who joined a year or two after the IPO (5 years ago, in 2012) have done quite well for themselves. People who joined pre-IPO have done better, but as mentioned above, most people have diminishing returns to individual wealth. The same power-law-like distribution that makes VC work also means that it's entirely plausible that Microsoft alone made more post-IPO people rich from 1986-1999 than all pre-IPO tech companies combined during that period. Something similar is plausibly true for Google from 2004 until FB's IPO in 2012, even including the people who got rich from FB's IPO as people who were made rich by a pre-IPO company, and you can do a similar calculation for Apple.</p> <h4 id="vc-firms-vs-the-market">VC firms vs. the market</h4> <p>There are several potential counter-arguments to the statement that VC returns (and therefore startup equity) don’t beat the market.</p> <p>One argument is, when people say that, they typically mean that after VCs take their fees, returns to VC funds don’t beat the market. As an employee who gets startup options, you don’t (directly) pay VC fees, which means you can beat the market by keeping the VC fees for yourself.</p> <p>Another argument is that some investors (like YC) seem to consistently do pretty well. If you join a startup that’s funded by savvy investors, you too can do pretty well. For this to make sense, you have to realize that the company is worth more than “expected” while the company doesn’t realize the same thing, because you need the company to give you an option package without properly accounting for its value. For you to have that expectation and get a good deal, this requires the founders to not only not be overconfident in the company’s probability of success, but actually requires that the founders are underconfident. While this isn’t impossible, the majority of startup offers I hear about have the opposite problem.</p> <h3 id="investing">Investing</h3> <p>This section is an update written in 2020. This post was originally written when I didn't realize that it was possible for people who aren't extremely wealthy to invest in startups. But once I moved to SF, I found that it's actually very easy to invest in startups and that you don't have to be particularly wealthy (for a programmer) to do so — <a href="https://www.freshpaint.io/blog/anatomy-of-a-seed-round-during-covid-19">people will often take small checks (as small as $5k or sometimes even less) in seed rounds</a>. If you can invest directly in a seed round, this is a strictly better deal than joining as an early employee.</p> <p>As of this writing, it's quite common for companies to raise a seed round at a $10M valuation. This means you'd have to invest $100k to get 1%, or about as much equity as you'd expect to get as a very early employee.
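</p> <p>Here's a rough sketch of that comparison in Python. The $10M valuation and 1% grant come from the paragraph above; the four-year vesting schedule is the standard one mentioned below, and everything else (dilution, share class, taxes) is ignored even though it further favors the investor:</p> <pre><code># Buying startup equity directly in a seed round vs. earning it as an early
# employee. Numbers are illustrative, not from any particular company.

seed_post_money = 10_000_000    # $10M post-money seed round
early_employee_grant = 0.01     # roughly a very-early-employee grant
vesting_years = 4               # standard vesting schedule

cost_of_equivalent_stake = seed_post_money * early_employee_grant
cost_per_employee_year = cost_of_equivalent_stake / vesting_years

print(f"${cost_of_equivalent_stake:,.0f} buys the whole grant as an investor")
print(f"${cost_per_employee_year:,.0f} buys one year's worth of early-employee equity")
# prints $100,000 and $25,000. The investor also gets preferred stock, no cliff,
# no exercise window, and can spread the money across several companies.
</code></pre> <p>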
However, if you were to join the company, your equity would vest over four years, you'd get a worse class of equity, and you'd (typically) get much less information about the share structure of the company. As an investor, you only need to invest $25k to get 1 year's worth of early employee equity. Moreover, you can invest in multiple companies, which gives you better risk-adjusted returns. At rates big companies are paying today (mid-band of perhaps $380k/yr for senior engineer, $600k/yr for staff engineer), working at a big company and spending $25k/yr investing in startups is strictly superior to working at a startup from the standpoint of financial return.</p> <h3 id="conclusion">Conclusion</h3> <p>There are a number of factors that can make options more or less valuable than they seem. From an employee standpoint, the factors that make options more valuable than they seem can cause equity to be worth tens of percent more than a naive calculation. The factors that make options less valuable than they seem do so in ways that mostly aren’t easy to quantify.</p> <p>Whether the factors that make options relatively more valuable dominate or the factors that make options relatively less valuable dominate is an empirical question. My intuition is that the factors that make options relatively less valuable are stronger, but that’s just a guess. A way to get an idea about this from public data would be to go through successful startup S-1 filings. Since this post is already ~5k words, I’ll leave that for another post, but I’ll note that in my preliminary skim of a handful of 99%-ile exits (&gt; $1B), the median employee seems to do worse than someone who’s on the standard Facebook/Google/Amazon career trajectory.</p> <p>From a company standpoint, there are a couple factors that allow companies to retain more leverage/control by giving relatively more options to employees and relatively less equity to investors.</p> <p>All of this sounds fine for founders and investors, but I don’t see what’s in it for employees. If you have additional reasons that I’m missing, I’d love to hear them.</p> <p>If you liked this post, you may also like <a href="//danluu.com/startup-tradeoffs/">this other post on the tradeoff between working at a big company and working at a startup</a>.</p> <h3 id="appendix-caveats">Appendix: caveats</h3> <p>Many startups don’t claim that their offers are financially competitive. As time goes on, I hear less “If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years.” and more “we’re not financially competitive with Facebook, but ... ”. I’ve heard from multiple founders that joining as an early employee is an incredibly bad deal when you compare early-employee equity and workload vs. founder equity and workload.</p> <p>Some startups are giving out offers that are actually competitive with large company offers. Something I’ve seen from startups that are trying to give out compelling offers is that, for “senior” folks, they’re willing to pay substantially higher salaries than public companies because it’s understood that options aren’t great for employees because of their timeline, risk profile, and expected value.</p> <p>There’s a huge amount of variation in offers, much of which is effectively random.
I know of cases where an individual got a more lucrative offer from a startup (that doesn’t tend to give particularly strong offers) than from Google, and if you ask around you’ll hear about a lot of cases like that. It’s not always true that startup offers are lower than Google/Facebook/Amazon offers, even at startups that don’t pay competitively (on average).</p> <p>Anything in this post that’s related to taxes is U.S. specific. For example, I’m told that in Canada, “you can defer the payment of taxes when exercising options whose strike price is way below fair market valuation until disposition, as long as the company is Canadian-controlled and operated in Canada”.</p> <p>You might object that the same line of reasoning we looked at for options can be applied to RSUs, even RSUs for public companies. That’s true; although the largest downsides of startup options are mitigated or non-existent with public-company RSUs, cash still has significant advantages to employees over RSUs. Unfortunately, the only non-finance company I know of that uses this to their advantage in recruiting is Netflix; please let me know if you can think of other tech companies that use the same compensation model.</p> <p>Some startups have a sliding scale that lets you choose different amounts of option/salary compensation. I haven't seen an offer that will let you put the slider to 100% cash and 0% options (or 100% options and 0% cash), but someone out there will probably be willing to give you an all-cash offer.</p> <p>In the current environment, looking at public exits may bias the data towards less successful companies. The most successful startups from the last couple of generations that haven't exited by acquisition have so far chosen not to IPO. It's possible that, once all the data are in, the average returns to joining a startup will look quite different (although I doubt the median return will change much).</p> <p>BTW, I don't have anything against taking a startup offer, even if it's low. When I graduated from college, I took the lowest offer I had, and my partner recently took the lowest offer she got (nearly a 2x difference over the highest offer). There are plenty of reasons you might want to take an offer that isn't the best possible financial offer. However, I think you should know what you're getting into and not take an offer that you think is financially great when it's merely mediocre or even bad.</p> <h3 id="appendix-non-counterarguments">Appendix: non-counterarguments</h3> <p>The most common objection I’ve heard to this is that most startups don’t have enough money to pay equivalent cash and couldn’t raise that much money by selling off what would “normally” be their employee option pool. Maybe so, but that’s not a counter-argument — it’s an argument that most startups don’t have options that are valuable enough to be exchanged for the equivalent sum of money, i.e., that the options simply aren’t as valuable as claimed.
This argument can be phrased in a variety of ways (e.g., paying salary instead of options increases burn rate, reduces runway, makes the startup default dead, etc.), but arguments of this form are fundamentally equivalent to admitting that startup options aren’t worth much because these arguments wouldn't hold up if the options were worth enough that a typical compensation package was worth as much as a typical &quot;senior&quot; offer at Google or Facebook.</p> <p>If you don't buy this, imagine a startup with a typical valuation that's at a stage where they're giving out 0.1% equity in options to new hires. Now imagine that some irrational bystander is willing to make a deal where they take 0.1% of the company for $1B. Is it worth it to take the money and pay people out of the $1B cash pool instead of paying people with 0.1% slices of the option pool? Your answer should be yes, unless you believe that the ratio between the value of cash on hand and equity is nearly infinite. Absolute statements like &quot;options are preferred to cash because paying cash increases burn rate, making the startup default dead&quot; at any valuation are equivalent to stating that the correct ratio is infinity. That's clearly nonsensical; there's some correct ratio, and we might disagree over what the correct ratio is, but for typical startups it should not be the case that the correct ratio is infinite. Since this was such a common objection, if you have this objection, my question to you is, why don't you argue that startups should pay even less cash and even more options? Is the argument that the current ratio is exactly optimal, and if so, why? Also, why does the ratio vary so much between different companies at the same stage which have raised roughly the same amount of money? Are all of those companies giving out optimal deals?</p> <p>The second most common objection is that startup options are actually worth a lot, if you pick the right startup and use a proper model to value the options. Perhaps, but if that’s true, why couldn’t they have raised a bit more money by giving away more equity to VCs at its true value, and then paid cash?</p> <p>Another common objection is something like &quot;I know lots of people who've made $1m from startups&quot;. Me too, but I also know lots of people who've made much more than that working at public companies. This post is about the relative value of compensation packages, not the absolute value.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Leah Hanson, Ben Kuhn, Tim Abbott, David Turner, Nick Bergson-Shilcock, Peter Fraenkel, Joe Ardent, Chris Ball, Anton Dubrau, Sean Talts, Danielle Sucher, Dan McKinley, Bert Muthalaly, Dan Puttick, Indradhanush Gupta, and Gaxun for comments and corrections.</p> How web bloat impacts users with slow connections web-bloat/ Wed, 08 Feb 2017 00:00:00 +0000 web-bloat/ <p>A couple years ago, I took a road trip from Wisconsin to Washington and mostly stayed in rural hotels on the way. I expected the internet in rural areas too sparse to have cable internet to be slow, but I was still surprised that a large fraction of the web was inaccessible. Some blogs with lightweight styling were readable, as were pages by academics who hadn’t updated the styling on their website since 1995. But very few commercial websites were usable (other than Google). When I measured my connection, I found that the bandwidth was roughly comparable to what I got with a 56k modem in the 90s.
The latency and packet loss were significantly worse than the average day on dialup: latency varied between 500ms and 1000ms and packet loss varied between 1% and 10%. Those numbers are comparable to what I’d see on dialup on a bad day.</p> <p>Despite my connection being only a bit worse than it was in the 90s, the vast majority of the web wouldn’t load. Why shouldn’t the web work with dialup or a dialup-like connection? It would be one thing if I tried to watch youtube and read pinterest. It’s hard to serve videos and images without bandwidth. But my online interests are quite boring from a media standpoint. Pretty much everything I consume online is plain text, even if it happens to be styled with images and fancy javascript. In fact, I recently tried using w3m (a terminal-based web browser that, by default, doesn’t support css, javascript, or even images) for a week and it turns out there are only two websites I regularly visit that don’t really work in w3m (twitter and zulip, both fundamentally text-based sites, at least as I use them)<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>.</p> <p>More recently, I was reminded of how poorly the web works for people on slow connections when I tried to read a joelonsoftware post while using a flaky mobile connection. The HTML loaded but either one of the five CSS requests or one of the thirteen javascript requests timed out, leaving me with a broken page. Instead of seeing the article, I saw <a href="https://twitter.com/danluu/status/823286780560437248">three entire pages of sidebar, menu, and ads</a> before getting to the title because the page required some kind of layout modification to display reasonably. Pages are often designed so that they're hard or impossible to read if some dependency fails to load. On a slow connection, it's quite common for at least one dependency to fail. After refreshing the page twice, the page loaded as it was supposed to and I was able to read the blog post, a fairly compelling post on eliminating dependencies.</p> <p>Complaining that people don’t care about performance like they used to and that we’re letting bloat slow things down for no good reason is “old man yells at cloud” territory; I probably sound like that dude who complains that his word processor, which used to take 1MB of RAM, takes 1GB of RAM. Sure, that could be trimmed down, but there’s a real cost to spending time doing optimization and even a $300 laptop comes with 2GB of RAM, so why bother? But it’s not quite the same situation -- it’s not just nerds like me who care about web performance. When Microsoft looked at actual measured connection speeds, they found that <a href="https://blogs.microsoft.com/on-the-issues/2018/12/03/the-rural-broadband-divide-an-urgent-national-problem-that-we-can-solve/">half of Americans don't have broadband speed</a>. Heck, AOL had 2 million dial-up subscribers in 2015, just AOL alone. Outside of the U.S., there are even more people with slow connections. I recently chatted with Ben Kuhn, who spends a fair amount of time in Africa, about his internet connection:</p> <blockquote> <p>I've seen ping latencies as bad as ~45 sec and packet loss as bad as 50% on a mobile hotspot in the evenings from Jijiga, Ethiopia. (I'm here now and currently I have 150ms ping with no packet loss but it's 10am). There are some periods of the day where it ~never gets better than 10 sec and ~10% loss.
The internet has gotten a lot better in the past ~year; it used to be that bad all the time except in the early mornings.</p> <p>…</p> <p>Speedtest.net reports 2.6 mbps download, 0.6 mbps upload. I realized I probably shouldn't run a speed test on my mobile data because bandwidth is really expensive.</p> <p>Our server in Ethiopia is has a fiber uplink, but it frequently goes down and we fall back to a 16kbps satellite connection, though I think normal people would just stop using the Internet in that case.</p> </blockquote> <p>If you think <a href="https://1-minute-modem.branchable.com/">browsing on a 56k connection</a> is bad, try a 16k connection from Ethiopia!</p> <p>Everything we’ve seen so far is anecdotal. Let’s load some websites that programmers might frequent with a variety of simulated connections to get data on page load times. <a href="https://www.webpagetest.org/">webpagetest</a> lets us see how long it takes a web site to load (and why it takes that long) from locations all over the world. It even lets us simulate different kinds of connections as well as load sites on a variety of mobile devices. The times listed in the table below are the time until the page is “visually complete”; as measured by webpagetest, that’s the time until the above-the-fold content stops changing.</p> <style type="text/css" > #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03 .row_heading, .blank { display: none;; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col1 { background-color: #00441b; color: white; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col3 { background-color: #41ae76; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col4 { background-color: #66c2a4; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col5 { background-color: #66c2a4; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col6 { background-color: white; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col7 { background-color: #fff7ec; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col8 { background-color: white; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col9 { background-color: #fdd49e; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col10 { background-color: #fdd49e; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col1 { background-color: #00441b; color: white; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col3 { background-color: #006d2c; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col4 { background-color: #006d2c; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col5 { background-color: #41ae76; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col6 { background-color: #ccece6; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col7 { background-color: #fff7ec; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col8 { background-color: white; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col9 { background-color: #fee8c8; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col10 { background-color: #fdd49e; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col1 { background-color: #00441b; color: white; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col3 { background-color: #238b45; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col4 { background-color: #41ae76; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col5 { background-color: #99d8c9; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col6 { background-color: 
</style>
style="overflow-x:auto;"> <table id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" None cellspacing="0" align="center"> <thead> <tr> <th>URL <th>Size <th>C <th colspan=8> Load time in seconds </tr> <tr> <th class="blank level0" > <th class="col_heading level0 col0" colspan=1> <th class="col_heading level0 col1" colspan=1> MB <th class="col_heading level0 col2" colspan=1> <th class="col_heading level0 col3" colspan=1> FIOS <th class="col_heading level0 col4" colspan=1> Cable <th class="col_heading level0 col5" colspan=1> LTE <th class="col_heading level0 col6" colspan=1> 3G <th class="col_heading level0 col7" colspan=1> 2G <th class="col_heading level0 col8" colspan=1> Dial <th class="col_heading level0 col9" colspan=1> Bad <th class="col_heading level0 col10" colspan=1> 😱 </tr> </thead> <tbody> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row0" rowspan=1> 0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col0" class="data row0 col0" > http://bellard.org <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col1" class="data row0 col1" > 0.01 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col2" class="data row0 col2" > 5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col3" class="data row0 col3" > 0.40 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col4" class="data row0 col4" > 0.59 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col5" class="data row0 col5" > 0.60 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col6" class="data row0 col6" > 1.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col7" class="data row0 col7" > 2.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col8" class="data row0 col8" > 1.8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col9" class="data row0 col9" > 9.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col10" class="data row0 col10" > 7.6 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row1" rowspan=1> 1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col0" class="data row1 col0" > http://danluu.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col1" class="data row1 col1" > 0.02 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col2" class="data row1 col2" > 2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col3" class="data row1 col3" > 0.20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col4" class="data row1 col4" > 0.20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col5" class="data row1 col5" > 0.40 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col6" class="data row1 col6" > 0.80 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col7" class="data row1 col7" > 2.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col8" class="data row1 col8" > 1.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col9" class="data row1 col9" > 6.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col10" class="data row1 col10" > 7.6 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row2" rowspan=1> 2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col0" class="data row2 col0" > news.ycombinator.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col1" class="data row2 col1" > 0.03 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col2" class="data row2 col2" > 1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col3" class="data row2 col3" > 0.30 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col4" class="data row2 col4" > 0.49 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col5" class="data 
row2 col5" > 0.69 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col6" class="data row2 col6" > 1.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col7" class="data row2 col7" > 5.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col8" class="data row2 col8" > 5.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col9" class="data row2 col9" > 14 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col10" class="data row2 col10" > 27 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row3" rowspan=1> 3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col0" class="data row3 col0" > danluu.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col1" class="data row3 col1" > 0.03 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col2" class="data row3 col2" > 2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col3" class="data row3 col3" > 0.20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col4" class="data row3 col4" > 0.40 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col5" class="data row3 col5" > 0.49 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col6" class="data row3 col6" > 1.1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col7" class="data row3 col7" > 3.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col8" class="data row3 col8" > 3.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col9" class="data row3 col9" > 9.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col10" class="data row3 col10" > 15 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row4" rowspan=1> 4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col0" class="data row4 col0" > http://jvns.ca <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col1" class="data row4 col1" > 0.14 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col2" class="data row4 col2" > 7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col3" class="data row4 col3" > 0.49 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col4" class="data row4 col4" > 0.69 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col5" class="data row4 col5" > 1.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col6" class="data row4 col6" > 2.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col7" class="data row4 col7" > 10 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col8" class="data row4 col8" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col9" class="data row4 col9" > 29 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col10" class="data row4 col10" > 108 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row5" rowspan=1> 5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col0" class="data row5 col0" > jvns.ca <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col1" class="data row5 col1" > 0.15 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col2" class="data row5 col2" > 4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col3" class="data row5 col3" > 0.50 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col4" class="data row5 col4" > 0.80 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col5" class="data row5 col5" > 1.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col6" class="data row5 col6" > 3.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col7" class="data row5 col7" > 11 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col8" class="data row5 col8" > 21 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col9" class="data row5 col9" > 31 <td 
id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col10" class="data row5 col10" > 97 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row6" rowspan=1> 6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col0" class="data row6 col0" > fgiesen.wordpress.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col1" class="data row6 col1" > 0.37 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col2" class="data row6 col2" > 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col3" class="data row6 col3" > 1.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col4" class="data row6 col4" > 1.1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col5" class="data row6 col5" > 1.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col6" class="data row6 col6" > 5.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col7" class="data row6 col7" > 16 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col8" class="data row6 col8" > 66 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col9" class="data row6 col9" > 68 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col10" class="data row6 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row7" rowspan=1> 7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col0" class="data row7 col0" > google.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col1" class="data row7 col1" > 0.59 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col2" class="data row7 col2" > 6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col3" class="data row7 col3" > 0.80 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col4" class="data row7 col4" > 1.8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col5" class="data row7 col5" > 1.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col6" class="data row7 col6" > 6.8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col7" class="data row7 col7" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col8" class="data row7 col8" > 94 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col9" class="data row7 col9" > 96 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col10" class="data row7 col10" > 236 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row8" rowspan=1> 8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col0" class="data row8 col0" > joelonsoftware.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col1" class="data row8 col1" > 0.72 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col2" class="data row8 col2" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col3" class="data row8 col3" > 1.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col4" class="data row8 col4" > 1.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col5" class="data row8 col5" > 1.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col6" class="data row8 col6" > 9.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col7" class="data row8 col7" > 28 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col8" class="data row8 col8" > 140 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col9" class="data row8 col9" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col10" class="data row8 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row9" rowspan=1> 9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col0" class="data row9 col0" > bing.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col1" class="data row9 col1" > 
1.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col2" class="data row9 col2" > 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col3" class="data row9 col3" > 1.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col4" class="data row9 col4" > 2.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col5" class="data row9 col5" > 3.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col6" class="data row9 col6" > 11 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col7" class="data row9 col7" > 43 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col8" class="data row9 col8" > 134 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col9" class="data row9 col9" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col10" class="data row9 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row10" rowspan=1> 10 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col0" class="data row10 col0" > reddit.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col1" class="data row10 col1" > 1.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col2" class="data row10 col2" > 26 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col3" class="data row10 col3" > 7.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col4" class="data row10 col4" > 6.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col5" class="data row10 col5" > 7.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col6" class="data row10 col6" > 20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col7" class="data row10 col7" > 58 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col8" class="data row10 col8" > 179 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col9" class="data row10 col9" > 210 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col10" class="data row10 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row11" rowspan=1> 11 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col0" class="data row11 col0" > signalvnoise.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col1" class="data row11 col1" > 2.1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col2" class="data row11 col2" > 7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col3" class="data row11 col3" > 2.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col4" class="data row11 col4" > 3.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col5" class="data row11 col5" > 3.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col6" class="data row11 col6" > 16 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col7" class="data row11 col7" > 47 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col8" class="data row11 col8" > 173 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col9" class="data row11 col9" > 218 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col10" class="data row11 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row12" rowspan=1> 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col0" class="data row12 col0" > amazon.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col1" class="data row12 col1" > 4.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col2" class="data row12 col2" > 47 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col3" class="data row12 col3" > 6.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col4" class="data row12 col4" > 13 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col5" 
class="data row12 col5" > 8.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col6" class="data row12 col6" > 36 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col7" class="data row12 col7" > 65 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col8" class="data row12 col8" > 265 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col9" class="data row12 col9" > 300 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col10" class="data row12 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row13" rowspan=1> 13 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col0" class="data row13 col0" > steve-yegge.blogspot.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col1" class="data row13 col1" > 9.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col2" class="data row13 col2" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col3" class="data row13 col3" > 2.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col4" class="data row13 col4" > 3.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col5" class="data row13 col5" > 3.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col6" class="data row13 col6" > 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col7" class="data row13 col7" > 36 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col8" class="data row13 col8" > 206 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col9" class="data row13 col9" > 188 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col10" class="data row13 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row14" rowspan=1> 14 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col0" class="data row14 col0" > blog.codinghorror.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col1" class="data row14 col1" > 23 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col2" class="data row14 col2" > 24 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col3" class="data row14 col3" > 6.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col4" class="data row14 col4" > 15 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col5" class="data row14 col5" > 9.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col6" class="data row14 col6" > 83 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col7" class="data row14 col7" > 235 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col8" class="data row14 col8" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col9" class="data row14 col9" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col10" class="data row14 col10" > FAIL </tr> </tbody> </table> </div> <p>Each row is a website. For sites that support both plain HTTP as well as HTTPS, both were tested; URLs are HTTPS except where explicitly specified as HTTP. The first two columns show the amount of data transferred over the wire in MB (which includes headers, handshaking, compression, etc.) and the number of TCP connections made. The rest of the columns show the time in seconds to load the page on a variety of connections from fiber (FIOS) to less good connections. “Bad” has the bandwidth of dialup, but with 1000ms ping and 10% packetloss, which is roughly what I saw when using the internet in small rural hotels. “😱” simulates a 16kbps satellite connection from Jijiga, Ethiopia. Rows are sorted by the measured amount of data transferred.</p> <p>The timeout for tests was 6 minutes; anything slower than that is listed as FAIL. 
Pages that failed to load are also listed as FAIL. A few things that jump out from the table are:</p> <ol> <li>A large fraction of the web is unusable on a bad connection. Even on a good (0% packetloss, no ping spike) dialup connection, some sites won’t load.</li> <li>Some sites will use a lot of data!</li> </ol> <h4 id="the-web-on-bad-connections">The web on bad connections</h4> <p>As commercial websites go, Google is basically as good as it gets for people on a slow connection. On dialup, the 50%-ile page load time is a minute and a half. But at least it loads -- when I was on a slow, shared, satellite connection in rural Montana, virtually no commercial websites would load at all. I could view websites that only had static content via Google cache, but the live site had no hope of loading.</p> <h4 id="some-sites-will-use-a-lot-of-data">Some sites will use a lot of data</h4> <p>Although only two really big sites were tested here, there are plenty of sites that will use 10MB or 20MB of data. If you’re reading this from the U.S., maybe you don’t care, but if you’re browsing from Mauritania, Madagascar, or Vanuatu, loading codinghorror once will cost you <a href="https://web.archive.org/web/20190220011242if_/https://whatdoesmysitecost.com/test/170206_0G_2QG#gniCost">more than 10% of the daily per capita GNI</a>.</p> <h4 id="page-weight-matters">Page weight matters</h4> <p>Despite the best efforts of <a href="http://idlewords.com/talks/website_obesity.htm">Maciej</a>, the meme that page weight doesn’t matter keeps getting spread around. AFAICT, the top HN link of all time on web page optimization is to an article titled “Ludicrously Fast Page Loads - A Guide for Full-Stack Devs”. At the bottom of the page, the author links to another one of his posts, titled “Page Weight Doesn’t Matter”.</p> <blockquote> <p>Usually, the boogeyman that gets pointed at is bandwidth: users in low-bandwidth areas (3G, developing world) are getting shafted. But the math doesn’t quite work out. Akamai puts the global connection speed average at 3.9 megabits per second.</p> </blockquote> <p>The “ludicrously fast” guide fails to display properly on dialup or slow mobile connections because the images time out. On reddit, <a href="https://www.reddit.com/r/web_design/comments/3oppvo/ludicrously_fast_page_loads_a_guide_for_fullstack/">it also fails under load</a>: &quot;Ironically, that page took so long to load that I closed the window.&quot;, &quot;a lot of … gifs that do nothing but make your viewing experience worse&quot;, &quot;I didn't even make it to the gifs; the header loaded then it just hung.&quot;, etc.</p> <p>The flaw in the “page weight doesn’t matter because average speed is fast” argument is that if you average the connection of someone in my apartment building (which is wired for 1Gbps internet) and someone on 56k dialup, you get an average speed of 500 Mbps. That doesn’t mean the person on dialup is actually going to be able to load a 5MB website. The average speed of 3.9 Mbps comes from a 2014 Akamai report, but it’s just an average. If you look at Akamai’s 2016 report, you can find entire countries where more than 90% of IP addresses are slower than that!</p>
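<p>To make the arithmetic concrete, here’s a rough back-of-the-envelope sketch using the round numbers above (a 1Gbps connection, 56k dialup, and a 5MB page); it isn’t part of the measurements, it just spells out why the average is meaningless for the person on the slow end:</p>
<pre><code># Back-of-the-envelope numbers for the average-speed argument.
fast_kbps = 1_000_000   # 1 Gbps apartment connection, in kbps
dialup_kbps = 56        # 56k dialup, in kbps

average_kbps = (fast_kbps + dialup_kbps) / 2
print(average_kbps)     # ~500,000 kbps, i.e. the average of the two users is ~500 Mbps

# The dialup user still has to pull a 5MB page through 56 kbps:
page_bits = 5 * 1000 * 1000 * 8
seconds = page_bits / (dialup_kbps * 1000)
print(seconds / 60)     # roughly 12 minutes, ignoring overhead and packet loss
</code></pre>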
<p>Yes, there are a lot of factors besides page weight that matter, and yes, it's possible to create a contrived page that's very small but loads slowly, as well as a huge page that loads ok because all of the weight isn't blocking, but total page weight is still pretty decently correlated with load time.</p> <p>Since its publication, the &quot;ludicrously fast&quot; guide was updated with some javascript that only loads images if you scroll down far enough. That makes it look a lot better on webpagetest if you're looking at the page size number (if webpagetest isn't being scripted to scroll), but it's a worse user experience for people on slow connections who want to read the page. If you're going to read the entire page anyway, the weight increases, and you can no longer preload images by loading the site. Instead, if you're reading, you have to stop for a few minutes at every section to wait for the images from that section to load. And that's if you're lucky and the javascript for loading images didn't fail to load.</p> <h4 id="the-average-user-fallacy">The average user fallacy</h4> <p>Just like many people develop with an average connection speed in mind, many people have a fixed view of who a user is. Maybe they think there are customers with a lot of money with fast connections and customers who won't spend money on slow connections. That is, very roughly speaking, perhaps true on average, but sites don't operate on average, they operate in particular domains. Jamie Brandon writes the following about his experience with Airbnb:</p> <blockquote> <p>I spent three hours last night trying to book a room on airbnb through an overloaded wifi and presumably a satellite connection. OAuth seems to be particularly bad over poor connections. Facebook's OAuth wouldn't load at all and Google's sent me round a 'pick an account' -&gt; 'please reenter you password' -&gt; 'pick an account' loop several times. It took so many attempts to log in that I triggered some 2fa nonsense on airbnb that also didn't work (the confirmation link from the email led to a page that said 'please log in to view this page') and eventually I was just told to send an email to account.disabled@airbnb.com, who haven't replied.</p> <p>It's particularly galling that airbnb doesn't test this stuff, because traveling is pretty much the whole point of the site so they can't even claim that there's no money in servicing people with poor connections.</p> </blockquote> <h4 id="what-about-tail-latency">What about tail latency?</h4> <p>My original plan for this post was to show 50%-ile, 90%-ile, 99%-ile, etc., tail load times. But the 50%-ile results are so bad that I don’t know if there’s any point to showing the other results.
If you were to look at the 90%-ile results, you’d see that most pages fail to load on dialup and the “Bad” and “😱” connections are hopeless for almost all sites.</p> <h4 id="http-vs-https">HTTP vs HTTPS</h4> <div style="overflow-x:auto;"> <table cellspacing="0" align="center">
<thead>
<tr> <th> <th>URL <th>Size <th>C <th colspan=8> Load time in seconds </tr>
<tr> <th> <th> <th>kB <th> <th>FIOS <th>Cable <th>LTE <th>3G <th>2G <th>Dial <th>Bad <th>😱 </tr>
</thead>
<tbody>
<tr> <th>1 <td>http://danluu.com <td>21.1 <td>2 <td>0.20 <td>0.20 <td>0.40 <td>0.80 <td>2.7 <td>1.6 <td>6.4 <td>7.6 </tr>
<tr> <th>3 <td>https://danluu.com <td>29.3 <td>2 <td>0.20 <td>0.40 <td>0.49 <td>1.1 <td>3.6 <td>3.5 <td>9.3 <td>15 </tr>
</tbody>
</table> </div> <p>You can see that for a very small site that doesn’t load many blocking resources, HTTPS is noticeably slower than HTTP, especially on slow connections. Practically speaking, this doesn’t matter today because virtually no sites are that small, but if you design a web site as if people with slow connections actually matter, this is noticeable.</p>
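<p>Most of that gap is extra round trips before the first byte arrives. As a rough sketch (not one of the measurements above, and assuming a full TLS 1.2 handshake with no session resumption), the handshake cost scales with round-trip time, which is exactly what the slow connections have the most of:</p>
<pre><code># Rough time-to-first-byte estimate in seconds, ignoring server think time.
# Assumes 1 RTT for the TCP handshake, 1 RTT for the request/response, and
# roughly 2 extra RTTs for a full TLS 1.2 handshake without session resumption.
def first_byte_seconds(rtt_ms, tls=False):
    round_trips = 2 + (2 if tls else 0)
    return round_trips * rtt_ms / 1000.0

for rtt_ms in (20, 300, 1000):   # FIOS-ish, 2G-ish, and Bad-ish pings
    print(rtt_ms, first_byte_seconds(rtt_ms), first_byte_seconds(rtt_ms, tls=True))
</code></pre>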
<h3 id="how-to-make-pages-usable-on-slow-connections">How to make pages usable on slow connections</h3> <p>The long version is: to really understand what’s going on, consider reading <a href="https://hpbn.co/">High Performance Browser Networking</a>, a great book on web performance that’s available for free.</p> <p>The short version is that most sites are so poorly optimized that someone who has no idea what they’re doing can get a 10x improvement in page load times for a site whose job is to serve up text with the occasional image. When I started this blog in 2013, I used Octopress because Jekyll/Octopress was the most widely recommended static site generator back then. A plain blog post with one or two images took 11s to load on a cable connection because the Octopress defaults included multiple useless javascript files in the header (for never-used-by-me things like embedding flash videos and delicious integration), which blocked page rendering. Just moving those javascript includes to the footer halved page load time, and <a href="//danluu.com/octopress-speedup/">making a few other tweaks decreased page load time by another order of magnitude</a>. At the time I made those changes, I knew nothing about web page optimization, other than what I heard during a 2-minute blurb on optimization from a 40-minute talk on how the internet works, and I was able to get a 20x speedup on my blog in a few hours. You might argue that I’ve now gone too far and removed too much CSS, but I got a 20x speedup for people on fast connections before making changes that affected the site’s appearance (and the speedup on slow connections was much larger).</p> <p>That’s normal. Popular themes for many different kinds of blogging software and CMSs contain anti-optimizations so blatant that any programmer, even someone with no front-end experience, can find large gains by just pointing <a href="https://www.webpagetest.org/">webpagetest</a> at their site and looking at the output.</p> <h3 id="what-about-browsers">What about browsers?</h3> <p>While it's easy to blame page authors because there's a lot of low-hanging fruit on the page side, there's just as much low-hanging fruit on the browser side. Why does my browser open up 6 TCP connections to try to download six images at once when I'm on a slow satellite connection? That just guarantees that all six images will time out! Even if I tweak the timeout on the client side, servers that are configured to protect against DoS attacks won't allow long-lived connections that aren't doing anything. I can sometimes get some images to load by refreshing the page a few times (and waiting ten minutes each time), but why shouldn't the browser handle retries for me? If you think about it for a few minutes, there are a lot of optimizations that browsers could do for people on slow connections, but because they don't, the best current solution for users appears to be: use w3m when you can, and then switch to a browser with ad-blocking when that doesn't work.
But why should users have to use two entirely different programs, one of which has a text-based interface only computer nerds will find palatable?</p> <h3 id="conclusion">Conclusion</h3> <p>When I was at Google, someone told me a story about a time that “they” completed a big optimization push only to find that measured page load times increased. When they dug into the data, they found that the reason load times had increased was that they got a lot more traffic from Africa after doing the optimizations. The team’s product went from being unusable for people with slow connections to usable, which caused so many users with slow connections to start using the product that load times actually increased.</p> <p>Last night, at a presentation on the websockets protocol, Gary Bernhardt made the observation that the people who designed the websockets protocol did things like using a variable length field for frame length to save a few bytes. By contrast, if you look at the Alexa top 100 sites, almost all of them have a huge amount of slop in them; it’s plausible that the total bandwidth used for those 100 sites is probably greater than the total bandwidth for all websockets connections combined. Despite that, if we just look at the three top 35 sites tested in this post, two send uncompressed javascript over the wire, two redirect the bare domain to the www subdomain, and two send a lot of extraneous information by not compressing images as much as they could be compressed without sacrificing quality. If you look at twitter, which isn’t in our table but was mentioned above, they actually do <a href="https://twitter.com/danluu/status/705815510479302656">an anti-optimization where, if you upload a PNG which isn’t even particularly well optimized, they’ll re-encode it as a jpeg which is larger and has visible artifacts</a>!</p> <p>“Use bcrypt” has become the mantra for a reasonable default if you’re not sure what to do when storing passwords. The web would be a nicer place if “use webpagetest” caught on in the same way. It’s not always the best tool for the job, but it sure beats the current defaults.</p> <h3 id="appendix-experimental-caveats">Appendix: experimental caveats</h3> <p>The above tests were done by repeatedly loading pages via a private webpagetest image in AWS west 2, on a c4.xlarge VM, with simulated connections on a first page load in Chrome with no other tabs open and nothing running on the VM other than the webpagetest software and the browser. This is unrealistic in many ways.</p> <p>In relative terms, this disadvantages sites that have a large edge presence. When I was in rural Montana, I ran some tests and found that I had noticeably better latency to Google than to basically any other site. This is not reflected in the test results. Furthermore, this setup means that pages are nearly certain to be served from a CDN cache. That shouldn't make any difference for sites like Google and Amazon, but it reduces the page load time of less-trafficked sites that aren't &quot;always&quot; served out of cache. For example, when I don't have a post trending on social media, between 55% and 75% of traffic is served out of a CDN cache, and when I do have something trending on social media, it's more like 90% to 99%. 
But the test setup means that the CDN cache hit rate during the test is likely to be &gt; 99% for my site and other blogs which aren't so widely read that they'd normally always have a cached copy available.</p> <p>All tests were run assuming a first page load, but it’s entirely reasonable for sites like Google and Amazon to assume that many or most of their assets are cached. Testing first page load times is perhaps reasonable for sites with a traffic profile like mine, where much of the traffic comes from social media referrals of people who’ve never visited the site before.</p> <p>A c4.xlarge is a fairly powerful machine. Today, most page loads come from mobile and even the fastest mobile devices aren’t as fast as a c4.xlarge; most mobile devices are much slower than the fastest mobile devices. Most desktop page loads will also be from a machine that’s slower than a c4.xlarge. Although the results aren’t shown, I also ran a set of tests using a t2.micro instance: for simple sites, like mine, the difference was negligible, but for complex sites, like Amazon, page load times were as much as 2x worse. As you might expect, for any particular site, the difference got smaller as the connection got slower.</p> <p>As Joey Hess pointed out, many dialup providers attempt to do compression or other tricks to reduce the effective weight of pages, and none of these tests take that into account.</p> <p>Firefox, IE, and Edge often have substantially different performance characteristics from Chrome. For that matter, different versions of Chrome can have different performance characteristics. I just used Chrome because it’s the most widely used desktop browser, and running this set of tests took over a full day of VM time with a single browser.</p> <p>The simulated bad connections add a constant latency and fixed (10%) packetloss. In reality, poor connections have highly variable latency with peaks that are much higher than the simulated latency and periods of much higher packetloss that can last for minutes, hours, or days. Putting 😱 at the rightmost side of the table may make it seem like the worst possible connection, but packetloss can get much worse.</p> <p>Similarly, while codinghorror happens to be at the bottom of the table, it's nowhere near being the slowest loading page. Just for example, I originally considered including slashdot in the table but it was so slow that it caused a significant increase in total test run time because it timed out at six minutes so many times. Even on FIOS it takes 15s to load by making a whopping 223 requests over 100 TCP connections despite weighing in at &quot;only&quot; 1.9MB. Amazingly, slashdot also pegs the CPU at 100% for 17 entire seconds while loading on FIOS. In retrospect, this might have been a good site to include because it's pathologically mis-optimized sites like slashdot that allow the &quot;page weight doesn't matter&quot; meme to sound reasonable.</p> <p>The websites compared don't do the same thing. Just looking at the blogs, some blogs put entire blog entries on the front page, which is more convenient in some ways, but also slower. Commercial sites are even more different -- they often can't reasonably be static sites and have to have relatively large javascript payloads in order to work well.</p> <h3 id="appendix-irony">Appendix: irony</h3> <p>The main table in this post is almost 50kB of HTML (without compression or minification); that’s larger than everything else in this post combined. That table is curiously large because I used a library (pandas) to generate the table instead of just writing a script to do it by hand, and as we know, the default settings for most libraries generate a massive amount of bloat. It didn’t even save time because every single built-in time-saving feature that I wanted to use was buggy, which forced me to write all of the heatmap/gradient/styling code myself anyway! Due to laziness, I left the pandas table-generating scaffolding code, resulting in a table that looks like it’s roughly an order of magnitude larger than it needs to be.</p> <p>This isn't a criticism of pandas. Pandas is probably quite good at what it's designed for; it's just not designed to produce slim websites. The CSS class names are huge, which is reasonable if you want to avoid accidental name collisions for generated CSS. Almost every <code>td</code>, <code>th</code>, and <code>tr</code> element is tagged with a redundant <code>rowspan=1</code> or <code>colspan=1</code>, which is reasonable for generated code if you don't care about size. Each cell has its own CSS class, even though many cells share styling with other cells; again, this probably simplified things on the code generation side. Every piece of bloat is totally reasonable. And unfortunately, there's no tool that I know of that will take a bloated table and turn it into a slim table. A pure HTML minifier can't change the class names because it doesn't know that some external CSS or JS doesn't depend on the class name. An HTML minifier could theoretically determine that different cells have the same styling and merge them, except for the aforementioned problem with potential but non-existent external dependencies, but that's beyond the capability of the tools I know of.</p>
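<p>For a sense of scale, a hand-rolled generator along these lines (a generic sketch, not the script actually used for this post) emits nothing per cell beyond the tag and the value:</p>
<pre><code># Minimal sketch of emitting a slim table by hand: no per-cell ids or classes,
# no rowspan=1/colspan=1 noise, and one shared style for the whole table.
rows = [
    ('http://danluu.com', 0.02, 2),
    ('google.com', 0.59, 6),
]

def render(rows):
    parts = ['&lt;table&gt;', '&lt;tr&gt;&lt;th&gt;URL&lt;th&gt;MB&lt;th&gt;C&lt;/tr&gt;']
    for url, size_mb, connections in rows:
        parts.append('&lt;tr&gt;&lt;td&gt;{}&lt;td&gt;{}&lt;td&gt;{}&lt;/tr&gt;'.format(url, size_mb, connections))
    parts.append('&lt;/table&gt;')
    return '\n'.join(parts)

print(render(rows))
</code></pre>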
<p>For another level of irony, consider that while I think of a 50kB table as bloat, this page is 12kB when gzipped, even with all of the bloat. Google's AMP currently has &gt; 100kB of blocking javascript that has to load before the page loads! There's no reason for me to use AMP pages because AMP is slower than my current setup of pure HTML with a few lines of embedded CSS and the occasional image, but, as a result, I'm penalized by Google (relative to AMP pages) for not &quot;accelerating&quot; (decelerating) my page with AMP.</p> <p><small> Thanks to Leah Hanson, Jason Owen, Ethan Willis, and Lindsey Kuper for comments/corrections </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M">excluding internal Microsoft stuff that’s required for work. Many of the sites are IE only and don’t even work in Edge. I didn’t try those sites in w3m but I doubt they’d work! In fact, I doubt that even half of the non-IE specific internal sites would work in w3m. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> HN: the good parts hn-comments/ Sun, 23 Oct 2016 00:00:00 +0000 hn-comments/ <p><a href="https://twitter.com/danluu/status/1433297087362486272">HN</a> <a href="https://twitter.com/altluu/status/1449285906360266754">comments</a> <a href="https://twitter.com/altluu/status/1452707802921652224">are</a> <a href="https://twitter.com/danluu/status/1584260625920032768">terrible</a>.
<a href="https://twitter.com/altluu/status/1586623057019699200">On</a> <a href="https://twitter.com/danluu/status/1346392037365436418">any</a> <a href="https://twitter.com/danluu/status/1280651255275110400">topic</a> <a href="https://twitter.com/danluu/status/1498343295545581568">I’m</a> <a href="https://twitter.com/danluu/status/1512835522002972680">informed</a> <a href="https://twitter.com/danluu/status/1298477505272147968">about</a>, <a href="https://twitter.com/danluu/status/1277222715456282625">the</a> <a href="https://twitter.com/danluu/status/872676217350172673">vast</a> <a href="https://twitter.com/danluu/status/1484268111687663620">majority</a> <a href="https://twitter.com/danluu/status/1449113998293491716">of</a> <a href="https://twitter.com/danluu/status/1344219344985677824">comments</a> <a href="https://twitter.com/danluu/status/1135308282334023680">are</a> <a href="https://twitter.com/altluu/status/1572989465504915458">pretty</a> <a href="https://twitter.com/danluu/status/1271392053314781184">clearly</a> <a href="https://twitter.com/danluu/status/1311217788720001024">wrong</a>. <a href="https://twitter.com/danluu/status/1449113998293491716">Most</a> of the time, there are zero comments from people who know anything about the topic and the top comment is reasonable sounding but totally incorrect. Additionally, many comments are gratuitously mean. You'll often hear mean comments backed up with something like &quot;this is better than the other possibility, where everyone just pats each other on the back with comments like 'this is great'&quot;, as if being an asshole is some sort of talisman against empty platitudes. I've seen people push back against that; when pressed, people often say that it’s either impossible or inefficient to teach someone without being mean, as if telling someone that they're stupid somehow helps them learn. It's as if people learned how to explain things by watching Simon Cowell and can't comprehend the concept of an explanation that isn't littered with personal insults. Paul Graham has said, &quot;<a href="https://news.ycombinator.com/item?id=5937839">Oh, you should never read Hacker News comments about anything you write</a>”. Most of the negative things you hear about HN comments are true.</p> <p>And yet, I haven’t found a public internet forum with better technical commentary. On topics I'm familiar with, while it's rare that a thread will have even a single comment that's well-informed, when those comments appear, they usually float to the top. On other forums, well-informed comments are either non-existent or get buried by reasonable sounding but totally wrong comments when they appear, and they appear even more rarely than on HN.</p> <p>By volume, there are probably more interesting technical “posts” in comments than in links. Well, that depends on what you find interesting, but that’s true for my interests. If I see a low-level optimization comment from nkurz, a comment on business from patio11, a comment on how companies operate by nostrademons, I almost certainly know that I’m going to read an interesting comment. There are maybe 20 to 30 people I can think of who don’t blog much, but write great comments on HN and I doubt I even know of half the people who are writing great comments on HN<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup>.</p> <p>I compiled a very abbreviated list of comments I like because comments seem to get lost. 
If you write a blog post, people will refer it years later, but comments mostly disappear. I think that’s sad -- there’s a lot of great material on HN (and yes, even more not-so-great material).</p> <h4 id="what-s-the-deal-with-ms-word-s-file-format-https-news-ycombinator-com-item-id-12472505"><a href="https://news.ycombinator.com/item?id=12472505">What’s the deal with MS Word’s file format</a>?</h4> <blockquote> <p>Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn't know any better.</p> <p>Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy workarounds for things that would be simple if you allowed yourself to redesign the file format. It's pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.</p> <p>Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.</p> <p>One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That's because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies of the formatter to do it. Each word processor formats text completely differently. Word, for example famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.</p> <p>Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word -- only the file format doesn't tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.</p> <p>I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like &quot;Format like Word 95&quot;. The format doesn't say what that is -- because it's pretty obvious that the authors of the file format don't know. 
It's lost in a hopeless mess of legacy code and nobody can figure out what it does now.</p> </blockquote> <h4 id="fun-with-null-https-news-ycombinator-com-item-id-12007286"><a href="https://news.ycombinator.com/item?id=12007286">Fun with NULL</a></h4> <blockquote> <p>Here's another example of this fine feature:</p> </blockquote> <pre><code> #include &lt;stdio.h&gt; #include &lt;string.h&gt; #include &lt;stdlib.h&gt; #define LENGTH 128 int main(int argc, char **argv) { char *string = NULL; int length = 0; if (argc &gt; 1) { string = argv[1]; length = strlen(string); if (length &gt;= LENGTH) exit(1); } char buffer[LENGTH]; memcpy(buffer, string, length); buffer[length] = 0; if (string == NULL) { printf(&quot;String is null, so cancel the launch.\n&quot;); } else { printf(&quot;String is not null, so launch the missiles!\n&quot;); } printf(&quot;string: %s\n&quot;, string); // undefined for null but works in practice #if SEGFAULT_ON_NULL printf(&quot;%s\n&quot;, string); // segfaults on null when bare &quot;%s\n&quot; #endif return 0; } nate@skylake:~/src$ clang-3.8 -Wall -O3 null_check.c -o null_check nate@skylake:~/src$ null_check String is null, so cancel the launch. string: (null) nate@skylake:~/src$ icc-17 -Wall -O3 null_check.c -o null_check nate@skylake:~/src$ null_check String is null, so cancel the launch. string: (null) nate@skylake:~/src$ gcc-5 -Wall -O3 null_check.c -o null_check nate@skylake:~/src$ null_check String is not null, so launch the missiles! string: (null) </code></pre> <blockquote> <p>It appear that Intel's ICC and Clang still haven't caught up with GCC's optimizations. Ouch if you were depending on that optimization to get the performance you need! But before picking on GCC too much, consider that all three of those compilers segfault on printf(&quot;string: &quot;); printf(&quot;%s\n&quot;, string) when string is NULL, despite having no problem with printf(&quot;string: %s\n&quot;, string) as a single statement. Can you see why using two separate statements would cause a segfault? If not, see here for a hint: <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609</a></p> </blockquote> <h4 id="how-do-you-make-sure-the-autopilot-backup-is-paying-attention-https-news-ycombinator-com-item-id-11017034"><a href="https://news.ycombinator.com/item?id=11017034">How do you make sure the autopilot backup is paying attention</a>?</h4> <blockquote> <p>Good engineering eliminates users being able to do the wrong thing as much as possible. . . . You don't design a feature that invites misuse and then use instructions to try to prevent that misuse.</p> <p>There was a derailment in Australia called the Waterfall derailment [1]. It occurred because the driver had a heart attack and was responsible for 7 deaths (a miracle it was so low, honestly). The root cause was the failure of the dead-man's switch.</p> <p>In the case of Waterfall, the driver had 2 dead-man switches he could use - 1) the throttle handle had to be held against a spring at a small rotation, or 2) a bar on the floor could be depressed. You had to do 1 of these things, the idea being that you prevent wrist or foot cramping by allowing the driver to alternate between the two. Failure to do either triggers an emergency brake.</p> <p>It turns out that this driver was fat enough that when he had a heart attack, his leg was able to depress the pedal enough to hold the emergency system off. 
Thus, the dead-man's system never triggered with a whole lot of dead man in the driver's seat.</p> <p>I can't quite remember the specifics of the system at Waterfall, but one method to combat this is to require the pedal to be held halfway between released and fully depressed. The idea being that a dead leg would fully depress the pedal so that would trigger a brake, and a fully released pedal would also trigger a brake. I don't know if they had that system but certainly that's one approach used in rail.</p> <p>Either way, the problem is equally possible in cars. If you lose consciousness and your foot goes limp, a heavy enough leg will be able to hold the pedal down a bit depending on where it's positioned relative to the pedal and the leverage it has on the floor.</p> <p>The other major system I'm familiar with for ensuring drivers are alive at the helm is called 'vigilance'. The way it works is that periodically, a light starts flashing on the dash and the driver has to acknowledge that. If they do not, a buzzer alarm starts sounding. If they still don't acknowledge it, the train brakes apply and the driver is assumed incapacitated. Let me tell you some stories of my involvement in it.</p> <p>When we first started, we had a simple vigi system. Every 30 seconds or so (for example), the driver would press a button. Ok cool. Except that then drivers became so hard-wired to pressing the button every 30 seconds that we were having instances of drivers falling asleep/dozing off and still pressing the button right on every 30 seconds because it was so ingrained into them that it was literally a subconscious action.</p> <p>So we introduced random-timing vigilance, where the time varies 30-60 seconds (for example) and you could only acknowledge it within a small period of time once the light started flashing. Again, drivers started falling asleep/semi asleep and would hit it as soon as the alarm buzzed, each and every time.</p> <p>So we introduced random-timing, task-linked vigilance and that finally broke the back of the problem. Now, the driver has to press a button, or turn a knob, or do a number of different activities and they must do that randomly-chosen activity, at a randomly-chosen time, for them to acknowledge their consciousness. It was only at that point that we finally nailed out driver alertness.</p> </blockquote> <p><a href="https://news.ycombinator.com/item?id=12013713">See also</a>.</p> <h4 id="prestige-https-news-ycombinator-com-item-id-11833832"><a href="https://news.ycombinator.com/item?id=11833832">Prestige</a></h4> <blockquote> <p>Curious why he would need to move to a more prestigious position? Most people realize by their 30s that prestige is a sucker's game; it's a way of inducing people to do things that aren't much fun and they wouldn't really want to do on their own, by lauding them with accolades from people they don't really care about.</p> </blockquote> <h4 id="why-is-fedex-based-in-mephis-https-news-ycombinator-com-item-id-9282104"><a href="https://news.ycombinator.com/item?id=9282104">Why is FedEx based in Mephis</a>?</h4> <blockquote> <p>. . . 
we noticed that we also needed:<br> (1) A suitable, existing airport at the hub location.<br> (2) Good weather at the hub location, e.g., relatively little snow, fog, or rain.<br> (3) Access to good ramp space, that is, where to park and service the airplanes and sort the packages.<br> (4) Good labor supply, e.g., for the sort center.<br> (5) Relatively low cost of living to keep down prices.<br> (6) Friendly regulatory environment.<br> (7) Candidate airport not too busy, e.g., don't want arriving planes to have to circle a long time before being able to land.<br> (8) Airport with relatively little in cross winds and with more than one runway to pick from in case of winds.<br> (9) Runway altitude not too high, e.g., not high enough to restrict maximum total gross take off weight, e.g., rule out Denver.<br> (10) No tall obstacles, e.g., mountains, near the ends of the runways.<br> (11) Good supplies of jet fuel.<br> (12) Good access to roads for 18 wheel trucks for exchange of packages between trucks and planes, e.g., so that some parts could be trucked to the hub and stored there and shipped directly via the planes to customers that place orders, say, as late as 11 PM for delivery before 10 AM.<br> So, there were about three candidate locations, Memphis and, as I recall, Cincinnati and Kansas City.<br> The Memphis airport had some old WWII hangers next to the runway that FedEx could use for the sort center, aircraft maintenance, and HQ office space. Deal done -- it was Memphis.<br></p> </blockquote> <h4 id="why-etherpad-joined-wave-and-why-it-didn-t-work-out-as-expected-https-news-ycombinator-com-item-id-2320172"><a href="https://news.ycombinator.com/item?id=2320172">Why etherpad joined Wave, and why it didn’t work out as expected</a></h4> <blockquote> <p>The decision to sell to Google was one of the toughest decisions I and my cofounders ever had to wrestle with in our lives. We were excited by the Wave vision though we saw the flaws in the product. The Wave team told us about how they wanted our help making wave simpler and more like etherpad, and we thought we could help with that, though in the end we were unsuccessful at making wave simpler. We were scared of Google as a competitor: they had more engineers and more money behind this project, yet they were running it much more like an independent startup than a normal big-company department. The Wave office was in Australia and had almost total autonomy. And finally, after 1.5 years of being on the brink of failure with AppJet, it was tempting to be able to declare our endeavor a success and provide a decent return to all our investors who had risked their money on us.</p> <p>In the end, our decision to join Wave did not work out as we had hoped. The biggest lessons learned were that having more engineers and money behind a project can actually be more harmful than helpful, so we were wrong to be scared of Wave as a competitor for this reason. It seems obvious in hindsight, but at the time it wasn't. Second, I totally underestimated how hard it would be to iterate on the Wave codebase. I was used to rewriting major portions of software in a single all-nighter. Because of the software development process Wave was using, it was practically impossible to iterate on the product. I should have done more diligence on their specific software engineering processes, but instead I assumed because they seemed to be operating like a startup, that they would be able to iterate like a startup. 
A lot of the product problems were known to the whole Wave team, but we were crippled by a large complex codebase built on poor technical choices and a cumbersome engineering process that prevented fast iteration.</p> </blockquote> <h4 id="the-accuracy-of-tech-news-https-news-ycombinator-com-item-id-12345365"><a href="https://news.ycombinator.com/item?id=12345365">The accuracy of tech news</a></h4> <blockquote> <p>When I've had inside information about a story that later breaks in the tech press, I'm always shocked at how differently it's perceived by readers of the article vs. how I experienced it. Among startups &amp; major feature launches I've been party to, I've seen: executives that flat-out say that they're not working on a product category when there's been a whole department devoted to it for a year; startups that were founded 1.5 years before the dates listed in Crunchbase/Wikipedia; reporters that count the number of people they meet in a visit and report that as the &quot;team size&quot;, because the company refuses to release that info; funding rounds that never make it to the press; acquisitions that are reported as &quot;for an undisclosed sum&quot; but actually are less than the founders would've made if they'd taken a salaried job at the company; project start dates that are actually when the project was staffed up to its current size and ignore the year or so that a small team spent working on the problem (or the 3-4 years that other small teams spent working on the problem); and algorithms or other technologies that are widely reported as being the core of the company's success, but actually aren't even used by the company.</p> </blockquote> <h4 id="self-destructing-speakers-from-dell-https-news-ycombinator-com-item-id-7205875"><a href="https://news.ycombinator.com/item?id=7205875">Self-destructing speakers from Dell</a></h4> <blockquote> <p>As the main developer of VLC, we have known about this story for a long time, and this is just Dell putting crap components on their machine and blaming others. Any discussion was impossible with them. So let me explain a bit...</p> <p>In this case, VLC just uses the Windows APIs (DirectSound), and sends signed integers of 16bits (s16) to the Windows Kernel.</p> <p>VLC allows amplification of the INPUT above the sound that was decoded. This is just like replay gain, broken codecs, badly recorded files or post-amplification and can lead to saturation.</p> <p>But this is exactly the same if you put your mp3 file through Audacity and increase it and play with WMP, or if you put a DirectShow filter that amplifies the volume after your codec output. For example, for a long time, VLC ac3 and mp3 codecs were too low (-6dB) compared to the reference output.</p> <p>At worst, this will reduce the dynamics and saturate a lot, but this is not going to break your hardware.</p> <p>VLC does not (and cannot) modify the OUTPUT volume to destroy the speakers. VLC is a Software using the OFFICIAL platform APIs.</p> <p>The issue here is that Dell sound cards output power (roughly proportional to the square of the amplitude) that Dell speakers cannot handle. Simply said, the sound card outputs at max 10W, and the speakers only can take 6W in, and neither their BIOS nor their drivers block this.</p> <p>And as VLC is present on a lot of machines, it's simple to blame VLC.
&quot;Correlation does not mean causation&quot; is something that seems too complex for cheap Dell support…</p> </blockquote> <h4 id="learning-on-the-job-startups-vs-big-companies-https-news-ycombinator-com-item-id-8278533"><a href="https://news.ycombinator.com/item?id=8278533">Learning on the job, startups vs. big companies</a></h4> <blockquote> <p>Working for someone else's startup, I learned how to quickly cobble solutions together. I learned about uncertainty and picking a direction regardless of whether you're sure it'll work. I learned that most startups fail, and that when they fail, the people who end up doing well are the ones who were looking out for their own interests all along. I learned a lot of basic technical skills, how to write code quickly and learn new APIs quickly and deploy software to multiple machines. I learned how quickly problems of scaling a development team crop up, and how early you should start investing in automation.</p> <p>Working for Google, I learned how to fix problems once and for all and build that culture into the organization. I learned that even in successful companies, everything is temporary, and that great products are usually built through a lot of hard work by many people rather than great ah-ha insights. I learned how to architect systems for scale, and a lot of practices used for robust, high-availability, frequently-deployed systems. I learned the value of research and of spending a lot of time on a single important problem: many startups take a scattershot approach, trying one weekend hackathon after another and finding nobody wants any of them, while oftentimes there are opportunities that nobody has solved because nobody wants to put in the work. I learned how to work in teams and try to understand what other people want. I learned what problems are really painful for big organizations. I learned how to rigorously research the market and use data to make product decisions, rather than making decisions based on what seems best to one person.</p> </blockquote> <h4 id="we-failed-this-person-what-are-we-going-to-do-differently-https-news-ycombinator-com-item-id-4324615"><a href="https://news.ycombinator.com/item?id=4324615">We failed this person, what are we going to do differently</a>?</h4> <blockquote> <p>Having been in on the company's leadership meetings where departures were noted with a simple 'regret yes/no' flag it was my experience that no single departure had any effect. Mass departures did, trends did, but one person never did, even when that person was a founder.</p> <p>The rationalizations always put the issue back on the departing employee, &quot;They were burned out&quot;, &quot;They had lost their ability to be effective&quot;, &quot;They have moved on&quot;, &quot;They just haven't grown with the company&quot; never was it &quot;We failed this person, what are we going to do differently?&quot;</p> </blockquote> <h4 id="aws-s-origin-story-https-news-ycombinator-com-item-id-3102129"><a href="https://news.ycombinator.com/item?id=3102129">AWS’s origin story</a></h4> <blockquote> <p>Anyway, the SOA effort was in full swing when I was there. It was a pain, and it was a mess because every team did things differently and every API was different and based on different assumptions and written in a different language.</p> <p>But I want to correct the misperception that this lead to AWS. It didn't. S3 was written by its own team, from scratch. 
At the time I was at Amazon, working on the retail site, none of Amazon.com was running on AWS. I know, when AWS was announced, with great fanfare, they said &quot;the services that power Amazon.com can now power your business!&quot; or words to that effect. This was a flat out lie. The only thing they shared was data centers and a standard hardware configuration. Even by the time I left, when AWS was running full steam ahead (and probably running Reddit already), none of Amazon.com was running on AWS, except for a few, small, experimental and relatively new projects. I'm sure more of it has been adopted now, but AWS was always a separate team (and a better managed one, from what I could see.)</p> </blockquote> <h4 id="why-is-windows-so-slow-https-news-ycombinator-com-item-id-3368867"><a href="https://news.ycombinator.com/item?id=3368867">Why is Windows so slow</a>?</h4> <blockquote> <p>I (and others) have put a lot of effort into making the Linux Chrome build fast. Some examples are multiple new implementations of the build system (<a href="http://neugierig.org/software/chromium/notes/2011/02/ninja.h..">http://neugierig.org/software/chromium/notes/2011/02/ninja.h..</a>. ), experimentation with the gold linker (e.g. measuring and adjusting the still off-by-default thread flags <a href="https://groups.google.com/a/chromium.org/group/chromium-dev/..">https://groups.google.com/a/chromium.org/group/chromium-dev/..</a>. ) as well as digging into bugs in it, and other underdocumented things like 'thin' ar archives.</p> <p>But it's also true that people who are more of Windows wizards than I am a Linux apprentice have worked on Chrome's Windows build. If you asked me the original question, I'd say the underlying problem is that on Windows all you have is what Microsoft gives you and you can't typically do better than that. For example, migrating the Chrome build off of Visual Studio would be a large undertaking, large enough that it's rarely considered. (Another way of phrasing this is it's the IDE problem: you get all of the IDE or you get nothing.)</p> <p>When addressing the poor Windows performance people first bought SSDs, something that never even occurred to me (&quot;your system has enough RAM that the kernel cache of the file system should be in memory anyway!&quot;). But for whatever reason on the Linux side some Googlers saw it fit to rewrite the Linux linker to make it twice as fast (this effort predated Chrome), and all Linux developers now get to benefit from that. Perhaps the difference is that when people write awesome tools for Windows or Mac they try to sell them rather than give them away.</p> </blockquote> <h4 id="why-is-windows-so-slow-an-insider-view-http-blog-zorinaq-com-i-contribute-to-the-windows-kernel-we-are-slower-than-other-oper"><a href="http://blog.zorinaq.com/i-contribute-to-the-windows-kernel-we-are-slower-than-other-oper/">Why is Windows so slow, an insider view</a></h4> <blockquote> <p>I'm a developer in Windows and contribute to the NT kernel. (Proof: the SHA1 hash of revision #102 of [Edit: filename redacted] is [Edit: hash redacted].) I'm posting through Tor for obvious reasons.</p> <p>Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening. The cause of the problem is social. There's almost none of the improvement for its own sake, for the sake of glory, that you see in the Linux world.</p> <p>Granted, occasionally one sees naive people try to make things better. These people almost always fail. 
We can and do improve performance for specific scenarios that people with the ability to allocate resources believe impact business goals, but this work is Sisyphean. There's no formal or informal program of systemic performance improvement. We started caring about security because pre-SP3 Windows XP was an existential threat to the business. Our low performance is not an existential threat to the business.</p> <p>See, component owners are generally openly hostile to outside patches: if you're a dev, accepting an outside patch makes your lead angry (due to the need to maintain this patch and to justify in shiproom the unplanned design change), makes test angry (because test is on the hook for making sure the change doesn't break anything, and you just made work for them), and makes PM angry (due to the schedule implications of code churn). There's just no incentive to accept changes from outside your own team. You can always find a reason to say &quot;no&quot;, and you have very little incentive to say &quot;yes&quot;.</p> </blockquote> <h4 id="what-s-the-probability-of-a-successful-exit-by-city-https-news-ycombinator-com-item-id-11987223"><a href="https://news.ycombinator.com/item?id=11987223">What’s the probability of a successful exit by city?</a></h4> <p>See link for giant table :-).</p> <h4 id="the-hiring-crunch-https-news-ycombinator-com-item-id-7260087"><a href="https://news.ycombinator.com/item?id=7260087">The hiring crunch</a></h4> <blockquote> <p>Broken record: startups are also probably rejecting a lot of engineering candidates that would perform as well or better than anyone on their existing team, because tech industry hiring processes are folkloric and irrational.</p> </blockquote> <p>Too long to excerpt. See the link!</p> <h4 id="should-you-leave-a-bad-job-https-news-ycombinator-com-item-id-7789438"><a href="https://news.ycombinator.com/item?id=7789438">Should you leave a bad job?</a></h4> <blockquote> <p>I am a 42-year-old very successful programmer who has been through a lot of situations in my career so far, many of them highly demotivating. And the best advice I have for you is to get out of what you are doing. Really. Even though you state that you are not in a position to do that, you really are. It is okay. You are free. Okay, you are helping your boyfriend's startup but what is the appropriate cost for this? Would he have you do it if he knew it was crushing your soul?</p> <p>I don't use the phrase &quot;crushing your soul&quot; lightly. When it happens slowly, as it does in these cases, it is hard to see the scale of what is happening. But this is a very serious situation and if left unchecked it may damage the potential for you to do good work for the rest of your life.</p> <p>The commenters who are warning about burnout are right. Burnout is a very serious situation. If you burn yourself out hard, it will be difficult to be effective at any future job you go to, even if it is ostensibly a wonderful job. Treat burnout like a physical injury. I burned myself out once and it took at least 12 years to regain full productivity. Don't do it.</p> <ul> <li><p>More broadly, the best and most creative work comes from a root of joy and excitement. If you lose your ability to feel joy and excitement about programming-related things, you'll be unable to do the best work. Note that this issue is separate from and parallel to burnout!
If you are burned out, you might still be able to feel the joy and excitement briefly at the start of a project/idea, but they will fade quickly as the reality of day-to-day work sets in. Alternatively, if you are not burned out but also do not have a sense of wonder, it is likely you will never get yourself started on the good work.</p></li> <li><p>The earlier in your career it is now, the more important this time is for your development. Programmers learn by doing. If you put yourself into an environment where you are constantly challenged and are working at the top threshold of your ability, then after a few years have gone by, your skills will have increased tremendously. It is like going to intensively learn kung fu for a few years, or going into Navy SEAL training or something. But this isn't just a one-time constant increase. The faster you get things done, and the more thorough and error-free they are, the more ideas you can execute on, which means you will learn faster in the future too. Over the long term, programming skill is like compound interest. More now means a LOT more later. Less now means a LOT less later.</p></li> </ul> <p>So if you are putting yourself into a position that is not really challenging, that is a bummer day in and day out, and you get things done slowly, you aren't just having a slow time now. You are bringing down that compound interest curve for the rest of your career. It is a serious problem. If I could go back to my early career I would mercilessly cut out all the shitty jobs I did (and there were many of them).</p> </blockquote> <h4 id="creating-change-when-politically-unpopular-https-news-ycombinator-com-item-id-5541517"><a href="https://news.ycombinator.com/item?id=5541517">Creating change when politically unpopular</a></h4> <blockquote> <p>A small anecdote. An acquaintance related a story of fixing the 'drainage' in their back yard. They were trying to grow some plants that were sensitive to excessive moisture, and the plants were dying. Not watering them, or watering them a little, didn't seem to change anything. They died. A professional gardener suggested that their problem was drainage. So they dug down about 3' (where the soil was very very wet) and tried to build in better drainage. As they were on the side of a hill, water table issues were not considered. It turned out their &quot;problem&quot; was that the water main that fed their house and the houses up the hill was so pressurized at their property (because it had to maintain pressure at the top of the hill too) that the pipe seams were leaking and it was pumping gallons of water into the ground underneath their property. The problem wasn't their garden, the problem was that the city water supply was poorly designed.</p> <p>While I have never been asked if I was an engineer on the phone, I have experienced similar things to Rachel in meetings and with regard to suggestions. Co-workers will create an internal assessment of your value and then respond based on that assessment. If they have written you off they will ignore you; if you prove their assessment wrong in a public forum they will attack you. These are management issues, and something which was sorely lacking in the stories.</p> <p>If you are the &quot;owner&quot; of a meeting, and someone is trying to be heard and isn't, it is incumbent on you to let them be heard. By your position power as &quot;the boss&quot; you can naturally interrupt a discussion to collect more data from other members.
It's also important to ask questions like &quot;does anyone have any concerns?&quot; to draw out people who have valid input but are too timid to share it.</p> <p>In a highly political environment there are two ways to create change: one is through overt manipulation, which is to collect political power to yourself and then exert it to enact change, and the other is covert manipulation, which is to enact change subtly enough that the political organism doesn't react (sometimes called &quot;triggering the antibodies&quot;).</p> <p>The problem with the latter is that if you help make positive change while keeping everyone not pissed off, no one attributes it to you (which is good for the change agent, because if they knew, the antibodies would react, but bad if your manager doesn't recognize it). I asked my manager what change he wanted to be 'true' yet he (or others) had been unsuccessful making true, he gave me one, and 18 months later that change was in place. He didn't believe that I was the one who had made the change. I suggested he pick a change he wanted to happen and not tell me, then in 18 months we could see if that one happened :-). But he also didn't understand enough about organizational dynamics to know that making change without having the source of that change point back at you was even possible.</p> </blockquote> <h4 id="how-to-get-tech-support-from-google-https-news-ycombinator-com-item-id-6837249"><a href="https://news.ycombinator.com/item?id=6837249">How to get tech support from Google</a></h4> <blockquote> <p>Heavily relying on Google product? ✓<br> Hitting a dead-end with Google's customer service? ✓<br> Have an existing audience you can leverage to get some random Google employee's attention? ✓<br> Reach front page of Hacker News? ✓<br> Good news! You should have your problem fixed in 2-5 business days. The rest of us suckers relying on google services get to stare at our inboxes helplessly, waiting for a response to our support ticket (which will never come). I feel like it's almost a [rite] of passage these days to rely heavily on a Google service, only to have something go wrong and be left out in the cold.</p> </blockquote> <h4 id="taking-funding-https-news-ycombinator-com-item-id-7462112"><a href="https://news.ycombinator.com/item?id=7462112">Taking funding</a></h4> <blockquote> <p>IIRC PayPal was very similar - it was sold for $1.5B, but Max Levchin's share was only about $30M, and Elon Musk's was only about $100M. By comparison, many early Web 2.0 darlings (Del.icio.us, Blogger, Flickr) sold for only $20-40M, but their founders had only taken small seed rounds, and so the vast majority of the purchase price went to the founders. 75% of a $40M acquisition = 3% of a $1B acquisition.</p> <p>Something for founders to think about when they're taking funding. If you look at the gigantic tech fortunes - Gates, Page/Brin, Omidyar, Bezos, Zuckerberg, Hewlett/Packard - they usually came from having a company that was already profitable or was already well down the hockey-stick user growth curve and had a clear path to monetization by the time they sought investment. Companies that fight tooth &amp; nail for customers and need lots of outside capital to do it usually have much worse financial outcomes.</p> </blockquote> <h4 id="stackoverflow-vs-experts-exchange-https-news-ycombinator-com-item-id-3656581"><a href="https://news.ycombinator.com/item?id=3656581">StackOverflow vs.
Experts-Exchange</a></h4> <blockquote> <p>A lot of the people who were involved in some way in Experts-Exchange don't understand Stack Overflow.</p> <p>The basic value flow of EE is that &quot;experts&quot; provide valuable &quot;answers&quot; for novices with questions. In that equation there's one person asking a question and one person writing an answer.</p> <p>Stack Overflow recognizes that for every person who asks a question, 100 - 10,000 people will type that same question into Google and find an answer that has already been written. In our equation, we are a community of people writing answers that will be read by hundreds or thousands of people. Ours is a project more like wikipedia -- collaboratively creating a resource for the Internet at large.</p> <p>Because that resource is provided by the community, it belongs to the community. That's why our data is freely available and licensed under creative commons. We did this specifically because of the negative experience we had with EE taking a community-generated resource and deciding to slap a paywall around it.</p> <p>The attitude of many EE contributors, like Greg Young who calculates that he &quot;worked&quot; for half a year for free, is not shared by the 60,000 people who write answers on SO every month. When you talk to them you realize that on Stack Overflow, answering questions is about learning. It's about creating a permanent artifact to make the Internet better. It's about helping someone solve a problem in five minutes that would have taken them hours to solve on their own. It's not about working for free.</p> <p>As soon as EE introduced the concept of money they forced everybody to think of their work on EE as just that -- work.</p> </blockquote> <h4 id="making-money-from-amazon-bots-https-news-ycombinator-com-item-id-2475986"><a href="https://news.ycombinator.com/item?id=2475986">Making money from amazon bots</a></h4> <blockquote> <p>I saw that one of my old textbooks was selling for a nice price, so I listed it along with two other used copies. I priced it $1 cheaper than the lowest price offered, but within an hour both sellers had changed their prices to $.01 and $.02 cheaper than mine. I reduced it two times more by $1, and each time they beat my price by a cent or two. So what I did was reduce my price by a few dollars every hour for one day until everybody was priced under $5. Then I bought their books and changed my price back.</p> </blockquote> <h4 id="what-running-a-business-is-like-https-news-ycombinator-com-item-id-4060229"><a href="https://news.ycombinator.com/item?id=4060229">What running a business is like</a></h4> <blockquote> <p>While I like the sentiment here, I think the danger is that engineers might come to the mistaken conclusion that making pizzas is the primary limiting reagent to running a successful pizzeria. 
Running a successful pizzeria is more about schlepping to local hotels and leaving them 50 copies of your menu to put at the front desk, hiring drivers who will both deliver pizzas in a timely fashion and not embezzle your (razor-thin) profits while also costing next-to-nothing to employ, maintaining a kitchen in sufficient order to pass your local health inspector's annual visit (and dealing with 47 different pieces of paper related to that), being able to juggle priorities like &quot;Do I take out a bank loan to build a new brick-oven, which will make the pizza taste better, in the knowledge that this will commit $3,000 of my cash flow every month for the next 3 years, or do I hire an extra cook?&quot;, sourcing ingredients such that they're available in quantity and quality every day for a fairly consistent price, setting prices such that they're locally competitive for your chosen clientele but generate a healthy gross margin for the business, understanding why a healthy gross margin really doesn't imply a healthy net margin and that the rent still needs to get paid, keeping good-enough records such that you know whether your business is dying before you can't make payroll and such that you can provide a reasonably accurate picture of accounts for the taxation authorities every year, balancing 50% off medium pizza promotions with the desire to not cannibalize the business of your regulars, etc etc, and by the way tomato sauce should be tangy but not sour and cheese should melt with just the faintest whisp of a crust on it.</p> <p>Do you want to write software for a living? Google is hiring. Do you want to run a software business? Godspeed. Software is now 10% of your working life.</p> </blockquote> <h4 id="how-to-handle-mismanagement-https-news-ycombinator-com-item-id-7179946"><a href="https://news.ycombinator.com/item?id=7179946">How to handle mismanagement?</a></h4> <blockquote> <p>The way I prefer to think of it is: it is not your job to protect people (particularly senior management) from the consequences of their decisions. Make your decisions in your own best interest; it is up to the organization to make sure that your interest aligns with theirs.</p> <p>Google used to have a severe problem where code refactoring &amp; maintenance was not rewarded in performance reviews while launches were highly regarded, which led to the effect of everybody trying to launch things as fast as possible and nobody cleaning up the messes left behind. Eventually launches started getting slowed down, Larry started asking &quot;Why can't we have nice things?&quot;, and everybody responded &quot;Because you've been paying us to rack up technical debt.&quot; As a result, teams were formed with the express purpose of code health &amp; maintenance, those teams that were already working on those goals got more visibility, and refactoring contributions started counting for something in perf. Moreover, many ex-Googlers who were fed up with the situation went to Facebook and, I've heard, instituted a culture there where grungy engineering maintenance is valued by your peers.</p> <p>None of this would've happened if people had just heroically fallen on their own sword and burnt out doing work nobody cared about. Sometimes it takes highly visible consequences before people with decision-making power realize there's a problem and start correcting it. 
If those consequences never happen, they'll keep believing it's not a problem and won't pay much attention to it.</p> </blockquote> <h4 id="some-downsides-of-immutability-https-news-ycombinator-com-item-id-9145197"><a href="https://news.ycombinator.com/item?id=9145197">Some downsides of immutability</a></h4> <h4 id="taking-responsibility-https-news-ycombinator-com-item-id-8082029"><a href="https://news.ycombinator.com/item?id=8082029">Taking responsibility</a></h4> <blockquote> <p>The thing my grandfather taught me was that you live with all of your decisions for the rest of your life. When you make decisions which put other people at risk, you take on the risk that you are going to make someone's life harder, possibly much harder. What is perhaps even more important is that no amount of &quot;I'm so sorry I did that ...&quot; will ever undo it. Sometimes it's little things, like taking the last serving because you thought everyone had eaten; sometimes it's big things, like deciding that home is close enough and you're sober enough to get there safely. They are all decisions we make every day. And as I've gotten older the weight of ones I wish I had made differently doesn't get any lighter. You can lie to yourself about your choices, rationalize them, but that doesn't change them either.</p> <p>I didn't understand any of that when I was younger.</p> </blockquote> <h4 id="people-who-aren-t-exactly-lying-https-news-ycombinator-com-item-id-2819426"><a href="https://news.ycombinator.com/item?id=2819426">People who aren’t exactly lying</a></h4> <blockquote> <p>It took me too long to figure this out. There are some people who truly, and passionately, believe something they say to you, and realistically they personally can't make it happen so you can't really bank on that 'promise.'</p> <p>I used to think those people were lying to take advantage, but as I've gotten older I have come to recognize that these 'yes' people get promoted a lot. And for some of them, they really do believe what they are saying.</p> <p>As an engineer I've found that once I can 'calibrate' someone's 'yes-ness' I can then work with them, understanding that they only make 'wishful' commitments rather than 'reasoned' commitments.</p> <p>So when someone, like Steve Jobs, says &quot;we're going to make it an open standard!&quot;, my first question then is &quot;Great, I've got your support in making this an open standard so I can count on you to wield your position influence to aid me when folks line up against that effort, right?&quot; If the answer to that question is no, then they were lying.</p> <p>The difference is subtle of course but important. Steve clearly doesn't go to standards meetings and vote etc, but if Manager Bob gets push back from accounting that he's going to exceed his travel budget by sending 5 guys to the Open Video Chat Working Group which is championing the Facetime protocol as an open standard, then Manager Bob goes to Steve and says &quot;I need your help here, these 5 guys are needed to argue this standard and keep it from being turned into a turd by the 5 guys from Google who are going to attend.&quot; and then Steve whips off a one liner to accounting that says &quot;Get off this guy's back we need this.&quot; Then it's all good.
If on the other hand he says &quot;We gotta save money, send one guy.&quot; well in that case I'm more sympathetic to the accusation of prevarication.</p> </blockquote> <h4 id="what-makes-engineers-productive-https-news-ycombinator-com-item-id-5496914"><a href="https://news.ycombinator.com/item?id=5496914">What makes engineers productive</a>?</h4> <blockquote> <p>For those who work inside Google, it's well worth it to look at Jeff &amp; Sanjay's commit history and code review dashboard. They aren't actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.</p> <p>The reason they have a reputation as rockstars is that they can apply this productivity to things that really matter; they're able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn't really have happened anyway, or migrating from one system to another that isn't really a large improvement, or going down an architectural dead end that'll just have to be rewritten later. Jeff or Sanjay (or any of the other folks operating at that level) will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they rapidly test out alternatives. They don't actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which makes them ensure that they write the rightcode.</p> <p>I feel like this point needs to be stressed a whole lot more than it is, as there's a whole mythology that's grown up around 10x developers that's not all that helpful. In particular, people need to realize that these developers rapidly become 1x developers (or worse) if you don't let them make their own architectural choices - the reason they're excellent in the first place is because they know how to determine if certain work is going to be useless and avoid doing it in the first place. If you dictate that they do it anyway, they're going to be just as slow as any other developer</p> </blockquote> <h4 id="do-the-work-be-a-hero-https-news-ycombinator-com-item-id-2682645"><a href="https://news.ycombinator.com/item?id=2682645">Do the work, be a hero</a></h4> <blockquote> <p>I got the hero speech too, once. If anyone ever mentions the word &quot;heroic&quot; again and there isn't a burning building involved, I will start looking for new employment immediately. 
It seems that in our industry it is universally a code word for &quot;We're about to exploit you because the project is understaffed and under budgeted for time and that is exactly as we planned it so you'd better cowboy up.&quot;</p> <p>Maybe it is different if you're writing Quake, but I guarantee you the 43rd best selling game that year also had programmers &quot;encouraged onwards&quot; by tales of the glory that awaited after the death march.</p> </blockquote> <h4 id="learning-english-from-watching-movies-https-news-ycombinator-com-item-id-7333625"><a href="https://news.ycombinator.com/item?id=7333625">Learning English from watching movies</a></h4> <blockquote> <p>I was once speaking to a good friend of mine here, in English.<br> &quot;Do you want to go out for yakitori?&quot;<br> &quot;Go fuck yourself!&quot;<br> &quot;... [switches to Japanese] Have I recently done anything very major to offend you?&quot;<br> &quot;No, of course not.&quot;<br> &quot;Oh, OK, I was worried. So that phrase, that's something you would only say under extreme distress when you had maximal desire to offend me, or I suppose you could use it jokingly between friends, but neither you nor I generally talk that way.&quot;<br> &quot;I learned it from a movie. I thought it meant ‘No.’&quot;<br></p> </blockquote> <h4 id="being-smart-and-getting-things-done-https-news-ycombinator-com-item-id-2838149"><a href="https://news.ycombinator.com/item?id=2838149">Being smart and getting things done</a></h4> <blockquote> <p>True story: I went to a talk given by one of the 'engineering elders' (these were low Emp# engineers who were considered quite successful and were to be emulated by the workers :-) This person stated that when they came to work at Google they were given the XYZ system to work on (sadly I'm prevented from disclosing the actual system). They remarked how they spent a couple of days looking over the system, which was complicated and creaky; they couldn't figure it out, so they wrote a new system. Yup, and they committed that. This person is a coding God, are they not? (sarcasm) I asked what happened to the old system (I knew but was interested in their perspective) and they said it was still around because a few things still used it, but (quite proudly) nearly everything else had moved to their new system.</p> <p>So if you were reading carefully, this person created a new system to 'replace' an existing system which they didn't understand and got nearly everyone to move to the new system. That made them uber because they got something big to put on their internal resume, and a whole crapload of folks had to write new code to adapt from the old system to this new system, which imperfectly recreated the old system (remember they didn't understand the original), such that those parts of the system that relied on the more obscure bits had yet to be converted (because nobody understood either the dependent code or the old system apparently).</p> <p>Was this person smart? Blindingly brilliant according to some of their peers. Did they get things done? Hell yes, they wrote the replacement for the XYZ system from scratch! One person? Can you imagine? Would I hire them?
Not unless they were the last qualified person in my pool and I was out of time.</p> <p>That anecdote encapsulates the dangerous side of smart people who get things done.</p> </blockquote> <h4 id="public-speaking-tips-https-news-ycombinator-com-item-id-6199544"><a href="https://news.ycombinator.com/item?id=6199544">Public speaking tips</a></h4> <blockquote> <p>Some kids grow up on football. I grew up on public speaking (as behavioral therapy for a speech impediment, actually). If you want to get radically better in a hurry:</p> </blockquote> <p>Too long to excerpt. See the link.</p> <h4 id="a-reason-a-company-can-be-a-bad-fit-https-news-ycombinator-com-item-id-3018643"><a href="https://news.ycombinator.com/item?id=3018643">A reason a company can be a bad fit</a></h4> <blockquote> <p>I can relate to this, but I can also relate to the other side of the question. Sometimes it isn't me, its you. Take someone who gets things done and suddenly in your organization they aren't delivering. Could be them, but it could also be you.</p> <p>I had this experience working at Google. I had a horrible time getting anything done there. Now I spent a bit of time evaluating that since it had never been the case in my career, up to that point, where I was unable to move the ball forward and I really wanted to understand that. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.</p> <p>The fear was risk and safety. Folks moved around a lot and so you had people in charge of systems they didn't build, didn't understand all the moving parts of, and were apt to get a poor rating if they broke. When dealing with people in that situation one could either educate them and bring them along, or steam roll over them. Education takes time, and during that time the 'teacher' doesn't get anything done. This favors steamrolling evolutionarily :-)</p> <p>So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren't up for that, well they aren't going to be nearly as successful as you would like them to be.</p> </blockquote> <h4 id="what-working-at-google-is-like-https-news-ycombinator-com-item-id-3473117"><a href="https://news.ycombinator.com/item?id=3473117">What working at Google is like</a></h4> <blockquote> <p>I can tell that this was written by an outsider, because it focuses on the perks and rehashes several cliches that have made their way into the popular media but aren't all that accurate.</p> <p>Most Googlers will tell you that the best thing about working there is having the ability to work on really hard problems, with really smart coworkers, and lots of resources at your disposal. I remember asking my interviewer whether I could use things like Google's index if I had a cool 20% idea, and he was like &quot;Sure. That's encouraged. Oftentimes I'll just grab 4000 or so machines and run a MapReduce to test out some hypothesis.&quot; My phone screener, when I asked him what it was like to work there, said &quot;It's a place where really smart people go to be average,&quot; which has turned out to be both true and honestly one of the best things that I've gained from working there.</p> </blockquote> <h4 id="nsa-vs-black-hat-https-news-ycombinator-com-item-id-6136338"><a href="https://news.ycombinator.com/item?id=6136338">NSA vs. 
Black Hat</a></h4> <blockquote> <p>This entire event was a staged press op. Keith Alexander is a ~30 year veteran of SIGINT, electronic warfare, and intelligence, and a Four-Star US Army General --- which is a bigger deal than you probably think it is. He's a spy chief in the truest sense and a master politician. Anyone who thinks he walked into that conference hall in Caesars without a near perfect forecast of the outcome of the speech is kidding themselves.</p> <p>Heckling Alexander played right into the strategy. It gave him an opportunity to look reasonable compared to his detractors, and, more generally (and alarmingly), to have the NSA look more reasonable compared to opponents of NSA surveillance. It allowed him to &quot;split the vote&quot; with audience reactions, getting people who probably have serious misgivings about NSA programs to applaud his calm and graceful handling of shouted insults; many of those people probably applauded simply to protest the hecklers, who after all were making it harder for them to follow what Alexander was trying to say.</p> <p>There was no serious Q&amp;A on offer at the keynote. The questions were pre-screened; all attendees could do was vote on them. There was no possibility that anything would come of this speech other than an effectively unchallenged full-throated defense of the NSA's programs.</p> </blockquote> <h4 id="are-deadlines-necessary-https-news-ycombinator-com-item-id-2507946"><a href="https://news.ycombinator.com/item?id=2507946">Are deadlines necessary</a>?</h4> <blockquote> <p>Interestingly one of the things that I found most amazing when I was working for Google was a nearly total inability to grasp the concept of 'deadline.' For so many years the company just shipped it by committing it to the release branch and having the code deploy over the course of a small number of weeks to the 'fleet'.</p> <p>Sure there were 'processes', like &quot;Canary it in some cluster and watch the results for a few weeks before turning it loose on the world.&quot; but being completely vertically integrated is a unique sort of situation.</p> </blockquote> <h4 id="debugging-on-windows-vs-linux-https-news-ycombinator-com-item-id-5125078"><a href="https://news.ycombinator.com/item?id=5125078">Debugging on Windows vs. Linux</a></h4> <blockquote> <p>Being a very experienced game developer who tried to switch to Linux, I have posted about this before (and gotten flamed heavily by reactionary Linux people).</p> <p>The main reason is that debugging is terrible on Linux. gdb is just bad to use, and all these IDEs that try to interface with gdb to &quot;improve&quot; it do it badly (mainly because gdb itself is not good at being interfaced with). Someone needs to nuke this site from orbit and build a new debugger from scratch, and provide a library-style API that IDEs can use to inspect executables in rich and subtle ways.</p> <p>Productivity is crucial. If the lack of a reasonable debugging environment costs me even 5% of my productivity, that is too much, because games take so much work to make. At the end of a project, I just don't have 5% effort left any more. It requires everything. (But the current Linux situation is way more than a 5% productivity drain. 
I don't know exactly what it is, but if I were to guess, I would say it is something like 20%.)</p> </blockquote> <h4 id="what-happens-when-you-become-rich-https-news-ycombinator-com-item-id-7528556"><a href="https://news.ycombinator.com/item?id=7528556">What happens when you become rich</a>?</h4> <blockquote> <p>What is interesting is that people don't even know they have a complex about money until they get &quot;rich.&quot; I've watched many people, perhaps a hundred, go from &quot;working to pay the bills&quot; to &quot;holy crap I can pay all my current and possibly my future bills with the money I now have.&quot; That doesn't include the guy who lived in our neighborhood and won the CA lottery one year.</p> <p>It affects people in ways they don't expect. If its sudden (like lottery winning or sudden IPO surge) it can be difficult to process. But it is an important thing to realize that one is processing an exceptional event. Like having a loved one die or a spouse suddenly divorcing you.</p> <p>Not everyone feels &quot;guilty&quot;, not everyone feels &quot;smug.&quot; A lot of millionaires and billionaires in the Bay Area are outwardly unchanged. But the bottom line is that the emotion comes from the cognitive dissonance between values and reality. What do you value? What is reality?</p> <p>One woman I knew at Google was massively conflicted when she started work at Google. She always felt that she would help the homeless folks she saw, if she had more money than she needed. Upon becoming rich (on Google stock value), now she found that she wanted to save the money she had for her future kids education and needs. Was she a bad person? Before? After? Do your kids hate you if you give away their college education to the local foodbank? Do your peers hate you because you could close the current food gap at the foodbank and you don't?</p> </blockquote> <h4 id="microsoft-s-skype-acquisition-https-news-ycombinator-com-item-id-2531332"><a href="https://news.ycombinator.com/item?id=2531332">Microsoft’s Skype acquisition</a></h4> <blockquote> <p>This is Microsoft's ICQ moment. Overpaying for a company at the moment when its core competency is becoming a commodity. Does anyone have the slightest bit of loyalty to Skype? Of course not. They're going to use whichever video chat comes built into their SmartPhone, tablet, computer, etc. They're going to use FaceBook's eventual video chat service or something Google offers. No one is going to actively seek out Skype when so many alternatives exist and are deeply integrated into the products/services they already use. Certainly no one is going to buy a Microsoft product simply because it has Skype integration. Who cares if it's FaceTime, FaceBook Video Chat, Google Video Chat? It's all the same to the user.</p> <p>With $7B they should have just given away about 15 million Windows Mobile phones in the form of an epic PR stunt. It's not a bad product -- they just need to make people realize it exists. If they want to flush money down the toilet they might as well engage users in the process right?</p> </blockquote> <h4 id="what-happened-to-google-fiber-https-news-ycombinator-com-item-id-12793300"><a href="https://news.ycombinator.com/item?id=12793300">What happened to Google Fiber</a>?</h4> <blockquote> <p>I worked briefly on the Fiber team when it was very young (basically from 2 weeks before to 2 weeks after launch - I was on loan from Search specifically so that they could hit their launch goals). 
The bottleneck when I was there were local government regulations, and in fact Kansas City was chosen because it had a unified city/county/utility regulatory authority that was very favorable to Google. To lay fiber to the home, you either need right-of-ways on the utility poles (which are owned by Google's competitors) or you need permission to dig up streets (which requires a mess of permitting from the city government). In either case, the cable &amp; phone companies were in very tight with local regulators, and so you had hostile gatekeepers whose approval you absolutely needed.</p> <p>The technology was awesome (1G Internet and HDTV!), the software all worked great, and the economics of hiring contractors to lay the fiber itself actually worked out. The big problem was regulatory capture.</p> <p>With Uber &amp; AirBnB's success in hindsight, I'd say that the way to crack the ISP business is to provide your customers with the tools to break the law en masse. For example, you could imagine an ISP startup that basically says &quot;Here's a box, a wire, and a map of other customers' locations. Plug into their jack, and if you can convince others to plug into yours, we'll give you a discount on your monthly bill based on how many you sign up.&quot; But Google in general is not willing to break laws - they'll go right up to the boundary of what the law allows, but if a regulatory agency says &quot;No, you can't do that&quot;, they won't do it rather than fight the agency.</p> <p>Indeed, Fiber is being phased out in favor of Google's acquisition of WebPass, which does basically exactly that but with wireless instead of fiber. WebPass only requires the building owner's consent, and leaves the city out of it.</p> </blockquote> <h4 id="what-it-s-like-to-talk-at-microsoft-s-teched-https-news-ycombinator-com-item-id-6100368"><a href="https://news.ycombinator.com/item?id=6100368">What it's like to talk at Microsoft's TechEd</a></h4> <blockquote> <p>I've spoken at TechEds in the US and Europe, and been in the top 10 for attendee feedback twice.</p> <p>I'd never speak at TechEd again, and I told Microsoft the same thing, same reasons. The event staff is overly demanding and inconsiderate of speaker time. They repeatedly dragged me into mandatory virtual and in-person meetings to cover inane details that should have been covered via email. They mandated the color of pants speakers wore. Just ridiculously micromanaged.</p> </blockquote> <h4 id="why-did-hertz-suddenly-become-so-flaky-https-news-ycombinator-com-item-id-12855236"><a href="https://news.ycombinator.com/item?id=12855236">Why did Hertz suddenly become so flaky</a>?</h4> <blockquote> <p>Hertz laid off nearly the entirety of their rank and file IT staff earlier this year.</p> <p>In order to receive our severance, we were forced to train our IBM replacements, who were in India. Hertz's strategy of IBM and Austerity is the new SMT's solution for a balance sheet that's in shambles, yet they have rewarded themselves by increasing executive compensation 35% over the prior year, including a $6 million bonus to the CIO.</p> <p>I personally landed in an Alphabet company, received a giant raise, and now I get to work on really amazing stuff, so I'm doing fine. 
But to this day I'm sad to think how our once-amazing Hertz team, staffed with really smart people, led by the best boss I ever had, and really driving the innovation at Hertz, was just thrown away like yesterday's garbage.</p> </blockquote> <h4 id="before-startups-put-clauses-in-contracts-forbidden-they-sometimes-blocked-sales-via-backchannel-communications-https-news-ycombinator-com-item-id-10705646"><a href="https://news.ycombinator.com/item?id=10705646">Before startups put clauses in contracts forbidding sales, they sometimes blocked sales via backchannel communications</a></h4> <blockquote> <p>Don't count on definitely being able to sell the stock to finance the taxes. I left after seven years in very good standing (I believed) but when I went to sell, the deal was shut down [1]. Luckily I had a backup plan and I was ok [2].</p> <p>[1] Had a handshake deal with an investor in the company, then the investor went silent on me. When I followed up he said the deal was &quot;just much too small.&quot; I reached out to the company for help, and they said they'd actually told him not to buy from me. I never would have known if they hadn't decided to tell me for some reason. The takeaway is that the markets for private company stock tend to be small, and the buyers care more about their relationships with the company than they do about having your shares. Even if the stock terms allow them to buy, and they might not.</p> </blockquote> <h4 id="an-amazon-pilot-program-designed-to-reduce-the-cost-of-interviewing-https-news-ycombinator-com-item-id-13076821"><a href="https://news.ycombinator.com/item?id=13076821">An Amazon pilot program designed to reduce the cost of interviewing</a></h4> <blockquote> <p>I took the first test just like the OP; the logical reasoning part seemed kind of irrelevant and a waste of time for me. That was nothing compared to the second online test.</p> <p>The environment of the second test was like a scenario out of Black Mirror. Not only did they want to have the webcam and microphone on the entire time, I also had to install their custom software so the proctors could monitor my screen and control my computer. They opened up the macOS system preferences so they could disable all shortcuts to take screenshots, and they also manually closed all the background services I had running (even f.lux!).</p> <p>Then they asked me to pick up my laptop and show them around my room with the webcam. They specifically asked to see the contents of my desk and the walls and ceiling of my room. I had some pencil and paper on my desk to use as scratch paper for the obvious reasons and they told me that wasn't allowed. Obviously that made me a little upset because I use it to sketch out examples and concepts. They also saw my phone on the desk and asked me to put it out of arm's reach.</p> <p>After that they told me I couldn't leave the room until the 5 minute bathroom break allowed half-way through the test. I had forgotten to tell my roommate I was taking this test and he was making a bit of a ruckus playing L4D2 online (obviously a bit distracting). I asked the proctor if I could briefly leave the room to ask him to quiet down. They said I couldn't leave until the bathroom break so there was nothing I could do. Later on, I was busy thinking about a problem and had adjusted how I was sitting in my chair and moved my face slightly out of the camera's view.
The proctor messaged me again telling me to move so they could see my entire face.</p> </blockquote> <h4 id="amazon-interviews-part-2-https-news-ycombinator-com-item-id-13077230"><a href="https://news.ycombinator.com/item?id=13077230">Amazon interviews, part 2</a></h4> <blockquote> <p>The first part of the interview was exactly like the linked experience. No coding questions, just reasoning. For the second part I had to use ProctorU instead of Proctorio. Personally, I thought the experience was super weird but understandable (I'll get to that later): somebody watched me through my webcam the entire time with my microphone on. They needed to check my ID before the test. They needed me to show them the entire room I was in (which was my bedroom). My desktop computer was on behind my laptop so I turned off my computer (I don't remember if I offered to or if they asked me to), but they also asked me to cover my monitors up with something, which I thought was silly after I turned them off, so I covered them with a towel. They then used LogMeIn to remote into my machine so they could check running programs. I quit all my personal chat programs and pretty much only had the Chrome window running.</p> <p>...</p> <p>I didn't talk to a real person who actually worked at Amazon (by email or through webcam) until I received an offer.</p> </blockquote> <h4 id="what-s-getting-acquired-by-oracle-like-https-news-ycombinator-com-item-id-13199717"><a href="https://news.ycombinator.com/item?id=13199717">What's getting acquired by Oracle like</a>?</h4> <blockquote> <p>[M]y company got acquired by Oracle. We thought things would be OK. Nothing changed immediately. Slowly but surely they turned the screws. 5 year laptop replacement policy. You get the corporate standard laptop and you'll like it. Sales? Oh those guys can buy new Macs every two years, they get whatever they want. Then you understand where Software Engineers rank in the company hierarchy. Oracle took the average price of our product from $100k to $5 million for the same size deals. Our sales went from $5-7m to more than $40m with no increase in engineering headcount (team of 15). Didn't matter when bonus time came; we all got stack-ranked and some people got nothing. As a top performer I got a few options, worth maybe $5k.</p> <p>Oracle exists to extract the maximum amount of money possible from the Fortune 1000. Everyone else can fuck off. Your impotent internet rage is meaningless. If it doesn't piss off the CTO of $X then it doesn't matter. If it gets that CTO to cut a bigger check then it will be embraced with extreme enthusiasm.</p> <p>The culture wears down a lot (but not all) of the good people, who then leave. What's left is a lot of mediocrity and architecture astronauts. The more complex the product the better - it means extra consulting dollars!</p> <p>My relative works at a business dependent on Micros. When Oracle announced the acquisition I told them to start on the backup plan immediately because Oracle was going to screw them sooner or later.
A few years on and that is proving true: Oracle is slowly excising the Micros dealers and ISVs out of the picture, gobbling up all the revenue while hiking prices.</p> </blockquote> <h4 id="how-do-you-avoid-hiring-developers-who-do-negative-work-https-news-ycombinator-com-item-id-13209317"><a href="https://news.ycombinator.com/item?id=13209317">How do you avoid hiring developers who do negative work</a>?</h4> <blockquote> <p>In practice, we have to face that our quest for more stringent hiring standards is not really selecting the best, but just selecting fewer people, in ways that might, or might not, have anything to do with being good at a job. Let's go through a few examples in my career:</p> <p>A guy that was the most prolific developer I have ever seen: He'd rewrite entire subsystems over a weekend. The problem is that said subsystems were not necessarily better than when they started, trading bugs for bugs, and anyone that wanted to work on them would have to relearn that programmer's idiosyncrasies of the week. He easily cost his project 12 man/months of work in 4 months, the length of time it took for management to realize that he had to be let go.</p> <p>A company's big UI framework was quite broken, and a new developer came in and fixed it. Great, right? Well, he was handed code review veto over changes to the framework, and his standards and his demeanor made people stop contributing after two or three attempts. In practice, the framework died as people found it antiquated, and they decided to build a new one: Well, the same developer was tasked with building the new framework, which was made mandatory for 200+ developers to use. Total contribution was clearly negative.</p> <p>A developer that was very fast, and wrote working code, had been managing a rather large 500K line codebase, and received some developers as help. He didn't believe in internal documentation or in keeping interfaces stable. He also didn't believe in writing code that wasn't brittle, or in unit tests: Code changes from the new developers often broke things; the veteran would come in, fix everything in the middle of the emergency, and look absolutely great, while all the other developers looked to management as if they were incompetent. They were not, however: they were quite successful when moved to other teams. It just happens that the original developer made sure nobody else could touch anything. Eventually, the experiment was retried after the original developer was sent to do other things. It took a few months, but the new replacement team managed to modularize the code, and new people could actually modify the codebase productively.</p> <p>All of those negative value developers could probably be very valuable in very specific conditions, and they'd look just fine in a tough job interview. They were still terrible hires. In my experience, if anything, a harder process that demands that people appear smarter or work faster in an interview has the opposite effect of what I'd want: it ends up selecting for people that think less and do more quickly, building debt faster.</p> <p>My favorite developers ever all do badly in your typical stringent Silicon Valley interview. They work slower, do more thinking, and consider every line of code they write technical debt. They won't have a million algorithms memorized: They'll go look at sources more often than not, and will spend a lot of time on tests that might as well be documentation.
Very few of those traits are positive in an interview, but I think they are vital in creating good teams, yet few select for them at all.</p> </blockquote> <h4 id="linux-and-the-demise-of-solaris-https-news-ycombinator-com-item-id-13081465"><a href="https://news.ycombinator.com/item?id=13081465">Linux and the demise of Solaris</a></h4> <blockquote> <p>I worked on Solaris for over a decade, and for a while it was usually a better choice than Linux, especially due to price/performance (which includes how many instances it takes to run a given workload). It was worth fighting for, and I fought hard. But Linux has now become technically better in just about every way. Out-of-box performance, tuned performance, observability tools, reliability (on patched LTS), scheduling, networking (including TCP feature support), driver support, application support, processor support, debuggers, syscall features, etc. Last I checked, ZFS worked better on Solaris than Linux, but it's an area where Linux has been catching up. I have little hope that Solaris will ever catch up to Linux, and I have even less hope for illumos: Linux now has around 1,000 monthly contributors, whereas illumos has about 15.</p> <p>In addition to technology advantages, Linux has a community and workforce that's orders of magnitude larger, staff with invested skills (re-education is part of a TCO calculation), companies with invested infrastructure (rewriting automation scripts is also part of TCO), and also much better future employment prospects (a factor that can influence people wanting to work at your company on that OS). Even with my considerable and well-known Solaris expertise, the employment prospects with Solaris are bleak and getting worse every year. With my Linux skills, I can work at awesome companies like Netflix (which I highly recommend), Facebook, Google, SpaceX, etc.</p> <p>Large technology-focused companies, like Netflix, Facebook, and Google, have the expertise and appetite to make a technology-based OS decision. We have dedicated teams for the OS and kernel with deep expertise. On Netflix's OS team, there are three staff who previously worked at Sun Microsystems and have more Solaris expertise than they do Linux expertise, and I believe you'll find similar people at Facebook and Google as well. And we are choosing Linux.</p> <p>The choice of an OS includes many factors. If an OS came along that was better, we'd start with a thorough internal investigation, involving microbenchmarks (including an automated suite I wrote), macrobenchmarks (depending on the expected gains), and production testing using canaries. We'd be able to come up with a rough estimate of the cost savings based on price/performance. Most microservices we have run hot in user-level applications (think 99% user time), not the kernel, so it's difficult to find large gains from the OS or kernel. Gains are more likely to come from off-CPU activities, like task scheduling and TCP congestion, and indirect, like NUMA memory placement: all areas where Linux is leading. It would be very difficult to find a large gain by changing the kernel from Linux to something else. Just based on CPU cycles, the target that should have the most attention is Java, not the OS. But let's say that somehow we did find an OS with a significant enough gain: we'd then look at the cost to switch, including retraining staff, rewriting automation software, and how quickly we could find help to resolve issues as they came up.
Linux is so widely used that there's a good chance someone else has found an issue, had it fixed in a certain version, or documented a workaround.</p> <p>What's left where Solaris/SmartOS/illumos is better? 1. There's more marketing of the features and people. Linux develops great technologies and has some highly skilled kernel engineers, but I haven't seen any serious effort to market these. Why does Linux need to? And 2. Enterprise support. Large enterprise companies where technology is not their focus (e.g., a breakfast cereal company) and who want to outsource these decisions to companies like Oracle and IBM. Oracle still has Solaris enterprise support that I believe is very competitive compared to Linux offerings.</p> </blockquote> <h4 id="why-wasn-t-rethinkdb-more-sucessful-https-news-ycombinator-com-item-id-12902689"><a href="https://news.ycombinator.com/item?id=12902689">Why wasn't RethinkDB more successful</a>?</h4> <blockquote> <p>I'd argue that where RethinkDB fell down is on a step you don't list, &quot;Understand the context of the problem&quot;, which you'd ideally do before figuring out how many people it's a problem for. Their initial idea was a MySQL storage engine for SSDs - the environmental change was that SSD prices were falling rapidly, SSDs have wildly different performance characteristics from disk, and so they figured there was an opportunity to catch the next wave. Only problem is that the biggest corporate buyers of SSDs are gigantic tech companies (e.g., Google, Amazon) with large amounts of proprietary software, and so a generic MySQL storage engine isn't going to be useful to them anyway.</p> <p>Unfortunately they'd already taken funding, built a team, and written a lot of code by the time they found that out, and there's only so far you can pivot when you have an ecosystem like that.</p> </blockquote> <h4 id="on-falsehoods-programmers-believe-about-x-https-news-ycombinator-com-item-id-13260082"><a href="https://news.ycombinator.com/item?id=13260082">On falsehoods programmers believe about X</a></h4> <blockquote> <p>This unfortunately follows the conventions of the genre called &quot;Falsehood programmers believe about X&quot;: ...</p> <p>I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.</p> </blockquote> <h4 id="what-does-it-mean-if-a-company-rescinds-an-offer-because-you-tried-to-negotiate-https-news-ycombinator-com-item-id-13189972"><a href="https://news.ycombinator.com/item?id=13189972">What does it mean if a company rescinds an offer because you tried to negotiate</a>?</h4> <blockquote> <p>It happens sometimes. Usually it's because of one of two situations:</p> <p>1) The company was on the fence about wanting you anyway, and negotiating takes you from the &quot;maybe kinda sorta want to work with&quot; to the &quot;don't want to work with&quot; pile.</p> <p>2) The company is looking for people who don't question authority and don't stick up for their own interests.</p> <p>Both of these are red flags.
It's not really a matter of ethics - they're completely within their rights to withdraw an offer for any reason - but it's a matter of &quot;Would you really want to work there anyway?&quot; For both corporations and individuals, it usually leads to a smoother life if you only surround yourself with people who really value you.</p> </blockquote> <h3 id="hn-comments-https-news-ycombinator-com-item-id-13291567"><a href="https://news.ycombinator.com/item?id=13291567">HN comments</a></h3> <blockquote> <p>I feel like this is every HN discussion about &quot;rates, comma, raising them&quot;: a mean-spirited attempt to convince the audience on the site that high rates aren't really possible, because if they were, the person telling you they're possible would be wealthy beyond the dreams of avarice. Once again: Patrick is just offering a more refined and savvy version of advice me and my Matasano friends gave him, and our outcomes are part of the record of a reasonably large public company.</p> <p>This, by the way, is why I'll never write this kind of end-of-year wrap-up post (and, for the same reasons, why I'll never open source code unless I absolutely have to). It's also a big part of what I'm trying to get my hands around for the Starfighter wrap-up post. When we started Starfighter, everyone said &quot;you're going to have such an amazing time because of all the HN credibility you have&quot;. But pretty much every time Starfighter actually came up on HN, I just wanted to hide under a rock. Even when the site is civil, it's still committed to grinding away any joy you take either in accomplishing something neat or even in just sharing something interesting you learned. You could sort of understand an atavistic urge to shit all over someone sharing an interesting experience that was pleasant or impressive. There's a bad Morrissey song about that. But look what happens when you share an interesting story that obviously involved significant unpleasantness and an honest accounting of one's limitations: a giant thread full of people piling on to question your motives and life choices. You can't win.</p> </blockquote> <h4 id="on-the-journalistic-integrity-of-quartz-https-news-ycombinator-com-item-id-13339723"><a href="https://news.ycombinator.com/item?id=13339723">On the journalistic integrity of Quartz</a></h4> <blockquote> <p>I was the first person to be interviewed by this journalist (Michael Thomas @curious_founder). He approached me on Twitter to ask questions about digital nomad and remote work life (as I founded Nomad List and have been doing it for years).</p> <p>I told him it'd be great to see more honest depictions, as most articles are heavily idealized, making it sound all great, when it's not necessarily. It's ups and downs (just like regular life really).</p> <p>What happened next may surprise you. He wrote a hit piece on me, changing my entire story that I told him over Skype into a clickbait article of how digital nomadism doesn't work and one of the main people doing it for a while (en public) even settled down and gave up altogether.</p> <p>I didn't settle down. I spent the summer in Amsterdam. Cause you know, it's a nice place! But he needed to say this to make a polarized hit piece with an angle. And that piece went viral. Resulting in me having to tell people daily that I didn't and getting lots of flack. You may understand it doesn't help if your entire startup is about something and a journalist writes a viral piece about how you yourself don't even believe in that anymore.
I contacted the journalist and Quartz but they didn't change a thing.</p> <p>It's great this meant his journalistic breakthrough but it hurt me in the process.</p> <p>I'd argue journalists like this are the whole problem we have these days. The articles they write can't be balanced because they need to get pageviews. Every potential to write something interesting quickly turns into clickbait. It turned me off from being interviewed ever again. Doing my own PR by posting in the comment sections of Hacker News or Reddit seems like a better idea (also see how Elon Musk does exactly this, seems smarter).</p> </blockquote> <h4 id="how-did-click-and-clack-always-manage-to-solve-the-problem-https-news-ycombinator-com-item-id-13348944"><a href="https://news.ycombinator.com/item?id=13348944">How did Click and Clack always manage to solve the problem</a>?</h4> <blockquote> <p>Hope this doesn't ruin it for you, but I knew someone who had a problem presented on the show. She called in and reached an answering machine. Someone called her and qualified the problem. Then one of the brothers called and talked to her for a while. Then a few weeks later (there might have been some more calls, I don't know) both brothers called her and talked to her for a while. Her parts of that last call were edited into the radio show so it sounded like she had called and they just figured out the answer on the spot.</p> </blockquote> <h4 id="why-are-so-many-people-down-on-blockchain-https-news-ycombinator-com-item-id-13420777"><a href="https://news.ycombinator.com/item?id=13420777">Why are so many people down on blockchain</a>?</h4> <blockquote> <p>Blockchain is the world's worst database, created entirely to maintain the reputations of venture capital firms who injected hundreds of millions of dollars into a technology whose core defining insight was &quot;You can improve on a Ponzi scam by making it self-organizing and distributed; that gets vastly more distribution, reduces the single point of failure, and makes it censorship-resistant.&quot;</p> <p>That's more robust than I usually phrase things on HN, but you did ask. In slightly more detail:</p> <p>Databases are wonderful things. We have a number which are actually employed in production, at a variety of institutions. They run the world. Meaningful applications run on top of Postgres, MySQL, Oracle, etc etc.</p> <p>No meaningful applications run on top of &quot;blockchain&quot;, because it is a marketing term. You cannot install blockchain just like you cannot install database. (Database sounds much cooler without the definite article, too.) If you pick a particular instantiation of a blockchain-style database, it is a horrible, horrible database.</p> <p>Can I pick on Bitcoin? Let me pick on Bitcoin. Bitcoin is claimed to be a global financial network and ready for production right now. Bitcoin cannot sustain 5 transactions per second, worldwide.</p> <p>You might be sensibly interested in Bitcoin governance if, for some reason, you wanted to use Bitcoin. Bitcoin is a software artifact; it matters to users who makes changes to it and by what process. (Bitcoin is a software artifact, not a protocol, even though the Bitcoin community will tell you differently. There is a single C++ codebase which matters. It is essentially impossible to interoperate with Bitcoin without bugs-and-all replicating that codebase.) Bitcoin governance is captured by approximately 5 people.
This is a robust claim and requires extraordinary evidence.</p> <p>Ordinary evidence would be pointing you, in a handwavy fashion, to the depth of acrimony with regards to raising the block size, which would let Bitcoin scale to the commanding heights of 10 or, nay, 100 transactions per second worldwide.</p> <p>Extraordinary evidence might be pointing you to the time when the entire Bitcoin network was de-facto shut down based on the consensus of N people in an IRC channel. c.f. <a href="https://news.ycombinator.com/item?id=9320989">https://news.ycombinator.com/item?id=9320989</a> This was back in 2013. Long story short: a software update went awry so they rolled back global state by a few hours by getting the right two people to agree to it on a Skype call.</p> <p>But let's get back to discussing that sole technical artifact. Bitcoin has a higher cost-to-value ratio than almost any technology conceivable; the cost to date is the market capitalization of Bitcoin. Because Bitcoin enters through a seigniorage mechanism, every Bitcoin existing was minted as compensation for &quot;securing the integrity of the blockchain&quot; (by doing computationally expensive makework).</p> <p>This cost is high. Today, routine maintenance of the Bitcoin network will cost the network approximately $1.5 million. That's on the order of $3 per write on a maximum committed capacity basis. It will cost another $1.5 million tomorrow, exchange rate depending.</p> <p>(Bitcoin has successfully shifted much of the cost of operating its database to speculators rather than people who actually use Bitcoin for transaction processing. That game of musical chairs has gone on for a while.)</p> <p>Bitcoin has some properties which one does not associate with many databases. One is that write acknowledgments average 5 minutes. Another is that they can stop, non-deterministically, for more than an hour at a time, worldwide, for all users simultaneously. This behavior is by design.</p> </blockquote> <h4 id="how-big-is-the-proprietary-database-market-https-news-ycombinator-com-item-id-13442042"><a href="https://news.ycombinator.com/item?id=13442042">How big is the proprietary database market</a>?</h4> <blockquote> <ol> <li><p>The database market is NOT closed. In fact, we are in a database boom. Since 2009 (the year RethinkDB was founded), there have been over 100 production grade databases released in the market. These span document stores, Key/Value, time series, MPP, relational, in-memory, and the ever increasing &quot;multi model databases.&quot;</p></li> <li><p>Since 2009, over $600 MILLION dollars (publicly announced) has been invested in these database companies (RethinkDB represents 12.2M or about 2%). That's aside from money invested in the bigger established databases.</p></li> <li><p>Almost all of the companies that have raised funding in this period generate revenue from one or more of the following areas:</p></li> </ol> <p>a) exclusive hosting (meaning AWS et al. do not offer this product) b) multi-node/cluster support c) product enhancements d) enterprise support</p> <p>Looking at each of the above revenue paths as executed by RethinkDB:</p> <p>a) RethinkDB never offered a hosted solution. Compose offered a hosted solution in October of 2014. b) RethinkDB didn't support true high availability until the 2.1 release in August 2015. It was released as open source and to my knowledge was not monetized. c/d) I've heard that an enterprise version of RethinkDB was offered near the end.
Enterprise Support is, empirically, a bad approach for a venture backed company. I don't know that RethinkDB ever took this avenue seriously. Correct me if I am wrong.</p> <p>A model that is not popular among RECENT databases but is popular among traditional databases is a standard licensing model (e.g. Oracle, Microsoft SQL Server). Even these are becoming more rare with the advent of A, but never underestimate the licensing market.</p> <p>Again, this is complete conjecture, but I believe RethinkDB failed for a few reasons:</p> <p>1) not pursuing one of the above revenue models early enough. This has serious effects on the order of the feature enhancements (for instance, the HA released in 2015 could have been released earlier at a premium or to help facilitate a hosted solution).</p> <p>2) incorrect priority of enhancements:</p> <p>2a) general database performance never reached the point it needed to. RethinkDB struggled with both write and read performance well into 2015. There was no clear value add in this area compared to many write or read focused databases released around this time.</p> <p>2b) lack of (proper) High Availability for too long.</p> <p>2c) ReQL was not necessary - most developers use ORMs when interacting with SQL. When you venture into analytical queries, we actually seem to make great effort to provide SQL: look at the number of projects or companies that exist to bring SQL to databases and filesystems that don't support it (Hive, Pig, Slam Data, etc).</p> <p>2d) push notifications. This has not been demonstrated to be a clear market need yet. There are a small handful of companies that are promoting development stacks around this, but no database company is doing the same.</p> <p>2e) lack of focus. What was RethinkDB REALLY good at? It pushed ReQL and joins at first, but it lacked HA until 2015, struggled with high write or read loads into 2015. It then started to focus on real time notifications. Again, there just aren't many databases focusing on these areas.</p> <p>My final thought is that RethinkDB didn't raise enough capital. Perhaps this is because of previous points, but without capital, the above can't be corrected. RethinkDB actually raised far less money than basically any other venture backed company in this space during this time.</p> <p>Again, I've never run a database company so my thoughts are just from an outsider. However, I am the founder of a company that provides database integration products so I monitor this industry like a hawk. I simply don't agree that the database market has been &quot;captured.&quot;</p> <p>I expect to see even bigger growth in databases in the future. I'm happy to share my thoughts about what types of databases are working and where the market needs solutions. Additionally, companies are increasingly relying on third-party cloud services for data they previously captured themselves. Anything from payment processing, order fulfillment, traffic analytics, etc. is now being handled by someone else.</p> </blockquote> <h3 id="a-google-maps-employee-s-opinion-on-the-google-maps-pricing-change-https-news-ycombinator-com-item-id-32403665"><a href="https://news.ycombinator.com/item?id=32403665">A Google Maps employee's opinion on the Google Maps pricing change</a></h3> <blockquote> <p>I was a googler working on Google Maps at the time of the API self-immolation.</p> <p>There were strong complaints from within about the price changes.
Obviously everyone couldn't believe what was being planned, and there were countless spreadsheets and reports and SQL queries showing how this was going to shit all over a lot of customers that we'd be guaranteed to lose to a competitor.</p> <p>Management didn't give a shit.</p> <p>I don't know what the rationale was apart from some vague claim about &quot;charging for value&quot;. A lot of users of the API apparently were basically under the free limits or only spending less than 100 USD on API usage so I can kind of understand the line of thought, but I still think they went way too far.</p> <p>I don't know what happened to the architects of the plan. I presume promo.</p> <p>Edit: I should add that this was not a knee-jerk thing or some exec just woke up one day with an idea in their dreams. It was a planned change that took many months to plan and prepare for, with endless preparations and reporting and so on.</p> </blockquote> <h3 id="toc_56">???</h3> <p>How did HN get the commenter base that it has? If you read HN, on any given week, there are at least as many good, substantial comments as there are posts. This is different from every other modern public news aggregator I can find out there, and I don’t really know what the ingredients are that make HN successful.</p> <p>For the last couple years (ish?), the moderation regime has been really active in trying to get a good mix of stories on the front page and in tamping down on gratuitously mean comments. But there was a period of years where the moderation could be described as sparse, arbitrary, and capricious, and while there are fewer “bad” comments now, it doesn’t seem like good moderation actually generates more “good” comments.</p> <p>The ranking scheme seems to penalize posts that have a lot of comments on the theory that flamebait topics will draw a lot of comments. That sometimes prematurely buries stories with good discussion, but much more often, it buries stories that draw pointless flamewars. If you just read HN, it’s hard to see the effect, but if you look at forums that use comments as a positive factor in ranking, the difference is dramatic -- those other forums that boost topics with many comments (presumably on the theory that vigorous discussion should be highlighted) often have content-free flame wars pinned at the top for long periods of time.</p> <p>Something else that HN does that’s different from most forums is that user flags are weighted very heavily. On reddit, a downvote only cancels out an upvote, which means that flamebait topics that draw a lot of upvotes like “platform X is cancer” or “Y is doing some horrible thing” often get pinned to the top of r/programming for an entire day, since the number of people who don’t want to see that is drowned out by the number of people who upvote outrageous stories. If you read the comments for one of the &quot;X is cancer&quot; posts on r/programming, the top comment will almost inevitably be that the post has no content, that the author of the post is a troll who never posts anything with content, and that we'd be better off with less flamebait by the author at the top of r/programming. But the people who will upvote outrage porn outnumber the people who will downvote it, so that kind of stuff dominates aggregators that use raw votes for ranking.
Having flamebait drop off the front page quickly is significant, but it doesn’t seem sufficient to explain why there are so many more well-informed comments on HN than on other forums with roughly similar traffic.</p> <p>Maybe the answer is that people come to HN for the same reason people come to Silicon Valley -- despite all the downsides, there’s a relatively large concentration of experts there across a wide variety of CS-related disciplines. If that’s true, and it’s a combination of path dependence and network effects, that’s pretty depressing since that’s not replicable.</p> <p><em>If you liked this curated list of comments, you'll probably also like <a href="//danluu.com/programming-books/">this list of books</a> and <a href="//danluu.com/programming-blogs/">this list of blogs</a>.</em></p> <p><small> This is part of an experiment where I write up thoughts quickly, without proofing or editing. Apologies if this is less clear than a normal post. This is probably going to be the last post like this, for now, since, by quickly writing up a post whenever I have something that can be written up quickly, I'm building up a backlog of post ideas that require re-reading the literature in an area or running experiments.</p> <p>P.S. <a href="https://twitter.com/danluu">Please suggest other good comments</a>! By their nature, HN comments are much less discoverable than stories, so there are a lot of great comments that I haven't seen. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B"><p>if you’re one of those people, you’ve probably already thought of this, but maybe consider, at the margin, blogging more and commenting on HN less? As a result of writing this post, I looked through my old HN comments and noticed that I <a href="https://news.ycombinator.com/item?id=6315483">wrote this comment three years ago</a>, which is another way of stating <a href="bimodal-compensation/#appendix-b-wtf">the second half of this post I wrote recently</a>. Comparing the two, I think the HN comment is substantially better written. But, like most HN comments, it got some traffic while the story was still current and is now buried, and AFAICT, nothing really happened as a result of the comment. The blog post, despite being “worse”, has gotten some people to contact me personally, and I’ve had some good discussions about that and other topics as a result. Additionally, people occasionally contact me about older posts I’ve written; I continue to get interesting stuff in my inbox as a result of having written posts years ago. Writing your comment up as a blog post will almost certainly provide more value to you, and if it gets posted to HN, it will probably provide no less value to HN.</p> <p>Steve Yegge has <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">a pretty good list of reasons why you should blog</a> that I won’t recapitulate here. And if you’re writing substantial comments on HN, you’re already doing basically everything you’d need to do to write a blog except that you’re putting the text into a little box on HN instead of into a static site generator or some hosted blogging service. BTW, I’m not just saying this for your benefit: my selfish reason for writing this appeal is that I really want to read the Nathan Kurz blog on low-level optimizations, the Jonathan Tang blog on what it’s like to work at startups vs.
big companies, etc.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> Programming book recommendations and anti-recommendations programming-books/ Sun, 16 Oct 2016 01:06:34 -0700 programming-books/ <p>There are a lot of “12 CS books every programmer must read” lists floating around out there. That's nonsense. The field is too broad for almost any topic to be required reading for all programmers, and even if a topic is that important, people's learning preferences differ too much for any book on that topic to be the best book on the topic for all people.</p> <p>This is a list of topics and books <a href="https://twitter.com/danluu/status/1574819407058325505">where I've read the book</a>, am familiar enough with the topic to say what you might get out of learning more about the topic, and have read other books and can say why you'd want to read one book over another.</p> <h3 id="algorithms-data-structures-complexity">Algorithms / Data Structures / Complexity</h3> <p>Why should you care? Well, there's the pragmatic argument: even if you never use this stuff in your job, most of the best paying companies will quiz you on this stuff in interviews. On the non-bullshit side of things, I find algorithms to be useful in the same way I find math to be useful. The probability of any particular algorithm being useful for any particular problem is low, but having a general picture of what kinds of problems are solved problems, what kinds of problems are intractable, and when approximations will be effective, is often useful.</p> <h4 id="mcdowell-cracking-the-coding-interview-https-www-amazon-com-gp-product-0984782850-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0984782850-linkid-a34501f41a8ccd1ba8d604198c026551">McDowell; <a href="https://www.amazon.com/gp/product/0984782850/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0984782850&amp;linkId=a34501f41a8ccd1ba8d604198c026551">Cracking the Coding Interview</a></h4> <p>Some problems and solutions, with explanations, matching the level of questions you see in entry-level interviews at Google, Facebook, Microsoft, etc. I usually recommend this book to people who want to pass interviews but not really learn about algorithms. It has just enough to get by, but doesn't really teach you the <em>why</em> behind anything. If you want to actually learn about algorithms and data structures, see below.</p> <h4 id="dasgupta-papadimitriou-and-vazirani-algorithms-https-www-amazon-com-gp-product-0073523402-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0073523402-linkid-51557d68c39707af447ec02339249dd1">Dasgupta, Papadimitriou, and Vazirani; <a href="https://www.amazon.com/gp/product/0073523402/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0073523402&amp;linkId=51557d68c39707af447ec02339249dd1">Algorithms</a></h4> <p>Everything about this book seems perfect to me. It breaks up algorithms into classes (e.g., divide and conquer or greedy), and teaches you how to recognize what kind of algorithm should be used to solve a particular problem. It has a good selection of topics for an intro book, it's the right length to read over a few weekends, and it has exercises that are appropriate for an intro book. 
Additionally, it has sub-questions in the middle of chapters to make you reflect on non-obvious ideas and make sure you don't miss anything.</p> <p>I know some folks don't like it because it's relatively math-y/proof focused. If that's you, you'll probably prefer Skiena.</p> <h4 id="skiena-the-algorithm-design-manual-https-www-amazon-com-gp-product-1848000693-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1848000693-linkid-59bca0c3da96693a0c5384c97f6e59bb">Skiena; <a href="https://www.amazon.com/gp/product/1848000693/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1848000693&amp;linkId=59bca0c3da96693a0c5384c97f6e59bb">The Algorithm Design Manual</a></h4> <p>The longer, more comprehensive, more practical, less math-y version of Dasgupta. It's similar in that it attempts to teach you how to identify problems, use the correct algorithm, and give a clear explanation of the algorithm. The book is well motivated with “war stories” that show the impact of algorithms in real world programming.</p> <h4 id="clrs-introduction-to-algorithms-https-www-amazon-com-gp-product-0262033844-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0262033844-linkid-f7dc38d160a69e72427774682d9c0bc6">CLRS; <a href="https://www.amazon.com/gp/product/0262033844/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0262033844&amp;linkId=f7dc38d160a69e72427774682d9c0bc6">Introduction to Algorithms</a></h4> <p>This book somehow manages to make it into half of these “N books all programmers must read” lists despite being so comprehensive and rigorous that almost no practitioners actually read the entire thing. It's great as a textbook for an algorithms class, where you get a selection of topics. As a class textbook, it's a nice bonus that it has exercises that are hard enough that they can be used for graduate level classes (about half the exercises from my grad level algorithms class were pulled from CLRS, and the other half were from Kleinberg &amp; Tardos), but this is wildly impractical as a standalone introduction for most people.</p> <p>Just for example, there's an entire chapter on Van Emde Boas trees. They're really neat -- it's a little surprising that a balanced-tree-like structure with <code>O(lg lg n)</code> insert, delete, as well as find, successor, and predecessor is possible, but a first introduction to algorithms shouldn't include Van Emde Boas trees.</p> <h4 id="kleinberg-tardos-algorithm-design-https-www-amazon-com-gp-product-9332518645-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-9332518645-linkid-284dc1e6677bb43997fac393456bbc70">Kleinberg &amp; Tardos; <a href="https://www.amazon.com/gp/product/9332518645/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=9332518645&amp;linkId=284dc1e6677bb43997fac393456bbc70">Algorithm Design</a></h4> <p>Same comments as for CLRS -- it's widely recommended as an introductory book even though it doesn't make sense as an introductory book.
Personally, I found the exposition in Kleinberg to be much easier to follow than in CLRS, but plenty of people find the opposite.</p> <h4 id="demaine-advanced-data-structures-http-courses-csail-mit-edu-6-851-spring14">Demaine; <a href="http://courses.csail.mit.edu/6.851/spring14/">Advanced Data Structures</a></h4> <p>This is a set of lectures and notes and not a book, but if you want a coherent (but not intractably comprehensive) set of material on data structures that you're unlikely to see in most undergraduate courses, this is great. The notes aren't designed to be standalone, so you'll want to watch the videos if you haven't already seen this material.</p> <h4 id="okasaki-purely-functional-data-structures-https-www-amazon-com-gp-product-0521663504-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0521663504-linkid-2638d1c24b973d9ddf71f9968a309fea">Okasaki; <a href="https://www.amazon.com/gp/product/0521663504/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0521663504&amp;linkId=2638d1c24b973d9ddf71f9968a309fea">Purely Functional Data Structures</a></h4> <p>Fun to work through, but, unlike the other algorithms and data structures books, I've yet to be able to apply anything from this book to a problem domain where performance really matters.</p> <p>For a couple years after I read this, when someone would tell me that it's not that hard to reason about the performance of purely functional lazy data structures, I'd ask them about part of a proof that stumped me in this book. I'm not talking about some obscure super hard exercise, either. I'm talking about something that's in the main body of the text that was considered too obvious to the author to explain. No one could explain it. Reasoning about this kind of thing is harder than people often claim.</p> <h4 id="dominus-higher-order-perl-http-hop-perl-plover-com-book">Dominus; <a href="http://hop.perl.plover.com/book/">Higher Order Perl</a></h4> <p>A gentle introduction to functional programming that happens to use Perl. You could probably work through this book just as easily in Python or Ruby.</p> <p>If you keep up with what's trendy, this book might seem a bit dated today, but only because so many of the ideas have become mainstream. 
If you're wondering why you should care about this &quot;functional programming&quot; thing people keep talking about, and some of the slogans you hear don't speak to you or are even off-putting (types are propositions, it's great because it's math, etc.), give this book a chance.</p> <h4 id="levitin-algorithms-https-www-amazon-com-gp-product-0132316811-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0132316811-linkid-26aa7611b1a13ec1c73c2ffe0512a777">Levitin; <a href="https://www.amazon.com/gp/product/0132316811/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0132316811&amp;linkId=26aa7611b1a13ec1c73c2ffe0512a777">Algorithms</a></h4> <p>I ordered this off amazon after seeing these two blurbs: “Other learning-enhancement features include chapter summaries, hints to the exercises, and a detailed solution manual.” and “Student learning is further supported by exercise hints and chapter summaries.” One of these blurbs is even printed on the book itself, but after getting the book, the only self-study resources I could find were some yahoo answers posts asking where you could find hints or solutions.</p> <p>I ended up picking up Dasgupta instead, which was available off an author's website for free.</p> <h4 id="mitzenmacher-upfal-probability-and-computing-randomized-algorithms-and-probabilistic-analysis-https-www-amazon-com-gp-product-0521835402-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0521835402-linkid-d65b4aa027a011a1a5875776aee9b9d7">Mitzenmacher &amp; Upfal; Probability and Computing: <a href="https://www.amazon.com/gp/product/0521835402/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0521835402&amp;linkId=d65b4aa027a011a1a5875776aee9b9d7">Randomized Algorithms and Probabilistic Analysis</a></h4> <p>I've probably gotten more mileage out of this than out of any other algorithms book. A lot of randomized algorithms are <a href="http://danluu.com/2choices-eviction/">trivial to port to other applications</a> and can simplify things a lot.</p> <p>The text has enough of an intro to probability that you don't need to have any probability background. Also, the material on tail bounds (e.g., Chernoff bounds) is useful for a lot of CS theory proofs and isn't covered in the intro probability texts I've seen.</p>
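<p>To give a quick taste of why this material ports so easily (my own toy sketch of the classic “power of two choices” result, not code from the book): assigning each item to the less loaded of two randomly chosen bins keeps the maximum load far lower than picking a single random bin, and a few lines of simulation are enough to see the effect.</p> <pre><code># Toy simulation of the 'power of two choices' load balancing result
# (an illustrative sketch, not code from Mitzenmacher and Upfal).
import random

def max_load(n_items, n_bins, choices):
    bins = [0] * n_bins
    for _ in range(n_items):
        # Sample `choices` candidate bins and put the item in the least loaded one.
        candidates = [random.randrange(n_bins) for _ in range(choices)]
        best = min(candidates, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

random.seed(0)
n = 100_000
print('one random choice: ', max_load(n, n, choices=1))   # max load grows like log n / log log n
print('best of two choices:', max_load(n, n, choices=2))  # max load grows like log log n
</code></pre>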
<h4 id="sipser-introduction-to-the-theory-of-computation-https-www-amazon-com-gp-product-113318779x-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-113318779x-linkid-aaee0c6fd8df18a1372efdb8295bb426">Sipser; <a href="https://www.amazon.com/gp/product/113318779X/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=113318779X&amp;linkId=aaee0c6fd8df18a1372efdb8295bb426">Introduction to the Theory of Computation</a></h4> <p>Classic intro to theory of computation. Turing machines, etc. Proofs are often given at an intuitive, “proof sketch”, level of detail. A lot of important results (e.g., Rice's Theorem) are pushed into the exercises, so you really have to do the key exercises. Unfortunately, most of the key exercises don't have solutions, so you can't check your work.</p> <p>For something with a more modern topic selection, maybe see <a href="http://theory.cs.princeton.edu/complexity/">Arora &amp; Barak</a>.</p> <h4 id="bernhardt-computation-https-www-destroyallsoftware-com-screencasts-catalog-introduction-to-computation">Bernhardt; <a href="https://www.destroyallsoftware.com/screencasts/catalog/introduction-to-computation">Computation</a></h4> <p>Covers a few theory of computation highlights. The explanations are delightful and I've watched some of the videos more than once just to watch Bernhardt explain things. Targeted at a general programmer audience with no background in CS.</p> <h4 id="kearns-vazirani-an-introduction-to-computational-learning-theory-https-www-amazon-com-gp-product-0262111934-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0262111934-linkid-373c1d6780793c4552728116254da17c">Kearns &amp; Vazirani; <a href="https://www.amazon.com/gp/product/0262111934/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0262111934&amp;linkId=373c1d6780793c4552728116254da17c">An Introduction to Computational Learning Theory</a></h4> <p>Classic, but dated and riddled with errors, with no errata available. When I wanted to learn this material, I ended up cobbling together notes from a couple of courses, one by <a href="http://www.cs.utexas.edu/~klivans/cl.html">Klivans</a> and one by <a href="https://www.cs.cmu.edu/~avrim/ML14/">Blum</a>.</p> <h3 id="operating-systems">Operating Systems</h3> <p>Why should you care? Having a bit of knowledge about operating systems can save days or weeks of debugging time. This is a regular theme on <a href="http://jvns.ca/">Julia Evans's blog</a>, and I've found the same thing to be true of my experience. I'm hard pressed to think of anyone who builds practical systems and knows a bit about operating systems who hasn't found their operating systems knowledge to be a time saver. However, there's a bias in who reads operating systems books -- it tends to be people who do related work! It's possible you won't get the same thing out of reading these if you do really high-level stuff.</p> <h4 id="silberchatz-galvin-and-gagne-operating-system-concepts-https-www-amazon-com-gp-product-1118063333-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1118063333-linkid-c60719f7dd6b3355d7f8535a25d74f07">Silberschatz, Galvin, and Gagne; <a href="https://www.amazon.com/gp/product/1118063333/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1118063333&amp;linkId=c60719f7dd6b3355d7f8535a25d74f07">Operating System Concepts</a></h4> <p>This was what we used at Wisconsin before <a href="http://pages.cs.wisc.edu/~remzi/OSTEP/">the comet book</a> became standard. I guess it's ok. It covers concepts at a high level and hits the major points, but it's lacking in technical depth, details on how things work, advanced topics, and clear exposition.</p> <h4 id="cox-kasshoek-and-morris-xv6-https-pdos-csail-mit-edu-6-828-2014-xv6-book-rev8-pdf">Cox, Kaashoek, and Morris; <a href="https://pdos.csail.mit.edu/6.828/2014/xv6/book-rev8.pdf">xv6</a></h4> <p>This book is great! It explains how you can actually implement things in a real system, and it <a href="https://en.wikipedia.org/wiki/Xv6">comes with its own implementation of an OS that you can play with</a>.
By design, the authors favor simple implementations over optimized ones, so the algorithms and data structures used are often quite different from what you see in production systems.</p> <p>This book goes well when paired with a book that talks about how more modern operating systems work, like Love's Linux Kernel Development or Russinovich's Windows Internals.</p> <h4 id="arpaci-dusseau-and-arpaci-dusseau-operating-systems-three-easy-pieces-https-www-amazon-com-gp-product-b0722mjycb-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-b0722mjycb-linkid-ed01591ad6b853092462c17fb987f11e">Arpaci-Dusseau and Arpaci-Dusseau; <a href="https://www.amazon.com/gp/product/B0722MJYCB/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B0722MJYCB&amp;linkId=ed01591ad6b853092462c17fb987f11e">Operating Systems: Three Easy Pieces</a></h4> <p>Nice explanation of a variety of OS topics. Goes into much more detail than any other intro OS book I know of. For example, the chapters on file systems describe the details of multiple real filesystems and discuss the major implementation features of <code>ext4</code>. If I have one criticism about the book, it's that it's very *nix focused. Many things that are described are simply how things are done in *nix and not inherent, but the text mostly doesn't say when something is inherent vs. when it's a *nix implementation detail.</p> <h4 id="love-linux-kernel-development-https-www-amazon-com-gp-product-0672329468-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0672329468-linkid-e20733e35b36d9e4fb51b5db3b74058d">Love; <a href="https://www.amazon.com/gp/product/0672329468/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0672329468&amp;linkId=e20733e35b36d9e4fb51b5db3b74058d">Linux Kernel Development</a></h4> <p>The title can be a bit misleading -- this is basically a book about how the Linux kernel works: how things fit together, what algorithms and data structures are used, etc. I read the 2nd edition, which is now quite dated. The 3rd edition has some updates, but <a href="https://lwn.net/Articles/419855/">introduced some errors and inconsistencies</a>, and is still dated (it was published in 2010, and covers 2.6.34). Even so, it's a nice introduction to how a relatively modern operating system works.</p> <p>The other downside of this book is that the author loses all objectivity any time Linux and Windows are compared. Basically every time they're compared, the author says that Linux has clearly and incontrovertibly made the right choice and that Windows is doing something stupid. On balance, I prefer Linux to Windows, but there are a number of areas where Windows is superior, as well as areas where there's parity but Windows was ahead for years.
You'll never find out what they are from this book, though.</p> <h4 id="russinovich-solomon-and-ionescu-windows-internals-https-www-amazon-com-gp-product-0735648735-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0735648735-linkid-0d5e22fbe4c2130f4f19196ef2082f57">Russinovich, Solomon, and Ionescu; <a href="https://www.amazon.com/gp/product/0735648735/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0735648735&amp;linkId=0d5e22fbe4c2130f4f19196ef2082f57">Windows Internals</a></h4> <p>The most comprehensive book about how a modern operating system works. It just happens to be about Windows. Coming from a *nix background, I found this interesting to read just to see the differences.</p> <p>This is definitely not an intro book, and you should have some knowledge of operating systems before reading this. If you're going to buy a physical copy of this book, you might want to wait until the 7th edition is released (early in 2017).</p> <h4 id="downey-the-little-book-of-semaphores-http-greenteapress-com-wp-semaphores">Downey; <a href="http://greenteapress.com/wp/semaphores/">The Little Book of Semaphores</a></h4> <p>Takes a topic that's normally one or two sections in an operating systems textbook and turns it into its own 300 page book. The book is a series of exercises, a bit like The Little Schemer, but with more exposition. It starts by explaining what a semaphore is, and then has a series of exercises that build up higher level concurrency primitives.</p> <p>This book was very helpful when I first started to write threading/concurrency code. I subscribe to the <a href="http://danluu.com/butler-lampson-1999/">Butler Lampson school of concurrency</a>, which is to say that I prefer to have all the concurrency-related code stuffed into a black box that someone else writes. But sometimes you're stuck writing the black box, and if so, this book has a nice introduction to the style of thinking required to write maybe possibly not totally wrong concurrent code.</p> <p>I wish someone would write a book in this style, but both lower level and higher level. I'd love to see exercises like this, but starting with instruction-level primitives for a couple different architectures with different memory models (say, x86 and Alpha) instead of semaphores. If I'm writing grungy low-level threading code today, I'm overwhelmingly likely to be using <code>c++11</code> threading primitives, so I'd like something that uses those instead of semaphores, which I might have used if I was writing threading code against the <code>Win32</code> API. But since that book doesn't exist, this seems like the next best thing.</p> <p>I've heard that Doug Lea's <a href="https://www.amazon.com/gp/product/0201310090/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201310090&amp;linkId=72b21afe79c9b254cd4383310cff1e42">Concurrent Programming in Java</a> is also quite good, but I've only taken a quick look at it.</p>
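<p>To give a sense of the flavor of the semaphore exercises, here's the kind of thing you end up building early on: a one-shot barrier for n threads made out of nothing but counting semaphores and a counter (this sketch is mine, not code from Downey's book).</p> <pre><code># A one-shot barrier for n threads built from counting semaphores,
# in the spirit of the early exercises in The Little Book of Semaphores
# (my sketch, not code from the book).
import threading

class Barrier:
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.mutex = threading.Semaphore(1)      # protects count
        self.turnstile = threading.Semaphore(0)  # holds threads until everyone has arrived

    def wait(self):
        self.mutex.acquire()
        self.count += 1
        arrived_last = (self.count == self.n)
        self.mutex.release()
        if arrived_last:
            # The last thread to arrive releases everyone, including itself.
            for _ in range(self.n):
                self.turnstile.release()
        self.turnstile.acquire()

def worker(barrier, i):
    print('thread', i, 'before the barrier')
    barrier.wait()
    print('thread', i, 'after the barrier')

barrier = Barrier(4)
threads = [threading.Thread(target=worker, args=(barrier, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</code></pre> <p>Getting even something this small right, and then noticing that this version can't safely be reused for a second round without more machinery, is a good warmup for the style of thinking the later exercises demand.</p>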
<h3 id="computer-architecture">Computer architecture</h3> <p>Why should you care? The specific facts and trivia you'll learn will be useful when you're doing low-level performance optimizations, but the real value is learning how to reason about tradeoffs between performance and other factors, whether that's power, cost, size, weight, or something else.</p> <p>In theory, that kind of reasoning should be taught regardless of specialization, but my experience is that comp arch folks are much more likely to “get” that kind of reasoning and do back of the envelope calculations that will save them from throwing away a 2x or 10x (or 100x) factor in performance for no reason. This sounds obvious, but I can think of multiple production systems at large companies that are giving up 10x to 100x in performance which are operating at a scale where even a 2x difference in performance could pay a VP's salary -- all because people didn't think through the performance implications of their design.</p> <h4 id="hennessy-patterson-computer-architecture-a-quantitative-approach">Hennessy &amp; Patterson; Computer Architecture: A Quantitative Approach</h4> <p>This book teaches you how to do systems design with multiple constraints (e.g., performance, TCO, and power) and how to reason about tradeoffs. It happens to mostly do so using microprocessors and supercomputers as examples.</p> <p>New editions of this book have substantive additions and you really want the latest version. For example, the latest version added, among other things, a chapter on data center design, and it answers questions like, <a href="//danluu.com/datacenter-power/">how much opex/capex is spent on power, power distribution, and cooling, and how much is spent on support staff and machines</a>, what's the effect of using lower power machines on tail latency and result quality (bing search results are used as an example), and what other factors should you consider when designing a data center.</p> <p>Assumes some background, but that background is presented in the appendices (which are available online for free).</p> <h4 id="shen-lipasti-modern-processor-design-https-www-amazon-com-gp-product-1478607831-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1478607831-linkid-415aef1c0fccd4f6b5f931dbd63a1852">Shen &amp; Lipasti; <a href="https://www.amazon.com/gp/product/1478607831/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1478607831&amp;linkId=415aef1c0fccd4f6b5f931dbd63a1852">Modern Processor Design</a></h4> <p>Presents most of what you need to know to architect a high performance Pentium Pro (1995) era microprocessor. That's no mean feat, considering the complexity involved in such a processor. Additionally, presents some more advanced ideas and bounds on how much parallelism can be extracted from various workloads (and how you might go about doing such a calculation).
Has an unusually large section on <a href="http://pharm.ece.wisc.edu/mikko/oldpapers/asplos7.pdf">value prediction</a>, because the authors invented the concept and it was still hot when the first edition was published.</p> <p>For pure CPU architecture, this is probably the best book available.</p> <h4 id="hill-jouppi-and-sohi-readings-in-computer-architecture-https-www-amazon-com-gp-product-1558605398-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1558605398-linkid-c066a06415728c60470dbee9a41c79fa">Hill, Jouppi, and Sohi; <a href="https://www.amazon.com/gp/product/1558605398/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1558605398&amp;linkId=c066a06415728c60470dbee9a41c79fa">Readings in Computer Architecture</a></h4> <p>Read for historical reasons and to see how much better we've gotten at explaining things. For example, compare Amdahl's paper on Amdahl's law (two pages, with a single non-obvious graph presented, and no formulas), vs. the presentation in a modern textbook (one paragraph, one formula, and maybe one graph to clarify, although it's usually clear enough that no extra graph is needed).</p> <p>This seems to be worse the further back you go; since comp arch is a relatively young field, nothing here is really hard to understand. If you want to see a dramatic example of how we've gotten better at explaining things, compare Maxwell's original paper on Maxwell's equations to a modern treatment of the same material. Fun if you like history, but a bit of a slog if you're just trying to learn something.</p> <h3 id="algorithmic-game-theory-auction-theory-mechanism-design">Algorithmic game theory / auction theory / mechanism design</h3> <p>Why should you care? Some of the world's biggest tech companies run on ad revenue, and those ads are sold through auctions. This field explains how and why they work. Additionally, this material is useful any time you're trying to figure out how to design systems that allocate resources effectively.<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup></p> <p>In particular, incentive compatible mechanism design (roughly, how to create systems that provide globally optimal outcomes when people behave in their own selfish best interest) should be required reading for anyone who designs internal incentive systems at companies. If you've ever worked at a large company that &quot;gets&quot; this and one that doesn't, you'll see that the company that doesn't get it has giant piles of money that are basically being lit on fire because the people who set up incentives created systems that are hugely wasteful. This field gives you the background to understand what sorts of mechanisms give you what sorts of outcomes; reading case studies gives you a very long (and entertaining) list of mistakes that can cost millions or even billions of dollars.</p>
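<p>As a tiny illustration of what incentive compatibility means here (my own sketch, not an example from the books below): in a sealed-bid second-price auction the winner pays the second-highest bid, and you can check by brute force that bidding your true value is always at least as good as any other bid, no matter what the other bidders do.</p> <pre><code># Sealed-bid second-price (Vickrey) auction: the highest bidder wins and pays
# the second-highest bid. Brute-force check that truthful bidding is a dominant
# strategy for one bidder (illustrative sketch with made-up integer values).
import itertools

def payoff(my_value, my_bid, other_bids):
    bids = [my_bid] + list(other_bids)
    top = max(bids)
    if my_bid == top and bids.count(top) == 1:   # win only on a strictly highest bid
        second_price = sorted(bids)[-2]
        return my_value - second_price           # winner pays the second-highest bid
    return 0                                     # losers pay nothing

my_value = 5
possible_bids = range(0, 11)
for other_bids in itertools.product(possible_bids, repeat=2):
    truthful = payoff(my_value, my_value, other_bids)
    best_alternative = max(payoff(my_value, b, other_bids) for b in possible_bids)
    assert truthful == best_alternative   # no deviation ever beats bidding my_value
print('bidding the true value was optimal against every combination of rival bids')
</code></pre>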
<h4 id="krishna-auction-theory-https-www-amazon-com-gp-product-0123745071-ref-as-li-tl-ie-utf8-camp-1789-creative-9325-creativeasin-0123745071-linkcode-as2-tag-abroaview-20-linkid-cc4354c8796c223d9612f5dc12e7afd8">Krishna; <a href="https://www.amazon.com/gp/product/0123745071/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0123745071&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=cc4354c8796c223d9612f5dc12e7afd8">Auction Theory</a></h4> <p>The last time I looked, this was the only game in town for a comprehensive, modern introduction to auction theory. Covers the classic <a href="https://en.wikipedia.org/wiki/Vickrey_auction">second price auction result</a> in the first chapter, and then moves on to cover risk aversion, bidding rings, interdependent values, multiple auctions, asymmetrical information, and other real-world issues.</p> <p>Relatively dry. Unlikely to be motivating unless you're already interested in the topic. Requires an understanding of basic probability and calculus.</p> <h4 id="steighlitz-snipers-shills-and-sharks-ebay-and-human-behavior-https-www-amazon-com-gp-product-0691127131-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0691127131-linkid-a0ed27b464ee65f5551f60e3a796cf26">Steiglitz; <a href="https://www.amazon.com/gp/product/0691127131/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0691127131&amp;linkId=a0ed27b464ee65f5551f60e3a796cf26">Snipers, Shills, and Sharks: eBay and Human Behavior</a></h4> <p>Seems designed as an entertaining introduction to auction theory for the layperson. Requires no mathematical background and relegates math to the small print. Covers maybe 1/10th of the material of Krishna, if that. Fun read.</p> <h4 id="crampton-shoham-and-steinberg-combinatorial-auctions-ftp-cramton-umd-edu-ca-book-cramton-shoham-steinberg-combinatorial-auctions-pdf">Cramton, Shoham, and Steinberg; <a href="ftp://cramton.umd.edu/ca-book/cramton-shoham-steinberg-combinatorial-auctions.pdf">Combinatorial Auctions</a></h4> <p>Discusses things like how FCC spectrum auctions got to be the way they are and how “bugs” in mechanism design can leave hundreds of millions or billions of dollars on the table. This is one of those books where each chapter is by a different author. Despite that, it still manages to be coherent and I didn't mind reading it straight through. It's self-contained enough that you could probably read this without reading Krishna first, but I wouldn't recommend it.</p> <h4 id="shoham-and-leyton-brown-multiagent-systems-algorithmic-game-theoretic-and-logical-foundations-http-www-masfoundations-org">Shoham and Leyton-Brown; <a href="http://www.masfoundations.org/">Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations</a></h4> <p>The title is the worst thing about this book. Otherwise, it's a nice introduction to algorithmic game theory. The book covers basic game theory, auction theory, and other classic topics that CS folks might not already know, and then covers the intersection of CS with these topics. Assumes no particular background in the topic.</p> <h4 id="nisan-roughgarden-tardos-and-vazirani-algorithmic-game-theory-https-www-amazon-com-gp-product-0521872820-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0521872820-linkid-7caf296d20541c2fc320dd423ecff19a">Nisan, Roughgarden, Tardos, and Vazirani; <a href="https://www.amazon.com/gp/product/0521872820/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0521872820&amp;linkId=7caf296d20541c2fc320dd423ecff19a">Algorithmic Game Theory</a></h4> <p>A survey of various results in algorithmic game theory. Requires a fair amount of background (consider reading Shoham and Leyton-Brown first). For example, chapter five is basically Devanur, Papadimitriou, Saberi, and Vazirani's JACM paper, <em>Market Equilibrium via a Primal-Dual Algorithm for a Convex Program</em>, with a bit more motivation and some related problems thrown in.
The exposition is good and the result is interesting (if you're into that kind of thing), but it's not necessarily what you want if you want to read a book straight through and get an introduction to the field.</p> <h3 id="misc">Misc</h3> <h4 id="beyer-jones-petoff-and-murphy-site-reliability-engineering-https-www-amazon-com-gp-product-149192912x-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-149192912x-linkid-3bfad60d002c35ba1809816e159efd77">Beyer, Jones, Petoff, and Murphy; <a href="https://www.amazon.com/gp/product/149192912X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=149192912X&amp;linkId=3bfad60d002c35ba1809816e159efd77">Site Reliability Engineering</a></h4> <p>A description of how Google handles operations. Has the typical Google tone, which is off-putting to a lot of folks with a “traditional” ops background, and assumes that many things can only be done with the SRE model when they can, in fact, be done without going full SRE.</p> <p>For a much longer description, <a href="//danluu.com/google-sre-book/">see this 22 page set of notes on Google's SRE book</a>.</p> <h4 id="fowler-beck-brant-opdyke-and-roberts-refactoring-https-www-amazon-com-gp-product-0201485672-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0201485672-linkid-72ba04417a765056c7e41111361dcfba">Fowler, Beck, Brant, Opdyke, and Roberts; <a href="https://www.amazon.com/gp/product/0201485672/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201485672&amp;linkId=72ba04417a765056c7e41111361dcfba">Refactoring</a></h4> <p>At the time I read it, it was worth the price of admission for the section on code smells alone. But this book has been so successful that the ideas of refactoring and code smells have become mainstream.</p> <p>Steve Yegge has <a href="https://sites.google.com/site/steveyegge2/ten-great-books">a great pitch for this book</a>:</p> <blockquote> <p>When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: &quot;Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course.&quot; Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.</p> <p>...</p> <p>If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? 
Read it and decide for yourself!</p> </blockquote> <h4 id="demarco-lister-peopleware-https-www-amazon-com-gp-product-0321934113-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0321934113-linkid-37e2eb92615c6926852e31fa057c3df5">DeMarco &amp; Lister, <a href="https://www.amazon.com/gp/product/0321934113/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0321934113&amp;linkId=37e2eb92615c6926852e31fa057c3df5">Peopleware</a></h4> <p>This book seemed convincing when I read it in college. It even had <a href="//danluu.com/dunning-kruger/">all sorts of studies</a> backing up what they said. No deadlines is better than having deadlines. Offices are better than cubicles. Basically all devs I talk to agree with this stuff.</p> <p>But virtually every successful company is run the opposite way. Even Microsoft is remodeling buildings from individual offices to open plan layouts. Could it be that all of this stuff just doesn't matter that much? If it really is that important, how come companies that are true believers, like Fog Creek, aren't running roughshod over their competitors?</p> <p>This book agrees with my biases and I'd love for this book to be right, but the meta evidence makes me want to re-read this with a critical eye and look up primary sources.</p> <h4 id="drummond-renegades-of-the-empire-https-www-amazon-com-gp-product-0609604163-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0609604163-linkid-367bcc7edf76352c97b6050c04af4aaa">Drummond; <a href="https://www.amazon.com/gp/product/0609604163/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0609604163&amp;linkId=367bcc7edf76352c97b6050c04af4aaa">Renegades of the Empire</a></h4> <p>This book explains how Microsoft's aggressive culture got to be the way it is today. The intro reads:</p> <blockquote> <p>Microsoft didn't necessarily hire clones of Gates (although there were plenty on the corporate campus) so much as recruit those who shared some of Gates's more notable traits -- arrogance, aggressiveness, and high intelligence.</p> <p>…</p> <p>Gates is infamous for ridiculing someone's idea as “stupid”, or worse, “random”, just to see how he or she defends a position. This hostile managerial technique invariably spread through the chain of command and created a culture of conflict.</p> <p>…</p> <p>Microsoft nurtures a Darwinian order where resources are often plundered and hoarded for power, wealth, and prestige. A manager who leaves on vacation might return to find his turf raided by a rival and his project put under a different command or canceled altogether</p> </blockquote> <p>On interviewing at Microsoft:</p> <blockquote> <p>“What do you like about Microsoft?” “Bill kicks ass”, St. John said. “I like kicking ass. I enjoy the feeling of killing competitors and dominating markets”.</p> <p>…</p> <p>He was unsure how he was doing and thought he stumbled when asked if he was a &quot;people person&quot;. &quot;No, I think most people are idiots&quot;, St. John replied.</p> </blockquote> <p>These answers were exactly what Microsoft was looking for.
They resulted in a strong offer and an aggressive courtship.</p> <p>On developer evangelism at Microsoft:</p> <blockquote> <p>At one time, Microsoft evangelists were also usually chartered with disrupting competitors by showing up at their conferences, securing positions on and then tangling standards committees, and trying to influence the media.</p> <p>…</p> <p>&quot;We're the group at Microsoft whose job is to fuck Microsoft's competitors&quot;</p> </blockquote> <p>Read this book if you're considering a job at Microsoft. Although it's been a long time since the events described in this book, you can still see strains of this culture in Microsoft today.</p> <h4 id="bilton-hatching-twitter-https-www-amazon-com-gp-product-1591847087-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1591847087-linkid-19754f59bce55c3f588419feffb68c87">Bilton; <a href="https://www.amazon.com/gp/product/1591847087/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1591847087&amp;linkId=19754f59bce55c3f588419feffb68c87">Hatching Twitter</a></h4> <p>An entertaining book about the backstabbing, mismanagement, and random firings that happened in Twitter's early days. When I say random, I mean that there were instances where critical engineers were allegedly fired so that the &quot;decider&quot; could show other important people that current management was still in charge.</p> <p>I don't know folks who were at Twitter back then, but I know plenty of folks who were at the next generation of startups in their early days and there are a couple of companies where people had eerily similar experiences. Read this book if you're considering a job at a trendy startup.</p> <h4 id="galenson-old-masters-and-young-geniuses-https-www-amazon-com-gp-product-0691133808-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0691133808-linkid-9c6bc8a67b097a8f79bc4b10e0243fa0">Galenson; <a href="https://www.amazon.com/gp/product/0691133808/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0691133808&amp;linkId=9c6bc8a67b097a8f79bc4b10e0243fa0">Old Masters and Young Geniuses</a></h4> <p>This book is about art and how productivity changes with age, but if its thesis is valid, it probably also applies to programming. Galenson applies statistics to determine the &quot;greatness&quot; of art and then uses that to draw conclusions about how the productivity of artists changes as they age. I don't have time to go over the data in detail, so I'll have to remain skeptical of this until I have more free time, but I think it's interesting reading even for a skeptic.</p> <h3 id="math">Math</h3> <p>Why should you care? From a pure ROI perspective, I doubt learning math is “worth it” for 99% of jobs out there. AFAICT, I use math more often than most programmers, and I don't use it all that often. But having the right math background sometimes comes in handy and I really enjoy learning math.
YMMV.</p> <h4 id="bertsekas-introduction-to-probability-https-www-amazon-com-gp-product-188652923x-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-188652923x-linkid-31f59f124c2c8a7ee7917b01c7fbed52">Bertsekas; <a href="https://www.amazon.com/gp/product/188652923X/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=188652923X&amp;linkId=31f59f124c2c8a7ee7917b01c7fbed52">Introduction to Probability</a></h4> <p>Introductory undergrad text that tends towards intuitive explanations over epsilon-delta rigor. For anyone who cares to do more rigorous derivations, there are some exercises at the back of the book that go into more detail.</p> <p>Has many exercises with available solutions, making this a good text for self-study.</p> <h4 id="ross-a-first-course-in-probability-https-www-amazon-com-gp-product-013603313x-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-013603313x-linkid-ba9a3635504a6aadc9f40d3faf3c8785">Ross; <a href="https://www.amazon.com/gp/product/013603313X/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=013603313X&amp;linkId=ba9a3635504a6aadc9f40d3faf3c8785">A First Course in Probability</a></h4> <p>This is one of those books where they regularly crank out new editions to make students pay for new copies of the book (this is presently priced at a whopping $174 on Amazon)<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">2</a></sup>. This was the standard text when I took probability at Wisconsin, and I literally cannot think of a single person who found it helpful. Avoid.</p> <h4 id="brualdi-introductory-combinatorics-https-www-amazon-com-gp-product-0136020402-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0136020402-linkid-9910ad28e52b0c86afa82f449644458a">Brualdi; <a href="https://www.amazon.com/gp/product/0136020402/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0136020402&amp;linkId=9910ad28e52b0c86afa82f449644458a">Introductory Combinatorics</a></h4> <p>Brualdi is a great lecturer, one of the best I had in undergrad, but this book was full of errors and not particularly clear. There have been two new editions since I used this book, but according to the Amazon reviews the book still has a lot of errors.</p> <p>For an alternate introductory text, I've heard good things about <a href="https://www.amazon.com/gp/product/B00F5QU8W4/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00F5QU8W4&amp;linkId=77558d398300005cc19a20780da38822">Camina &amp; Lewis's book</a>, but I haven't read it myself. 
Also, <a href="http://www.amazon.com/Combinatorial-Problems-Exercises-Chelsea-Publishing/dp/0821842625">Lovasz</a> is a great book on combinatorics, but it's not exactly introductory.</p> <h4 id="apostol-calculus-https-www-amazon-com-gp-product-0471000051-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0471000051-linkid-cf7635e3dcc9e368a8ca72ee7a8af3a0">Apostol; <a href="https://www.amazon.com/gp/product/0471000051/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0471000051&amp;linkId=cf7635e3dcc9e368a8ca72ee7a8af3a0">Calculus</a></h4> <p>Volume 1 covers what you'd expect in a calculus I + calculus II book. Volume 2 covers linear algebra and multivariable calculus. It covers linear algebra before multivariable calculus, which makes multivariable calculus a lot easier to understand.</p> <p>It also makes a lot of sense from a programming standpoint, since a lot of the value I get out of calculus is its applications to approximations, etc., and that's a lot clearer when taught in this sequence.</p> <p>This book is probably a rough intro if you don't have a professor or TA to help you along. The Springer SUMS series tends to be pretty good for self-study introductions to various areas, but I haven't actually read their intro calculus book so I can't actually recommend it.</p> <h4 id="stewart-calculus-https-www-amazon-com-gp-product-0538497815-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0538497815-linkid-e25774196137c043bb638fdf39bcf473">Stewart; <a href="https://www.amazon.com/gp/product/0538497815/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0538497815&amp;linkId=e25774196137c043bb638fdf39bcf473">Calculus</a></h4> <p>Another one of those books where they crank out new editions with trivial changes to make money. This was the standard text for non-honors calculus at Wisconsin, and the result of that was that I taught a lot of people to do complex integrals with the methods covered in Apostol, which are much more intuitive to many folks.</p> <p>This book takes the approach that, for a type of problem, you should pattern match to one of many possible formulas and then apply the formula. Apostol is more about teaching you a few tricks and some intuition that you can apply to a wide variety of problems. I'm not sure why you'd buy this unless you were required to for some class.</p> <h3 id="hardware-basics">Hardware basics</h3> <p>Why should you care? People often claim that, <a href="https://news.ycombinator.com/item?id=12095869">to be a good programmer, you have to understand every abstraction you use</a>. That's nonsense. Modern computing is too complicated for any human to have a real full-stack understanding of what's going on. In fact, one reason modern computing can accomplish what it does is that it's possible to be productive without having a deep understanding of much of the stack that sits below the level you're operating at.</p> <p>That being said, if you're curious about what sits below software, here are a few books that will get you started.</p> <h4 id="nisan-shocken-nand2tetris-http-www-nand2tetris-org">Nisan &amp; Schocken; <a href="http://www.nand2tetris.org/">nand2tetris</a></h4> <p>If you only want to read one single thing, this should probably be it. It's a “101” level intro that goes down to gates and Boolean logic.
As implied by the name, it takes you from <a href="https://en.wikipedia.org/wiki/NAND_gate">NAND gates</a> to a working Tetris program.</p> <h4 id="roth-fundamentals-of-logic-design-https-www-amazon-com-gp-product-0534378048-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0534378048-linkid-c60cb3be0ffa6e190224ffce81ece8f8">Roth; <a href="https://www.amazon.com/gp/product/0534378048/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0534378048&amp;linkId=c60cb3be0ffa6e190224ffce81ece8f8">Fundamentals of Logic Design</a></h4> <p>Much more detail on gates and logic design than you'll see in nand2tetris. The book is full of exercises and appears to be designed to work for self-study. Note that the link above is to the 5th edition. There are newer editions, but they don't seem to be much improved, they have a lot of errors in the new material, and they're much more expensive.</p> <h4 id="weste-harris-and-bannerjee-cmos-vlsi-design-https-www-amazon-com-gp-product-0321547748-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0321547748-linkid-6019b50b8c3474e0939bd14e6eabf6bb">Weste, Harris, and Bannerjee; <a href="https://www.amazon.com/gp/product/0321547748/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0321547748&amp;linkId=6019b50b8c3474e0939bd14e6eabf6bb">CMOS VLSI Design</a></h4> <p>One level below Boolean gates, you get to VLSI, a historical acronym (very large scale integration) that doesn't really have any meaning today.</p> <p>Broader and deeper than the alternatives, with clear exposition. Explores the design space (e.g., the section on adders doesn't just mention a few different types in an ad hoc way, it explores all the tradeoffs you can make). It also has both problems and solutions, which makes it great for self-study.</p> <h4 id="kang-leblebici-cmos-digital-integrated-circuits">Kang &amp; Leblebici; CMOS Digital Integrated Circuits</h4> <p>This was the standard text at Wisconsin way back in the day. It was hard enough to follow that the TA basically re-explained pretty much everything necessary for the projects and the exams. I find that it's ok as a reference, but it wasn't a great book to learn from.</p> <p>Compared to this book, Weste et al. spend a lot more effort talking about tradeoffs in design (e.g., when creating a parallel prefix tree adder, what does it really mean to be at some particular point in the design space?).</p> <h4 id="pierret-semiconductor-device-fundamentals-https-www-amazon-com-gp-product-0201543931-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0201543931-linkid-bb2f39f1f5776a37b62a2c7e5841d065">Pierret; <a href="https://www.amazon.com/gp/product/0201543931/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201543931&amp;linkId=bb2f39f1f5776a37b62a2c7e5841d065">Semiconductor Device Fundamentals</a></h4> <p>One level below VLSI, you have how transistors actually work.</p> <p>Really beautiful explanation of solid state devices.
The text nails the fundamentals of what you need to know to really understand this stuff (e.g., band diagrams), and then uses those fundamentals along with clear explanations to give you a good mental model of how different types of junctions and devices work.</p> <h4 id="streetman-bannerjee-solid-state-electronic-devices-https-www-amazon-com-gp-product-013149726x-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-013149726x-linkid-467c5ec3606c2fd43280ac8b81cc4c44">Streetman &amp; Bannerjee; <a href="https://www.amazon.com/gp/product/013149726X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=013149726X&amp;linkId=467c5ec3606c2fd43280ac8b81cc4c44">Solid State Electronic Devices</a></h4> <p>Covers the same material as Pierret, but seems to substitute mathematical formulas for the intuitive understanding that Pierret goes for.</p> <h4 id="ida-engineering-electromagnetics-https-www-amazon-com-gp-product-3319078054-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-3319078054-linkid-700743f63b21bf95abcaa8b4f80dee0d">Ida; <a href="https://www.amazon.com/gp/product/3319078054/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=3319078054&amp;linkId=700743f63b21bf95abcaa8b4f80dee0d">Engineering Electromagnetics</a></h4> <p>One level below transistors, you have electromagnetics.</p> <p>Two to three times thicker than other intro texts because it has more worked examples and diagrams. Breaks things down into types of problems and subproblems, making things easy to follow. For self-study, a much gentler introduction than Griffiths or Purcell.</p> <h4 id="shanley-pentium-pro-and-pentium-ii-system-architecture-https-www-amazon-com-gp-product-0201309734-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0201309734-linkid-60d8203080d940d21a7191d5195513a7">Shanley; <a href="https://www.amazon.com/gp/product/0201309734/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201309734&amp;linkId=60d8203080d940d21a7191d5195513a7">Pentium Pro and Pentium II System Architecture</a></h4> <p>Unlike the other books in this section, this book is about practice instead of theory. It's a bit like Windows Internals, in that it goes into the details of a real, working system. Topics include hardware bus protocols, how I/O actually works (e.g., <a href="https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller">APIC</a>), etc.</p> <p>The problem with a practical introduction is that there's been an exponential increase in complexity ever since the 8080. The further back you go, the easier it is to understand the most important moving parts in the system, and the more irrelevant the knowledge. This book seems like an ok compromise in that the bus and I/O protocols had to handle multiprocessors, and many of the elements that are in modern systems were in these systems, just in a simpler form.</p> <h3 id="not-covered">Not covered</h3> <p>Of the books that I've liked, I'd say this captures at most 25% of the software books and 5% of the hardware books. On average, the books that have been left off the list are more specialized.
This list is also missing many entire topic areas, like PL, practical books on how to learn languages, networking, etc.</p> <p>The reasons for leaving off topic areas vary; I don't have any PL books listed because I don't read PL books. I don't have any networking books because, although I've read a couple, I don't know enough about the area to really say how useful the books are. The vast majority of hardware books aren't included because they cover material that you wouldn't care about unless you were a specialist (e.g., <a href="https://www.amazon.com/gp/product/155860636X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=155860636X&amp;linkId=495e99808323168df92ec8d4b0b31fca">Skew-Tolerant Circuit Design</a> or <a href="https://www.amazon.com/gp/product/0471415391/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0471415391&amp;linkId=8087c8c45edba5ee615b4043388c9adf">Ultrafast Optics</a>). The same goes for areas like math and CS theory, where I left off a number of books that I think are great but have basically zero probability of being useful in my day-to-day programming life, e.g., <a href="http://www.thi.informatik.uni-frankfurt.de/~jukna/EC_Book_2nd/index.html">Extremal Combinatorics</a>. I also didn't include books I didn't read all or most of, unless I stopped because the book was atrocious. This means that I don't list classics I haven't finished like SICP and The Little Schemer, since those books seem fine and I just didn't finish them for one reason or another.</p> <p>This list also doesn't include many books on history and culture, like Inside Intel or Masters of Doom. I'll probably add more at some point, but I've been trying an experiment where I try to write more like Julia Evans (stream of consciousness, fewer or no drafts). I'd have to go back and re-read the books I read 10+ years ago to write meaningful comments, which doesn't exactly fit with the experiment. On that note, since this list is from memory and I got rid of almost all of my books a couple years ago, I'm probably forgetting a lot of books that I meant to add.</p> <p><em>If you liked this, you might also like Thomas Ptacek's <a href="https://www.amazon.com/gp/richpub/listmania/fullview/R2EN4JTQOCHNBA/ref=cm_lm_pthnk_view?ie=UTF8&amp;lm_bb=">Application Security Reading List</a> or <a href="//danluu.com/programming-blogs/">this list of programming blogs</a>, which is written in a similar style</em></p> <p><em>Thanks to @tytr_dev for comments/corrections/discussion.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B">Also, if you play board games, auction theory explains why fixing game imbalance via an auction mechanism is non-trivial and often makes the game worse. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:C">I talked to the author of one of these books. He griped that the used book market destroys revenue from textbooks after a couple years, and that authors don't get much in royalties, so you have to charge a lot of money and keep producing new editions every couple of years to make money. That griping goes double in cases where a new author picks up a classic book that someone else originally wrote, since the original author often has a much larger share of the royalties than the new author, despite doing no work on the later editions.
<a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> </ol> </div> Hiring and the market for lemons hiring-lemons/ Sun, 09 Oct 2016 02:44:14 -0700 hiring-lemons/ <p>Joel Spolsky has a classic blog post on &quot;Finding Great Developers&quot; where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they're not great. Joel writes,</p> <blockquote> <p>The great software developers, indeed, the best people in every field, are quite simply never on the market.</p> <p>The average great software developer will apply for, total, maybe, four jobs in their entire career.</p> <p>...</p> <p>If you're lucky, if you're really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they'd like to work at in Anchorage.</p> <p>But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don't send out a lot of resumes or apply for a lot of jobs.</p> <p>Does this sound like the kind of person you want to hire? It should.The corollary of that rule--the rule that the great people are never on the market--is that the bad people--the seriously unqualified--are on the market quite a lot. They get fired all the time, because they can't do their job. Their companies fail--sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure--but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.</p> <p>These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.</p> <p>Astute readers, I expect, will point out that I'm leaving out the largest group yet, the solid, competent people. They're on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.</p> </blockquote> <p>Joel's claim is basically that &quot;great&quot; developers won't have that many jobs compared to &quot;bad&quot; developers because companies will try to keep &quot;great&quot; developers. Joel also posits that companies can recognize prospective &quot;great&quot; developers easily. But these two statements are hard to reconcile. If it's so easy to identify prospective &quot;great&quot; developers, why not try to recruit them? You could just as easily make the case that &quot;great&quot; developers are overrepresented in the market because they have better opportunities and it's the &quot;bad&quot; developers who will cling to their jobs. 
This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">1</a></sup>, among other places.</p> <p>Should &quot;good&quot; developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof's Nobel prize winning 1970 paper, &quot;<a href="http://www.econ.yale.edu/~dirkb/teach/pdf/akerlof/themarketforlemons.pdf">The Market for 'Lemons': Quality Uncertainty and the Market Mechanism</a>&quot;. Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called &quot;lemons&quot;). If there's no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can't distinguish between good cars and bad cars, the price they're willing to pay is based on the average quality of cars in the market. Since owners know if their car is a lemon or not, owners of non-lemons won't sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available (there's a toy simulation of this unraveling a few paragraphs below).</p> <p>This model is certainly different from Joel's model. Joel's model assumes that &quot;great&quot; developers are sticky -- that they stay at each job for a long time. This comes from two assumptions: first, that it's easy for prospective employers to identify who's &quot;great&quot;, and second, that once someone is identified as &quot;great&quot;, their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a <em>potential</em> employee is great, you can simply go and offer them twice as much as they're currently making (something that I've seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there's no information asymmetry.</p> <p>If we put aside Joel's argument and look at the job market, there's incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It's actually quite common for prospective employers to have better information than current employers!</p> <p>Just for example, there's someone I've worked with, let's call him Bob, who's saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did <a href="http://www.paulgraham.com/submarine.html">a big PR blitz which involved seeding articles in all the usual suspects: Wired, Fortune</a>, and so on and so forth. That's worked out great for the people who are good at taking credit for things, but it hasn't worked out so well for Bob. In fact, someone else I've worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob's busy making sure the services he works on don't have global outages when they launch, but that's not the kind of thing you get credit for in Bob's org. The result of that is that Bob has a network that knows he's great, which makes it easy for him to get a job anywhere else at market rate.
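</p> <p>As an aside, here's a minimal simulation of Akerlof's unraveling argument from a few paragraphs up. This sketch is mine, with made-up numbers: car quality is uniform on [0, 100], owners only sell if the market price is at least their car's quality, and buyers value a car at 1.5x its quality but can only observe the average quality of what's actually for sale:</p> <pre><code># Toy version of Akerlof's lemons feedback loop, with made-up numbers.
qualities = [q + 0.5 for q in range(100)]        # car qualities: 0.5 .. 99.5
price = 1.5 * (sum(qualities) / len(qualities))  # naive price off the overall average

for step in range(12):
    # Owners of cars worth more than the going price withdraw from the market.
    for_sale = [q for q in qualities if q &lt;= price]
    if not for_sale:
        print("market has fully unraveled")
        break
    avg_quality = sum(for_sale) / len(for_sale)
    price = 1.5 * avg_quality  # buyers re-price based on what's actually offered
    print(f"round {step}: {len(for_sale)} cars offered, price falls to {price:.1f}")
</code></pre> <p>With these particular numbers, each round of re-pricing drives more good cars out and the market shrinks toward nothing but the worst cars; the point is only to make the feedback loop concrete, not to model the real used car (or developer) market.</p> <p>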
But Bob's management chain has no idea, and based on what I've seen of offers today, they're paying him about half what he could make elsewhere. There's no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone's productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that's notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.</p> <p>Another problem with the idea that &quot;great&quot; developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, and the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn't want to lose those two people, but it wasn't capable of creating an environment that would keep them.</p> <p>Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I'd forward them to that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you're overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it's likely to be full.</p> <p>Joel's model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.</p> <p>At the last conference I attended, I asked most people I met two questions:</p> <ol> <li>Do you know of any companies that aren't highly dysfunctional?</li> <li>Do you know of any particular teams that are great and are hiring?</li> </ol> <p>Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form &quot;I know a few people there and I haven't heard any terrible horror stories yet, plus I sometimes hear good stories&quot;, not &quot;that company is great and you should definitely work there&quot;. Most people said that they didn't know of any companies that weren't <a href="//danluu.com/wat/">a total mess</a>.</p> <p>A few people had suggestions for (2), but the most common answer was something like &quot;LOL no, if I knew that I'd go work there&quot;.
The second most common answer was of the form &quot;I know some people on the Google Brain team and it sounds great&quot;. There are a few teams that are well known for being great places to work, but they're so few and far between that it's basically impossible to get a job on one of those teams. A few people knew of actual teams that they'd strongly recommend and that were hiring, but that was rare. Much rarer than finding a developer who I'd want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually &quot;yes&quot;<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">2</a></sup>.</p> <p>Another problem with the idea that &quot;great&quot; developers are impossible to find because they join companies and then stick is that developers (and companies) aren't immutable. Because I've been lucky enough to work in environments that allow people to really flourish, I've seen a lot of people go from unremarkable to amazing. Because most companies invest pretty much nothing in helping people, you can do really well here without investing much effort.</p> <p>On the flip side, I've seen entire teams of devs go on the market because their environment changed. Just for example, I used to know <em>a lot</em> of people who worked at company X under <a href="https://www.linkedin.com/in/marcyun">Marc Yun</a>. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today's acquisition-heavy environment.</p> <p>Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It's not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What's the net effect of that? I have no idea.</p> <p>From where I'm standing, it seems really hard to find a good manager/team, and I don't know of any replicable strategy for doing so; I have a lot of sympathy for people who can't find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don't have nearly as much sympathy for hiring managers who complain that hiring &quot;great&quot; developers is impossible.</p> <p>When a hiring manager complains about hiring, in every single case I've seen so far, the hiring manager has one of the following problems:</p> <ol> <li><p>They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me <em>in the same geographic region</em>. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don't know what it is, but companies always seem to think that they pay well, even when they're not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any <em>reasonable</em> company.
Sure, there are some companies out there that pay a bit more, but they're overpaying! You can actually see this if you read Joel's writing -- back when he wrote the post I'm quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently <a href="https://twitter.com/danluu/status/784929052519837696">he's pretty much said that you don't want to work at companies that pay well</a>.</p></li> <li><p>They pass on good or even &quot;great&quot; developers<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">3</a></sup>. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why's it so hard to find them? Joel claims that out of 1000 resumes, maybe 30 people will be &quot;solid&quot; and 970 will be &quot;incompetent&quot;. It seems to me it's more like 400 will be solid and 20 will be really good. It's just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid. When people do randomized trials on what actually causes resumes to get filtered out, it often turns out that traits that are tangentially related or unrelated to job performance make huge differences. For example, <a href="http://www-2.rotman.utoronto.ca/facbios/file/RiveraTilcsik.pdf">in this study of law firm recruiting</a>, the authors found that a combination of being male and having &quot;high-class&quot; signifiers on the resume (sailing, polo, and classical music instead of track and field, pick-up soccer, and country music) with no other changes caused a 4x increase in interview invites.</p> <p>The first company I worked at, <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>, had an onsite interview process that was less stringent than the phone screen at places like Google and Facebook. If you listen to people like Joel, you'd think that Centaur was full of bozos, but after over a decade in industry (including time at Google), Centaur had the best mean and median level of developer productivity of any place I've worked.</p> <p>Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, <a href="http://danluu.com/programmer-moneyball/">bidding up the price of folks who are trendy and ignoring everyone else</a>.</p> <p>If I look at developers I know who are in high demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.</p></li> <li><p>They're trying to hire for some combination of rare skills. Right now, if you're trying to hire for someone with experience in deep learning and, well, anything else, you're going to have a bad time.</p></li> <li><p>They're much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire.
What he doesn't realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.</p> <p>That's an extreme case, but it's quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.</p></li> </ol> <p>Of these problems, (1), low pay, is both the most common and the simplest to fix.</p> <p>In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">4</a></sup>. If you believe the complaints about how hard it is to hire, you wouldn't think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they'd match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.</p> <p>Most companies don't need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.</p> <p>(2), filtering out good potential employees, seems like the modern version of &quot;no one ever got fired for hiring IBM&quot;. If you hire someone with a trendy background who's good at traditional coding interviews and they don't work out, who could blame you? And no one's going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they'll say things like &quot;we'd have to lower our bar to hire more people, and no one wants that&quot;. But I've never worked at a place that doesn't filter out a lot of people who end up doing great work elsewhere. I've tried to get underrated programmers<sup class="footnote-ref" id="fnref:U"><a rel="footnote" href="#fn:U">5</a></sup> hired at places I've worked, and I've literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years of being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn't only cost them my brilliant friend, it eventually cost them me!</p> <p>BTW, this illustrates a problem with Joel's idea that &quot;great&quot; devs never apply for jobs. There's often a long time period where a &quot;great&quot; dev has an extremely hard time getting hired, even through their network who knows that they're great, because they don't look like what people think &quot;great&quot; developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don't actually generate much signal. While people will refer &quot;great&quot; devs, they'll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect.
Because of this and other problems, many companies don't weight referrals particularly heavily, and &quot;great&quot; developers still go through the normal hiring process, just like everyone else.</p> <p>(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don't seem to need much convincing on this one, and I see this happen all the time.</p> <p>(4), dysfunction <a href="//danluu.com/learning-to-program/#fixing-totally-broken-danluu-com-wat-situations">seems hard to fix</a>. If I knew how to do that, I'd be a manager.</p> <p>As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I'm biased. I'm not a hiring manager. There's probably some hiring manager out there thinking: &quot;every developer I know who complains that it's hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!&quot;</p> <p><small> Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Alkin Kaz, Ben Kuhn, and Lindsey Kuper for comments and corrections. </small></p> <p>If you liked this post, you'd probably enjoy <a href="http://danluu.com/tech-discrimination/">this other post on the bogosity of claims that there can't possibly be discrimination in tech hiring</a>.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:I">The folks who stayed describe an environment that's mostly missing mid-level people they'd want to work with. There are lifers who've been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8 year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:L">This section kinda makes it sound like I'm looking for work. I'm not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:H"><p><a href="https://www.youtube.com/watch?v=r8RxkpUvxK0">Moishe Lettvin has a talk I really like</a>, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the &quot;candidates&quot; were actually anonymized versions of their own interviews!</p> <p>The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn't great. &quot;In classic Microsoft style&quot;, his manager forwarded the interview feedback to the entire team and said &quot;don't do this&quot;. &quot;In classic Microsoft style&quot; is a quote from Moishe, but I've observed the same thing.
I'd like to talk about how we have a tendency to do <em>extremely blameful</em> postmortems and how that warps incentives, but that probably deserves its own post.</p> <p>Well, I'll tell one story, in remembrance of someone who recently left my former team for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.</p> <p>Note that this guy isn't included in my 3/7 stat because he joined shortly after I did, and I'm not trying to cherry pick a window with the highest possible attrition.</p> <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> <li id="fn:O">For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I've heard, they couldn't resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:U"><p>This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there's no one easy way to find them. There are people like &quot;Bob&quot;, who do critical work that's simply not noticed. There are also people who are just terrible at interviewing, like <a href="https://www.linkedin.com/in/jeshua-smith-1a873858">Jeshua Smith</a>. I believe he's only once gotten a performance review that wasn't excellent (that semester, his manager said he could only give out one top rating, and it wouldn't be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he's worked, he's been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I've worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up <em>a lot</em> of people from his old company, so there were many people at the new company that knew him. He can't get the time of day almost anywhere else. Another person I've tried and failed to get hired is someone I'll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a &quot;nontraditional&quot; background that didn't fit the recruiter's idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a &quot;nontraditional&quot; background, but you know what I mean.</p> <p>There's enough variety here that there isn't one way to find all of these people. 
Having a filtering process that's more like Matasano's and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.</p> <a class="footnote-return" href="#fnref:U"><sup>[return]</sup></a></li> </ol> </div> I could do that in a weekend! sounds-easy/ Mon, 03 Oct 2016 01:14:27 -0700 sounds-easy/ <p>I can't think of a single large software company that doesn't regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” <a href="https://bitquabit.com/post/one-which-i-call-out-hacker-news/">Benjamin Pollack</a> and <a href="https://blog.codinghorror.com/code-its-trivial/">Jeff Atwood</a> called out people who do that with Stack Overflow. But <a href="http://nickcraver.com/blog/2016/03/29/stack-overflow-the-hardware-2016-edition/">Stack Overflow is relatively obviously lean</a>, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands, of engineers?</p> <p>A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that <a href="http://blog.nullspace.io/building-search-engines.html">maybe building a better Google is a non-trivial problem</a>. Considering how much of Google's $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn't a trivial problem. But in the comments on Alex's posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google's capabilities in the next few years. It's been long enough since then that we can look back and say that Lucene hasn't improved so much that Google is in danger from a startup that puts together a Lucene cluster. If anything, the cost of creating a viable competitor to Google search has gone up.</p> <p>For making a viable Google competitor, I believe that ranking is a harder problem than indexing, but even if we just look at indexing, there are individual domains that contain on the order of one trillion pages we might want to index (like Twitter) and I'd guess that we can find on the order of a trillion domains. If you try to configure any off-the-shelf search index to hold an index of some number of trillions of items to handle a load of, say, 1/100th Google's load, with a latency budget of, say, 100ms (most of the latency should be for ranking, not indexing), I think you'll find that this isn't trivial. And if you use Google to search Twitter, you can observe that, at least for select users or tweets, Google indexes Twitter quickly enough that it's basically real-time from the standpoint of users. Anyone who's tried to do real-time indexing with Lucene on a large corpus under high load will also find this to be non-trivial. You might say that this isn't totally fair since it's possible to find tweets that aren't indexed by major search engines, but if you want to make a call on what to index or not, well, that's also a problem that's non-trivial in the general case.
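</p> <p>To make &quot;this isn't trivial&quot; slightly more concrete, here's the sort of back-of-envelope arithmetic involved. Every number below is an assumption I'm making up for illustration (index bytes per document, usable RAM per machine, what 1/100th of Google's load might mean in queries per second); none of them come from this post:</p> <pre><code># Back-of-envelope sizing for an index of roughly a trillion documents.
# Every constant here is an illustrative assumption, not a measured number.
docs            = 1e12      # "some number of trillions of items"
bytes_per_doc   = 1_000     # assumed average index footprint per document
ram_per_machine = 256e9     # assumed usable index RAM per machine (256 GB)
queries_per_sec = 1_000     # assumed stand-in for "1/100th Google's load"
latency_budget  = 0.100     # 100 ms end-to-end

index_bytes = docs * bytes_per_doc
machines = index_bytes / ram_per_machine
print(f"index size: {index_bytes / 1e15:.0f} PB, "
      f"machines just to hold it in RAM: {machines:,.0f}")

# Queries fan out to every shard, and most of the 100 ms budget is supposed
# to go to ranking, so retrieval only gets a small slice of it.
retrieval_budget = latency_budget * 0.2
print(f"per-shard retrieval budget: {retrieval_budget * 1e3:.0f} ms "
      f"at {queries_per_sec:,} queries/sec across {machines:,.0f} shards")
</code></pre> <p>Even with generous assumptions, you're into thousands of machines and a retrieval budget of a couple tens of milliseconds before you've ranked a single result, which is the point of the paragraph above.</p> <p>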
<p>Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There's also a wide body of research that's found that decreasing latency has a significant effect on revenue over a pretty wide range of latencies for some businesses. Increasing performance also has the benefit of reducing costs. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.</p> <p>And that's just performance. Features also matter: when I talk to engineers working on basically any product at any company, they'll often point out that there are seemingly trivial individual features that can add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.</p> <p>Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l'foo” should often match “un foo” and vice versa, but American search engines from the 90s didn't actually handle that correctly. How about tokenizing Chinese queries, where words don't have spaces between them, and sentences don't have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? <a href="https://r12a.github.io/scripts/tutorial/part3">And that's not even the most complicated part of handling Arabic</a>! It's fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these are handled ok by open source projects, but many of them involve open research problems.</p> <p>There's also security! If you don't “bloat” your company by hiring security people, you'll end up like Hotmail or Yahoo, where your product is better known for how often it's hacked than for any of its other features.</p> <p>Everything we've looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they're saying to what my coworkers think they're saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>. When people imagine how long it should take to build something, they're often imagining a team that works perfectly and spends 100% of its time coding. But that's impossible to scale up. The question isn't whether or not there will be inefficiencies, but how much inefficiency. 
A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.</p> <p>This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google's capabilities any day now, but there's nothing about this post that's unique to search. If you talk to people in almost any field, you'll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn't that it would be impossible for a small team to build something better than Google search. It's entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hyper-growth phase and wants to maximize its profits, it will end up with a multi-thousand-person platforms org, just like Google's, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don't even realize are there, each of which provides a noticeable increase in revenue. It's &quot;obvious&quot; that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single- or double-digit percentages that they can't get from Stripe or Braintree. That fifty-person billing team is totally worth it, beyond a certain size. And then there's sales, which most engineers don't even think of<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>; the exact same line of reasoning that applies to optimization also applies to sales -- as long as the marginal benefit of adding another salesperson exceeds the cost, you should expect the company to keep adding salespeople, which can often result in a sales force that's larger than the engineering team. There's also research, which, almost by definition, involves a lot of bets that don't pan out!</p> <p>It's not that all of those things are necessary to run a service at all; it's that almost every large service is leaving money on the table if they don't seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. <a href="//danluu.com/postmortem-lessons/">For reliable systems, error handling is more work than the happy path</a>. The same thing is true for large services -- all of this stuff that people don't think of as “real” work is more work than the core service<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">3</a></sup>.</p> <h3 id="correction">Correction</h3> <p>I often make minor tweaks and add new information without comment, but the original version of this post had an error, and removing the error was a large enough change that I believe it's worth pointing out. 
I had a back-of-the-envelope calculation on the cost of indexing the web with Lucene, but the numbers were based on benchmark results from some papers and comments from people who work on a commercial search engine. When I tried to reproduce the results from the papers, I found that <a href="https://twitter.com/danluu/status/814167684954738688?lang=ro">it was trivial to get orders of magnitude better performance than reported in one paper</a>, and when I tried to track down the underlying source for the comments by people who work on a commercial search engine, I found that there was no experimental evidence underlying the comments, so I removed the example.</p> <p><small>I'm experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post <a href="//danluu.com/bimodal-compensation/">and my last post</a> were written that way. <a href="https://twitter.com/danluu">Let me know what you think</a> of these posts relative to my “normal” posts!</small></p> <p><small>Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, Heath Borders, Kris Shamloo, Justin Blank, and Ivar Refsdal for corrections.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">Recently, I was curious about why an org that's notorious for unreliable services produces so many of them. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and, based on who's getting promoted, the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they'll actually have reliable services. Maybe. That's just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That's made harder by the fact that the org is in a company that's well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:S">For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we'll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:W">This isn't to say that there isn't waste or that different companies don't have different levels of waste. I see waste everywhere I look, but it's usually not what people on the outside think of as waste. Whenever I read outsiders' descriptions of what's wasteful at the companies I've worked at, they're almost inevitably wrong. 
Friends of mine who work at other places also describe the same dynamic. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> </ol> </div> Is dev compensation bimodal? bimodal-compensation/ Mon, 26 Sep 2016 23:33:26 -0700 bimodal-compensation/ <p>Developer compensation has skyrocketed since the demise of the <a href="google-wage-fixing/">Google et al. wage-suppressing no-hire agreement</a>, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc. In software, &quot;senior&quot; dev salary at a high-paying tech company is $350k/yr, where &quot;senior&quot; can mean &quot;someone three years out of school&quot; and it's not uncommon for someone who's considered a high-performing engineer to make seven figures.</p> <p>Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let's see what data we can find. First, let's look at <a href="http://www.nalp.org/uploads/0812Research.pdf">data from the National Association for Law Placement</a>, which shows when legal salaries became bimodal.</p> <h3 id="lawyers-in-1991">Lawyers in 1991</h3> <p><img src="images/bimodal-compensation/law-1991.png" alt="First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k" width="1280" height="914"></p> <p>Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS, $90k in 1991 is worth $160k in 2016 dollars. That's a pretty generous starting salary.</p> <h3 id="lawyers-in-2000">Lawyers in 2000</h3> <p><img src="images/bimodal-compensation/law-2000.png" alt="First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k" width="1280" height="908"></p> <p>By 2000, the distribution had become bimodal. The lower peak has only moved up a bit in nominal (non-inflation-adjusted) terms, leaving it roughly flat in real (inflation-adjusted) terms, and there's an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. The peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.</p> <p>I won't belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren't really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.</p> <h3 id="programmers-in-2015">Programmers in 2015</h3> <p>Enough about lawyers! What about programmers? Unfortunately, it's hard to get good data on this. Anecdotally, it sure seems to me like we're going down the same road. Unfortunately, almost all of the public data sources that are available, like H1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don't capture what's going on.</p> <p>One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:</p> <p><img src="images/bimodal-compensation/cs-2015.png" alt="First-year programmer compensation in 2016. Compensation ranges from $50k to $250k" width="1280" height="1006"></p> <p>There's certainly a wide range here, and while it's technically bimodal, there isn't a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelor's grads with a few master's grads. PhD numbers, which sometimes go much higher, aren't included.</p>
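<p>As a methodological aside, if you have the raw numbers and want something a bit more principled than eyeballing a histogram, one rough check is to fit one- and two-component Gaussian mixtures and compare BIC. The snippet below does that on synthetic data; the numbers are made up for illustration and aren't Dan and Jesse's actual data:</p> <pre><code># Sketch: is a compensation sample meaningfully bimodal? Compare 1- vs 2-component
# Gaussian mixtures by BIC. The sample below is synthetic, for illustration only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend new-grad total comp in $k: a big cluster near 105 and a smaller one near 175.
comp = np.concatenate([rng.normal(105, 15, 300), rng.normal(175, 20, 100)])
X = comp.reshape(-1, 1)

gmm1 = GaussianMixture(n_components=1, random_state=0).fit(X)
gmm2 = GaussianMixture(n_components=2, random_state=0).fit(X)

print('BIC, 1 component: ', round(gmm1.bic(X)))
print('BIC, 2 components:', round(gmm2.bic(X)))  # lower BIC favors the 2-mode model

modes = sorted(gmm2.means_.ravel())
print('fitted modes ($k):', [round(m) for m in modes])
</code></pre> <p>With only ~100 real data points, any test like this will be noisy, which is part of why I wouldn't read too much into the exact shape of the distribution above.</p>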
<p>Do you know of a better (larger) source of data? This is from about 100 data points, members of the &quot;Hackathon Hackers&quot; Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end — if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of &quot;people negotiating offers at Microsoft&quot; also isn't representative.</p> <p>If we had more representative data, it's possible that we'd see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It's also possible that we'd see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we'd see a smooth drop-off instead of two distinct modes.</p> <p>Stepping back from the meager data we have and looking at the circumstances, &quot;should&quot; programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament; that distorts compensation and makes it more likely that compensation ends up bimodal. Additionally, competitive firms pay the same rate to all first-year employees, which they determine by matching whoever appears to be paying the most. For example, this year, <a href="http://abovethelaw.com/2016/06/breaking-ny-to-180k-cravath-raises-associate-base-salaries/2/">Cravath announced that it would pay first-year associates $180k, and many other firms followed suit</a>. Like most high-end firms, Cravath has a salary schedule that's entirely based on experience:</p> <ul> <li>0 years: $180k</li> <li>1 year: $190k</li> <li>2 years: $210k</li> <li>3 years: $235k</li> <li>4 years: $260k</li> <li>5 years: $280k</li> <li>6 years: $300k</li> <li>7 years: $315k</li> </ul> <p>In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we'll see a sharp peak the way we do in law. If I had to guess, I'd say that while the dispersion in programmer compensation is increasing, it's not bimodal, but I don't really have the right data set to conclusively say anything. Please point me to any data you have that's better.</p> <h3 id="appendix-a-please-don-t-send-me-these">Appendix A: please don't send me these</h3> <ul> <li>H-1B: mostly salary only.</li> <li>Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey — I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.</li> <li>Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. 
Numbers for some companies seem to, but numbers for other companies don't.</li> <li>O'Reilly survey: salary-focused.</li> <li>BLS: doesn't make fine-grained distribution available.</li> <li>IRS: they must have the data, but they're not sharing.</li> <li>IDG: only has averages.</li> <li>internal company data: too narrow.</li> <li>compensation survey companies like PayScale: when I've talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that's what drives the upper end of the market (outside of finance).</li> <li>#talkpay on Twitter: numbers skew low<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">1</a></sup>.</li> </ul> <h3 id="appendix-b-why-are-programmers-well-paid">Appendix B: why are programmers well paid?</h3> <p>Since we have both programmer and lawyer compensation handy, let's examine that. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.</p> <p>If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars (while it's possible to get a full scholarship, a relatively small fraction of students at top schools are on full scholarships). Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better but work even longer hours than lawyers; private equity is somewhere between investment banking and law in hours and compensation; etc.).</p> <p>Medicine seems to be a bit better from the sacrifice standpoint because there's a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.</p> <p>Programming also doesn't have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well-paying job. Sure, there are a lot of startups that basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups (the following is from a company that was advertising on Slate Star Codex for quite a long time):</p> <blockquote> <p>Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 &quot;20 under 20&quot; Thiel fellow. Candidates often remark we're the strongest team they've ever seen.</p> <p>We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.</p> </blockquote> <p>Prestige-obsessed places exist. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing, not a top-paying-company thing. Big companies, <a href="startup-tradeoffs/">which pay a lot better than startups</a>, don't filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don't need to have the right kind of degree, or any degree at all. 
Although it's getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).</p> <p>How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that's the case, you'd expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn't seem inherently harder than electrical engineering or chemical engineering and it certainly hasn't gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don't make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There's no reason to believe that we won't see a flow of people from engineering fields into programming until compensation is balanced.</p> <p>Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn't there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they'd pay you approximately the same thing in Washington or Colorado as they do in Silicon Valley, but they'd pay you much less in London. Whenever one of these discussions comes up, people always bring up the &quot;fact&quot; that SV salaries aren't really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they'll match them in places like Pittsburgh. My best guess for why this happens is that someone in the Midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can't<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn't have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. that compete for programmers the same way companies compete inside the U.S.</p> <p>Another theory that I've heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in <a href="https://en.wikipedia.org/wiki/O-ring_theory_of_economic_development">Kremer's O-ring model</a>. This model assumes that productivity is multiplicative. If your co-workers are better, you're more productive and produce more value. If that's the case, you expect a kind of assortative matching where you end up with high-skill firms that pay better, and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can't explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it's much easier to utilize the work of productive people outside your firm than in most fields. This model should be less true of programming than of most engineering fields, but the dispersion in compensation is higher.</p>
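<p>To see where the assortative matching prediction comes from, here's a toy version of the model with made-up numbers: two firms, two workers each, and output modeled as the product of worker skills. The only point of the sketch is that multiplicative production rewards sorting similar workers into the same firm:</p> <pre><code># Toy O-ring example: two firms, two workers each, output = product of skills.
# The numbers are made up; the point is that multiplicative production rewards sorting.

def firm_output(skill_a, skill_b):
    # O-ring assumption: one weak link drags everyone's output down multiplicatively.
    return skill_a * skill_b

# Four workers: two high-skill (0.9) and two low-skill (0.5).
assortative_total = firm_output(0.9, 0.9) + firm_output(0.5, 0.5)  # 0.81 + 0.25
mixed_total = firm_output(0.9, 0.5) + firm_output(0.9, 0.5)        # 0.45 + 0.45

print('high with high, low with low:', round(assortative_total, 2))  # 1.06
print('one of each per firm:        ', round(mixed_total, 2))        # 0.9
</code></pre> <p>In that toy world, a high-skill worker is worth more at the high-skill firm than at a mixed one, so pay can diverge across firms. The objection above still applies, though: this story should hold at least as strongly for other engineering fields, and their pay is less dispersed.</p>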
<p>A related theory that can't be correct for similar reasons is that high-paid software engineers are extra elite, the best of the best, and are simply paid more because they're more productive. If you look at how many programmers the BLS says exist in the U.S. (on the order of a few million) and how many engineers high-paying tech companies employ in the U.S. (on the order of a couple or a few hundred thousand), high-paying software companies literally can't consist of the top 1%. Even if their filters were perfect (as opposed to the complete joke that they're widely regarded to be), they couldn't be better than 90%-ile. Realistically, it's more likely that the median programmer at a high-paying tech company is a bit above 50%-ile.</p> <p>The most common theory I've heard is that &quot;software is eating the world&quot;. The theory goes: of course programmers get paid a lot and will continue to get paid a lot because software is important and only becoming more important. Despite being the most commonly stated theory I've heard, this seems nonsensical if you compare it to other fields. You could've said this about microprocessor design in the late 90s as well as fiber optics. Those fields are both more important today than they were in the 90s: not only is there more demand for processing power and bandwidth than ever before, but demand for software actually depends on them. And yet, the optics engineering job market still hasn't recovered from the dot-com crash and the microprocessor design engineer market, after recovering, still pays experienced PhDs less than a CS new grad at Facebook.</p> <p>Furthermore, any argument for high programmer pay that relies on some inherent property of market conditions, the economy at large, the impact of programming, etc., seems like it cannot be correct if you look at what's actually driven up programmer pay. FB declined to participate in <a href="google-wage-fixing/">the Google/Apple wage-fixing agreement that became basically industry-wide</a>, which meant that FB was outpaying other major tech companies. When the wage-fixing agreement was lifted, other companies &quot;had to&quot; come close to matching FB compensation to avoid losing people both to FB and to each other. When they did that, FB kept raising the bar on compensation and compensation kept getting better. [2022 update] This can most clearly be seen with changes to benefits and pay structure, where FB would make a change, Google would follow suit immediately, and other companies would pick up the change later, as when FB removed vesting cliffs and Google did the same within weeks and the change trickled out across the industry. 
There are companies that were paying programmers as well as or better than FB, like Netflix and a variety of finance companies, but major tech companies tended not to match offers from those places because those companies were too small to hire away enough programmers to be concerning. FB, on the other hand, is large and hires enough to be a concern to Google, which matches FB; combined, the two are large enough to be a concern to the other major tech companies.</p> <p>Because the mechanism for compensation increases has been arbitrary (FB might not have existed, or Zuckerberg, who has total control of FB, could have decided on a different compensation policy), it's quite arbitrary that programmer pay is as good as it is.</p> <p>In conclusion, high programmer pay seems like a mystery to me and I would love to hear a compelling theory for why programming &quot;should&quot; pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.</p> <h3 id="update">Update</h3> <p>Eric Roberts has observed that <a href="https://cs.stanford.edu/people/eroberts/CSCapacity.pdf">it takes a long time for CS enrollments to recover after a downturn, leading to a large deficit in the number of people with CS degrees vs. demand</a>.</p> <p><img src="images/bimodal-compensation/cs-degrees.png" alt="More than a one-decade lag between downturn and recovery in enrollment" width="1800" height="1031"></p> <p>The 2001 bubble bursting caused a severe drop in CS enrollment. CS enrollment didn't hit its previous peak again until 2014, and if you fit the graph and extrapolate against the peaks, it took another year or two for enrollments to hit the historical trend. Even if we didn't have any data, it wouldn't be surprising to find a delay of five years or so. Of the people who graduate in four years (as opposed to five or more), most aren't going to change their major after mid or late sophomore year, so that's already two to three years of delay right there. And after a downturn, it takes some time to recover, so we'd expect at least another two to three years. Roberts makes a case that the additional latency came from a number of other factors, including the fear that, even though things looked ok, jobs would soon be outsourced, as well as a slow response by colleges.</p> <p><a href="https://danwang.co/why-so-few-computer-science-majors/">Dan Wang</a> has noted that, according to the SO survey, 3/4 of developers have a BS degree (or higher). If it's statistically &quot;hard&quot; to get a high-paying job without a CS degree and there's a decade-plus hangover from the 2001 downturn, that could explain why programmer compensation is so high. Of course, most of us know people in the industry without a degree, but it seems to be harder to find an entry-level position without a credential.</p> <p>It's not clear what this means for the future. Even if the lack of candidates with the appropriate credential is a major driver of programmer compensation, it's unclear what the record CS enrollments over the past few years mean for future compensation. It's possible that record enrollments mean that we should expect compensation to come back down to the levels we see in other fields that require similar skills, like electrical engineering. 
It's also possible that enrollment continues to lag behind demand by a decade and that record enrollments are just keeping pace with demand from a decade ago, in which case we might expect elevated compensation to persist (as long as other factors, like hiring outside of the U.S., don't influence things too much). Since there's so much latency, another possibility is that enrollment has overshot, or will overshoot, demand and we should expect programmer compensation to decline. And it's not even clear that the Roberts paper makes sense as an explanation for high current comp because Roberts also found a huge capacity crunch in the 80s and, while some programmers were paid very well, the fraction of programmers who were paid &quot;very well&quot; seems to have been much smaller than it is today. Google alone employs 30k engineers. If 20k of those are programmers in the U.S. and we use the estimate that there are 3 million programmers in the U.S., Google alone employs roughly 0.7% of programmers in the U.S. If you add in the other large companies that are known to pay competitively (Amazon, Facebook, etc.), that's a significant fraction of all programmers in the U.S., which I believe is quite different from the situation in the 80s.</p> <p>The most common response I've gotten to this post is that we should expect programmers to be well-paid because software is everywhere and there will be at least as much software in the future. This exact same line of reasoning could apply to electrical engineering, which is more fundamental than software, in that software requires hardware, and yet electrical engineering comp isn't in the same league as programmer comp. Highly paid programmers couldn't get their work done without microprocessors, and there are more processors sold than ever before, but the comp packages for a &quot;senior&quot; person at places like Intel and Qualcomm aren't even within a factor of two of those at Google or Facebook. You could also make a similar argument for people who work on water and sewage systems, but those folks don't see compensation that's in the same range as programmers either. Any argument of the form &quot;the price for X is high because X is important&quot; implicitly assumes that there's some force constraining the supply of X. The claim that &quot;X is important&quot; or &quot;we need a lot of X&quot; is missing half the story. Another problem with claims like &quot;X is important&quot; or &quot;X is hard&quot; is that these statements don't seem any less true of industries that pay much less. If your explanation of why programmers are well paid is just as true of any &quot;classical&quot; engineering discipline, you need some explanation of why those other fields shouldn't be as well paid.</p> <p>The second most common comment that I hear is that of course programmers are well paid: software companies are worth so much that it's inevitable. But there's nothing inevitable about workers actually being well compensated because a company is profitable. Someone who made this argument sent me a link to <a href="http://www.businessinsider.com/apple-facebook-alphabet-most-profitable-companies-per-employee-2017-12">this list of the most profitable companies per employee</a>. 
The list has some software companies that pay quite well, like Alphabet (Google) and Facebook, but we also see hardware companies like Qualcomm, Cisco, TSMC (and arguably SoftBank now that they've acquired ARM) that don't even pay as well as software companies that don't turn a profit or that barely make money and have no path to being wildly profitable in the future. Moreover, the compensation at the software companies that are listed isn't very strongly related to their profit per employee.</p> <p>To take a specific example that I'm familiar with because I grew up in Madison, the execs at Epic Systems have built a company that's generated so much wealth that its founder has an estimated net worth of $3.6 billion, which is much more than all but the most successful founders in tech. But line engineers at Epic are paid significantly less than engineers at tech companies that compete with SV for talent, even tech companies that have never made any money. What is it about some software companies that make a similar amount of money that prevents them from funneling virtually all of the wealth they generate up to the top? The typical answer to this is cost of living, but as we've seen, that makes even less sense than usual in this case since Google has an office in the same city as Epic, and Google pays well over double what Epic does for a typical dev. If there were some kind of simple cost-of-living adjustment, you'd expect Google to pay less in Madison than in Toronto or London, but it seems to be the other way around. This isn't unique to Madison — just for example, you can find a number of successful software companies in Austin that pay roughly half what Amazon and Facebook pay in the same city, where upper management does very well for themselves and line engineers make a fine living, but nowhere near as much as they'd make if they moved to a company like Amazon or Facebook.</p> <p>The thing all of these theories have in common is that they apply to other fields as well, so they cannot be, as stated, the reason programmers are better paid than people in these other fields. Someone could argue that programming has a unique combination of many of these, or that one of these reasons should be expected to apply much more strongly to programming than to any other field, but I haven't seen anyone make that case. Instead, people just make obviously bogus statements like &quot;programming is really hard&quot; (which is only valid as a reason, in this discussion, if programming is literally the hardest field in existence and much harder than other engineering fields).</p> <div class="footnotes"> <hr /> <ol> <li id="fn:T">People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they're bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they're generally predictable based on location, square footage, amenities, and so on. But there's a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don't participate in #talkpay: people don't want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as gauche. 
<a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:M">There's a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that's less competitive. But this is enough of a hassle that most employees balk at the idea. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> How I learned to program learning-to-program/ Mon, 12 Sep 2016 01:41:26 -0700 learning-to-program/ <p>Tavish Armstrong has a great document <a href="https://github.com/tarmstrong/longcv/blob/master/bio.md">where he describes how and when he learned the programming skills he has</a>. I like this idea because I've found that the paths that people take to get into programming are much more varied than stereotypes give credit for, and I think it's useful to see that there are many possible paths into programming.</p> <p>Personally, I spent a decade working as an electrical engineer before taking a programming job. When I talk to people about this, they often want to take away a smooth narrative of my history. Maybe it's that my math background gives me tools I can apply to a lot of problems, maybe it's that my hardware background gives me a good understanding of performance and testing, or maybe it's that the combination makes me a great fit for hardware/software co-design problems. <a href="https://www.youtube.com/watch?v=RoEEDKwzNBw">People like a good narrative</a>. One narrative people seem to like is that I'm a good problem solver, and that problem solving ability is generalizable. But reality is messy. Electrical engineering seemed like the most natural thing in the world, and I picked it up without trying very hard. Programming was unnatural for me, and didn't make any sense at all for years. If you believe in the common &quot;you either have it or you don't&quot; narrative about programmers, I definitely don't have it. And yet, I now make a living programming, and people seem to be pretty happy with the work I do.</p> <p>How'd that happen? Well, if we go back to the beginning, before becoming a hardware engineer, I spent a fair amount of time doing failed kid-projects (e.g., writing a tic-tac-toe game and AI) and not really &quot;getting&quot; programming. I do sometimes get a lot of value out of my math or hardware skills, but I suspect I could teach someone the actually applicable math and hardware skills I have in less than a year. Spending five years in a school and a decade in industry to pick up those skills was a circuitous route to getting where I am. Amazingly, I've found that my path has been more direct than that of most of my co-workers, giving the lie to the narrative that most programmers are talented whiz kids who took to programming early.</p> <p>And while I only use a small fraction of the technical skills I've learned on any given day, I find that I have a meta-skill set that I use all the time. There's nothing profound about the meta-skill set, but because I often work in new (to me) problem domains, I find my meta skillset to be more valuable than my actual skills. I don't think that you can communicate the importance of meta-skills (like communication) by writing a blog post any more than you can <a href="https://byorgey.wordpress.com/2009/01/12/abstraction-intuition-and-the-monad-tutorial-fallacy/">explain what a monad is by saying that it's like a burrito</a>. 
That being said, I'm going to tell this story anyway.</p> <h3 id="ineffective-fumbling-1980s-1996">Ineffective fumbling (1980s - 1996)</h3> <p>Many of my friends and I tried and failed multiple times to learn how to program. We tried BASIC, and could write some simple loops, use conditionals, and print to the screen, but never figured out how to do anything fun or useful.</p> <p>We were exposed to some kind of lego-related programming, uhhh, thing in school, but none of us had any idea how to do anything beyond what was in the instructions. While it was fun, it was no more educational than a video game and had a similar impact.</p> <p><a href="https://www.linkedin.com/in/jeshua-smith-1a873858">One of us</a> got <a href="https://www.amazon.com/gp/product/1568301839/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1568301839&amp;linkId=408c9bc67c8e6b0a405ce7ccdd7ed7b4">a game programming book</a>. We read it, tried to do a few things, and made no progress.</p> <h3 id="high-school-1996-2000">High school (1996 - 2000)</h3> <p>Our ineffective fumbling continued through high school. Due to an interest in gaming, I got interested in benchmarking, which eventually led to learning about CPUs and CPU microarchitecture. This was in the early days of Google, before Google Scholar, and before most CS/EE papers could be found online for free, so this was mostly material from enthusiast sites. Luckily, the internet was relatively young, as were the users on the sites I frequented. Much of the material on hardware was targeted at (and even written by) people like me, which made it accessible. Unfortunately, a lot of the material on programming was written by and targeted at professional programmers, things like <a href="http://www.azillionmonkeys.com/qed/optimize.html">Paul Hsieh's optimization guide</a>. There were some beginner-friendly guides to programming out there, but my friends and I didn't stumble across them.</p> <p>We had programming classes in high school: an introductory class that covered Visual Basic and an <a href="https://en.wikipedia.org/wiki/Advanced_Placement_exams">AP class</a> that taught C++. Both classes were taught by someone who didn't really know how to program or how to teach programming. My class had a couple of kids who already knew how to program and <a href="https://www.topcoder.com/members/po/">would make good money doing programming competitions on topcoder when it opened</a>, but they failed to test out of the intro class because that test included things like a screenshot of the VB6 IDE, where you got a point for correctly identifying what each button did. The class taught about as much as you'd expect from a class where the pre-test involved identifying UI elements from an IDE.</p> <p>The AP class the year after was similarly effective. About halfway through the class, a couple of students organized an independent study group which worked through an alternate textbook because the class was clearly not preparing us for the AP exam. I passed the AP exam because it was one of those multiple choice tests that's possible to pass without knowing the material.</p> <p>Although I didn't learn much, I wouldn't have graduated high school if not for AP classes. I failed enough individual classes that I almost didn't have enough credits to graduate. 
I got those necessary credits for two reasons. First, a lot of the teachers had a deal where, if you scored well on the AP exam, they would give you a passing grade in the class (usually an A, but sometimes a B). Second, even that wouldn't have been enough if my chemistry teacher hadn't also changed my grade to a passing grade when he found out I did well on the AP chemistry test<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Other than not failing out of high school, I'm not sure I got much out of my AP classes. My AP CS class actually had a net negative effect on my learning to program because the AP test let me opt out of the first two intro CS classes in college (an introduction to programming and a data structures course). In retrospect, I should have taken the intro classes, but I didn't, which left me with huge holes in my knowledge that I didn't really fill in for nearly a decade.</p> <h3 id="college-2000-2003">College (2000 - 2003)</h3> <p>Because I'd nearly failed out of high school, there was no reasonable way I could have gotten into a &quot;good&quot; college. Luckily, I grew up in Wisconsin, a state with a &quot;good&quot; school that used a formula to determine who would automatically get admitted: the GPA cutoff depended on standardized test scores, and anyone with standardized test scores above a certain mark was admitted regardless of GPA. During orientation, I talked to someone who did admissions and found out that my year was the last year they used the formula.</p> <p>I majored in computer engineering and math for reasons that seem quite bad in retrospect. I had no idea what I really wanted to study. I settled on either computer engineering or engineering mechanics because both of those sounded &quot;hard&quot;.</p> <p>I made a number of attempts to come up with better criteria for choosing a major. The most serious was when I spent a week talking to professors in an attempt to find out what day-to-day life in different fields was like. That approach had two key flaws. First, most professors don't know what it's like to work in industry; now that I work in industry and talk to folks in academia, I see that most academics who haven't done stints in industry have a lot of misconceptions about what it's like. Second, even if I managed to get accurate descriptions of different fields, it turns out that there's a wide body of research that indicates that humans are basically hopeless at predicting which activities they'll enjoy. Ultimately, I decided by coin flip.</p> <h4 id="math">Math</h4> <p>I wasn't planning on majoring in math, but my freshman intro calculus course was so much fun that I ended up adding a math major. That only happened because a high-school friend of mine passed me the application form for the honors calculus sequence because he thought I might be interested in it (he'd already taken the entire calculus sequence as well as linear algebra). The professor for the class covered the material at an unusually fast pace: he finished what was supposed to be a year-long calculus textbook partway through the semester and then lectured on his research for the rest of the semester. The class was theorem-proof oriented and didn't involve any of that yucky memorization that I'd previously associated with math. That was the first time I'd found school engaging in my entire life and it made me really look forward to going to math classes. 
I later found out that non-honors calculus involved a lot of memorization when the engineering school required me to go back and take calculus II, which I'd skipped because I'd already covered the material in the intro calculus course.</p> <p>If I hadn't had a friend drop the application for honors calculus in my lap, I probably wouldn't have majored in math and it's possible I never would have found any classes that seemed worth attending. Even as it was, all of the most engaging undergrad professors I had were math professors<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup> and I mostly skipped my other classes. I don't know how much of that was because my math classes were much smaller, and therefore much more customized to the people in the class (computer engineering was very trendy at the time, and classes were overflowing), and how much was because these professors were really great teachers.</p> <p>Although I occasionally get some use out of the math that I learned, most of the value was in becoming confident that I can learn and work through the math I need to solve any particular problem.</p> <h4 id="engineering">Engineering</h4> <p>In my engineering classes, I learned how to <a href="http://danluu.com/teach-debugging/">debug</a> and how computers work down to the transistor level. I spent a fair amount of time skipping classes and reading about topics of interest in the library, which included things like computer arithmetic and circuit design. I still have fond memories of <a href="https://www.amazon.com/gp/product/1568811608/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1568811608&amp;linkId=ebede6c98dc79cb83fa8695049df4dc4">Koren's Computer Arithmetic Algorithms</a> and <a href="https://www.amazon.com/gp/product/078036001X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=078036001X&amp;linkId=e806434c2ef1eeef9e3d5963ecb68393">Chandrakasan et al.'s Design of High-Performance Microprocessor Circuits</a>. I also started reading papers; I spent a lot of time in libraries reading physics and engineering papers that mostly didn't make sense to me. The notable exception was systems papers, which I found to be easy reading. I distinctly remember reading <a href="http://www.hpl.hp.com/techreports/1999/HPL-1999-77.html">the Dynamo paper</a> (this was HP's paper on JITs, not the more recent Amazon work of the same name), but I can't recall any other papers I read back then.</p> <h4 id="internships">Internships</h4> <p>I had two internships, one at Micron where I &quot;worked on&quot; flash memory, and another at IBM where I worked on the POWER6. The Micron internship was a textbook example of a bad internship. When I showed up, my manager was surprised that he was getting an intern and had nothing for me to do. After a while (perhaps a day), he found an assignment for me: press buttons on a phone. He'd managed to find a phone that used Micron flash chips; he handed it to me, told me to test it, and walked off.</p> <p>After poking at the phone for an hour or two and not being able to find any obvious bugs, I walked around and found people who had tasks I could do. Most of them were only slightly less manual than &quot;testing&quot; a phone by mashing buttons, but I did one not-totally-uninteresting task, which was to verify that a flash chip's controller behaved correctly. 
Unlike my other tasks, this was amenable to automation and I was able to write a Perl script to do the testing for me.</p> <p>I chose Perl because someone had a Perl book on their desk that I could borrow, which seemed like as good a reason as any at the time. I called up a friend of mine to tell him about this great &quot;new&quot; language and we implemented Age of Renaissance, a board game we'd played in high school. We didn't finish, but Perl was easy enough to use that we felt like we could write a program that actually did something interesting.</p> <p>Besides learning Perl, I learned that I could ask people for books and read them, and I spent most of the rest of my internship half keeping an eye on a manual task while reading the books people had lying around. Most of the books had to do with either analog circuit design or flash memory, so that's what I learned. None of the specifics have really been useful to me in my career, but I learned two meta-items that were useful.</p> <p>First, no one's going to stop you from spending time reading at work or spending time learning (on most teams). Micron did its best to keep interns from learning by having a default policy of blocking interns from having internet access (managers could override the policy, but mine didn't), but no one will go out of their way to prevent an intern from reading books when their other task is to randomly push buttons on a phone.</p> <p>Second, I learned that there are a lot of engineering problems we can solve without anyone really knowing why the solution works. One of the books I read was a survey of then-current research on flash memory. At the time, flash memory relied on some behaviors that were well characterized but not really understood. There were theories about how the underlying physical mechanisms might work, but determining which theory was correct was still an open question.</p> <p>The next year, I had a much more educational internship at IBM. I was attached to a logic design team on the POWER6, and since they didn't really know what to do with me, they had me do verification on the logic they were writing. They had a relatively new tool called <a href="//danluu.com/testing/#fn:S">SixthSense, which you can think of as a souped-up quickcheck</a>. The obvious skill I learned was how to write tests using a fancy testing framework, but the meta-lesson, which has been even more useful, is that writing a test-case generator and a checker is often much more productive than the manual test-case writing that passes for automated testing in most places.</p>
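<p>For anyone who hasn't seen that style of testing, here's a minimal sketch of the generator-plus-checker idea. It's just an illustration of the general shape (it's not what SixthSense looks like), with a made-up function under test and an obviously-correct reference model:</p> <pre><code># Minimal generator + checker sketch: random test-case generation plus a
# reference model, instead of hand-written test cases.
import random

def saturating_add_u8(a, b):
    # Implementation under test: 8-bit add that clamps at 255.
    s = a + b
    return 255 if s &gt; 255 else s

def reference_add_u8(a, b):
    # Slow-but-obviously-correct reference model.
    return min(a + b, 255)

def run_random_tests(n=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(n):
        a, b = rng.randrange(256), rng.randrange(256)  # the generator
        got, want = saturating_add_u8(a, b), reference_add_u8(a, b)
        # The checker: compare against the reference model.
        assert got == want, f'mismatch: {a} + {b} gave {got}, expected {want}'

run_random_tests()
print('all random cases matched the reference model')
</code></pre> <p>The checker here is trivial, but the structure scales: the generator gets smarter about hitting interesting corners, the reference model gets more detailed, and you end up with coverage that nobody would ever write by hand.</p>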
<p>The other thing I encountered for the first time at IBM was version control (CVS, unfortunately). Looking back, I find it a bit surprising that not only did I never use version control in any of my classes, but I'd never met any other students who were using version control. My IBM internship was between undergrad and grad school, so I managed to get a B.S. degree without ever using or seeing anyone use version control.</p> <h4 id="computer-science">Computer Science</h4> <p>I took a couple of CS classes. The first was algorithms, which was so poorly taught and, as a result, so heavily curved that I got an A despite not learning anything at all. The course involved no programming and while I could have done some implementation in my free time, I was much more interested in engineering and didn't try to apply any of the material.</p> <p>The second course was databases. There were a couple of programming projects, but they were all projects where you got some scaffolding and only had to implement a few key methods to make things work, so it was possible to do ok without having any idea how to program. I got involved in a competition to see who could attend the fewest possible classes, didn't learn anything, and scraped by with a B.</p> <h3 id="grad-school-2003-2005">Grad school (2003 - 2005)</h3> <p>After undergrad, I decided to go to grad school for a couple of silly reasons. One was a combination of &quot;why not?&quot; and the argument that most of my professors gave, which was that you'll never go if you don't go immediately after undergrad because it's really hard to go back to school later. But the reason that people don't go back later is because they have more information (they know what both school and work are like), and they almost always choose work! The other major reason was that I thought I'd get a more interesting job with a master's degree. That's not obviously wrong, but it appears to be untrue in general for people going into electrical engineering and programming.</p> <p>I don't know that I learned anything that I use today, either in the direct sense or in a meta sense. I had some great professors<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">3</a></sup> and I made some good friends, but I think that this wasn't a good use of time because of two bad decisions I made at the age of 19 or 20. Rather than attend a school that had a lot of people working in an area I was interested in, I went with a school that gave me a fellowship but only had one person working in an area I was really interested in. That person left just before I started.</p> <p>I ended up studying optics, and while learning a new field was a lot of fun, the experience was of no particular value to me, and I could have had fun studying something I had more of an interest in.</p> <p>While I was officially studying optics, I still spent a lot of time learning unrelated things. At one point, I decided I should learn Lisp or Haskell, probably because of something Paul Graham wrote. I couldn't find a Lisp textbook in the library, but I found a Haskell textbook. After I worked through the exercises, I had no idea how to accomplish anything practical. But I did learn about list comprehensions and got in the habit of using higher-order functions.</p> <p>Based on internet comments and advice, I had the idea that learning more languages would teach me how to be a good programmer, so I worked through introductory books on Python and Ruby. As far as I can tell, this taught me basically nothing useful and I would have been much better off learning about a specific area (like algorithms or networking) than learning lots of languages.</p> <h3 id="first-real-job-2005-2013">First real job (2005 - 2013)</h3> <p>Towards the end of grad school, I mostly looked for, and found, electrical/computer engineering jobs. The one notable exception was Google, which called me up in order to fly me out to Mountain View for an interview. I told them that they probably had the wrong person because they hadn't even done a phone screen, so they offered to do a phone interview instead. I took the phone interview expecting to fail because I didn't have any CS background, and I failed as expected. 
In retrospect, I should have asked to interview for a hardware position, but at the time I didn't know they had hardware positions, even though they'd been putting together their own servers and designing some of their own hardware for years.</p> <p>Anyway, I ended up at a little chip company called <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>. I was hesitant about taking the job because the interview was the easiest interview I had at any company<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">4</a></sup>, which made me wonder if they had a low hiring bar, and therefore relatively weak engineers. It turns out that they were, on average, the best group of people I've ever worked with. I didn't realize it at the time, but this would later teach me that companies that claim to have brilliant engineers because they have super hard interviews are full of it, and that the interview difficulty one-upmanship a lot of companies promote is more of a prestige play than anything else.</p> <p>But I'm getting ahead of myself — my first role was something they call &quot;regression debug&quot;, which included debugging test failures for both newly generated tests as well as regression tests. The main goal of this job was to teach new employees the ins-and-outs of the x86 architecture. At the time, Centaur's testing was very heavily based on chip-level testing done by injecting real instructions, interrupts, etc., onto the bus, so debugging test failures taught new employees everything there is to know about x86.</p> <p>The Intel x86 manual is thousands of pages long and it isn't sufficient to implement a compatible x86 chip. When Centaur made its first x86 chip, they followed the Intel manual in perfect detail and treated everything the manual leaves undefined as being up to the individual implementer. When they got their first chip back and tried it, they found that some compilers produced code that relied on behavior that's technically undefined on x86, but happened to always be the same on Intel chips. While that's technically a compiler bug, you can't ship a chip that isn't compatible with actually existing software, and ever since then, Centaur has implemented x86 chips by making sure that the chips match the exact behavior of Intel chips, down to matching officially undefined behavior<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">5</a></sup>.</p> <p>For years afterwards, I had encyclopedic knowledge of x86 and could set bits in control registers and <a href="https://en.wikipedia.org/wiki/Model-specific_register">MSRs</a> from memory. I didn't have a use for any of that knowledge at any future job, but the meta-skill of not being afraid of low-level hardware comes in handy pretty often, especially when I run into compiler or chip bugs. People look at you like you're a crackpot if you say you've found a hardware bug, but because we were so careful about characterizing the exact behavior of Intel chips, we would regularly find bugs and then have discussions about whether we should match the bug or match the spec (the Intel manual).</p> <p>The other thing I took away from the regression debug experience was a lifelong love of automation. Debugging often involves a large number of <a href="//danluu.com/teach-debugging/">mechanical steps</a>.
After I learned enough about x86 that debugging became boring, I started automating debugging. At that point, I knew how to write simple scripts but didn't really know how to program, so I wasn't able to totally automate the process. However, I was able to automate enough that, for 99% of failures, I just had to glance at a quick summary to figure out what the bug was, rather than spend what might be hours debugging. That turned what was previously a full-time job into something that took maybe 30-60 minutes a day (excluding days when I'd hit a bug that involved some obscure corner of x86 I wasn't already familiar with, or some bug that my script couldn't give a useful summary of).</p> <p>At that point, I did two things that I'd previously learned in internships. First, I started reading at work. I began with online commentary about programming, but there wasn't much of that, so I asked if I could expense books and read them at work. This seemed perfectly normal because a lot of other people did the same thing, and there were at least two people who averaged more than one technical book per week, including one person who averaged a technical book every 2 or 3 days.</p> <p>I settled in at a pace of somewhere between a book a week and a book a month. I read a lot of engineering books that imparted some knowledge that I no longer use, now that I spend most of my time writing software; some &quot;big idea&quot; software engineering books like Design Patterns and Refactoring, which I didn't really appreciate because I was just writing scripts; and a ton of books on different programming languages, which doesn't seem to have had any impact on me.</p> <p>The only book I read back then that changed how I write software in a way that's obvious to me was <a href="https://www.amazon.com/gp/product/0465050654/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0465050654&amp;linkId=48b09d6605f7fcaa58922dcf4060a7df">The Design of Everyday Things</a>. The core idea of the book is that while people beat themselves up for failing to use hard-to-understand interfaces, we should blame designers for designing poor interfaces, not users for failing to use them.</p> <p>If you ever run into a door that you incorrectly try to pull instead of push (or vice versa) and have some spare time, try watching how other people use the door. Whenever I do this, I'll see something like half the people who try the door use it incorrectly. That's a design flaw!</p> <p>The Design of Everyday Things has made me a lot more receptive to API and UX feedback, and a lot less tolerant of programmers who say things like &quot;it's fine — everyone knows that the arguments to <code>foo</code> and <code>bar</code> just have to be given in the opposite order&quot; or &quot;Duh! 
Everyone knows that you just need to click on the menu <code>X</code>, select <code>Y</code>, navigate to tab <code>Z</code>, open <code>AA</code>, go to tab <code>AB</code>, and then slide the setting to <code>AC</code>.&quot;</p> <p>I don't think all of that reading was a waste of time, exactly, but I would have been better off picking a few sub-fields in CS or EE and learning about them, rather than reading the sorts of books O'Reilly and Manning produce.</p> <p>It's not that these books aren't useful; it's that almost all of them are written to make sense without any particular background beyond what any random programmer might have, and you can only get so much out of reading your 50th book targeted at random programmers. IMO, most non-academic conferences have the same problem. As a speaker, you want to give a talk that works for everyone in the audience, but a side effect of that is that many talks have relatively little educational value to experienced programmers who have been to a few conferences.</p> <p>I think I got positive things out of all that reading as well, but I don't know yet how to figure out what those things are.</p> <p>As a result of my reading, I also did two things that were, in retrospect, quite harmful.</p> <p>One was that I really got into functional programming and used a functional style everywhere I could. Immutability, higher-order X for any possible value of X, etc. The result was code that I could write and modify quickly that was incomprehensible to anyone but a couple of coworkers who were also into functional programming.</p> <p>The second big negative was that I became convinced that Perl was causing us a lot of problems. We had Perl scripts that were hard to understand and modify. They'd often be thousands of lines of code with only one or two functions and no tests, and they used every obscure Perl feature you could think of. Static! Magic sigils! Implicit everything! You name it, we used it. For me, the last straw was when I inserted a new function between two functions that didn't explicitly pass any arguments or return values and broke the script: one of the functions was returning a value into an implicit variable that the next function read, so putting another function in between the two closely coupled functions broke the script.</p> <p>After that, I convinced a bunch of people to use Ruby and started using it myself. The problem was that I only managed to convince half of my team to do this. The other half kept using Perl, which resulted in language fragmentation. Worse yet, in another group, they also got fed up with Perl, but started using Python, resulting in the company having code in Perl, Python, and Ruby.</p> <p>Centaur has an explicit policy of not telling people how to do anything, which precludes having team-wide or company-wide standards. Given the environment, using a &quot;better&quot; language seemed like a natural thing to do, but I didn't recognize the cost of fragmentation until, later in my career, I saw a company that uses standardization to good effect.</p> <p>Anyway, while I was causing horrific fragmentation, I also automated away most of my regression debug job. I got bored of spending 80% of my time at work reading and I started poking around for other things to do, which is something I continued for my entire time at Centaur. I like learning new things, so I did almost everything you can do related to chip design.
The only things I didn't do were circuit design (the TL of circuit design didn't want a non-specialist interfering in his area) and a few roles where I was told &quot;Dan, you can do that if you really want to, but we pay you too much to have you do it full-time.&quot;</p> <p>If I hadn't interviewed regularly (about once a year, even though I was happy with my job), I probably would've wondered if I was stunting my career by doing so many different things, because the big chip companies produce specialists pretty much exclusively. But in interviews I found that my experience was valued because it was something they couldn't get in-house. The irony is that every single role I was offered would have turned me into a specialist. Big chip companies talk about wanting their employees to move around and try different things, but when you dig into what that means, it's that they like to have people work one very narrow role for two or three years before moving on to their next very narrow role.</p> <p>For a while, I wondered if I was doomed to either eventually move to a big company and pick up a hyper-specialized role, or stay at Centaur for my entire career (not a bad fate — Centaur has, by far, the lowest attrition rate of any place I've worked because people like it so much). But I later found that software companies building hardware accelerators actually have generalist roles for hardware engineers, and that software companies have generalist roles for programmers, although that might be a moot point since most software folks would probably consider me an extremely niche specialist.</p> <p>Regardless of whether spending a lot of time in different hardware-related roles makes you think of me as a generalist or a specialist, I picked up a lot of skills which came in handy <a href="http://www.anandtech.com/show/10340/googles-tensor-processing-unit-what-we-know">when I worked on hardware accelerators</a>, but that don't really generalize to the pure software project I'm working on today. A lot of the meta-skills I learned transfer over pretty well, though.</p> <p>If I had to pick the three most useful meta-skills I learned back then, I'd say they were debugging, bug tracking, and figuring out how to approach hard problems.</p> <p>Debugging is a funny skill to claim to have because everyone thinks they know how to debug. For me, I wouldn't even say that I learned how to debug at Centaur, but that I learned how to be persistent. Non-deterministic hardware bugs are so much worse than non-deterministic software bugs that I always believe I can track down software bugs. In the absolute worst case, when there's a bug that isn't caught in logs and can't be caught in a debugger, I can always add tracing information until the bug becomes obvious. The same thing's true in hardware, but &quot;recompiling&quot; to add tracing information takes 3 months per &quot;recompile&quot;; compared to that experience, tracking down a software bug that takes three months to figure out feels downright pleasant.</p> <p>Bug tracking is another meta-skill that everyone thinks they have, but when I look at most projects I find that they literally don't know what bugs they have and they lose bugs all the time due to a failure to triage bugs effectively. I didn't even know that I'd developed this skill until after I left Centaur and saw teams that don't know how to track bugs. At Centaur, depending on the phase of the project, we'd have between zero and a thousand open bugs.
The people I worked with most closely kept a mental model of what bugs were open; this seemed totally normal at the time, and the fact that a bunch of people did this made it easy for people to be on the same page about the state of the project and which areas were ahead of schedule and which were behind.</p> <p>Outside of Centaur, I find that I'm lucky to even find one person who's tracking what the major outstanding bugs are. Until I've been on the team for a while, people are often uncomfortable with the idea of taking a major problem and putting it into a bug instead of fixing it immediately because they're so used to bugs getting forgotten that they don't trust bugs. But that's what bug tracking is for! I view this as analogous to teams whose test coverage is so low and staging system is so flaky that they don't trust themselves to make changes because they don't have confidence that issues will be caught before hitting production. It's a huge drag on productivity, but people don't really see it until they've seen the alternative.</p> <p>Perhaps the most important meta-skill I picked up was learning how to solve large problems. When I joined Centaur, I saw people solving problems I didn't even know how to approach. There were folks like Glenn Henry, a fellow from IBM back when IBM was at the forefront of computing, and Terry Parks, who Glenn called the best engineer he knew at IBM. It wasn't that they were 10x engineers; they didn't just work faster. In fact, I can probably type 10x as quickly as Glenn (a hunt and peck typist) and could solve trivial problems that are limited by typing speed more quickly than him. But Glenn, Terry, and some of the other wizards knew how to approach problems that I couldn't even get started on.</p> <p>I can't cite any particular a-ha moment. It was just eight years of work. When I went looking for problems to solve, Glenn would often hand me a problem that was slightly harder than I thought possible for me. I'd tell him that I didn't think I could solve the problem, he'd tell me to try anyway, and maybe 80% of the time I'd solve the problem. We repeated that for maybe five or six years before I stopped telling Glenn that I didn't think I could solve the problem. Even though I don't know when it happened, I know that I eventually started thinking of myself as someone who could solve any open problem that we had.</p> <h3 id="grad-school-again-2008-2010">Grad school, again (2008 - 2010)</h3> <p>At some point during my tenure at Centaur, I switched to being part-time and did a stint taking classes and doing a bit of research at the local university. For reasons which I can't recall, I split my time between software engineering and CS theory.</p> <p>I read a lot of software engineering papers and came to the conclusion that we know very little about what makes teams (or even individuals) productive, and that the field is unlikely to have actionable answers in the near future. I also got my name on a couple of papers that I don't think made meaningful contributions to the state of human knowledge.</p> <p>On the CS theory side of things, I took some graduate level theory classes. That was genuinely educational and I really &quot;got&quot; algorithms for the first time in my life, as well as complexity theory, etc. 
I could have gotten my name on a paper that I didn't think made a meaningful contribution to the state of human knowledge, but my would-be co-author felt the same way and we didn't write it up.</p> <p>I originally tried grad school again because I was considering getting a PhD, but I didn't find the work I was doing to be any more &quot;interesting&quot; than the work I had at Centaur, and after seeing the job outcomes of people in the program, I decided there was less than a 1% chance that a PhD would provide any real value to me and went back to Centaur full time.</p> <h3 id="rc-https-www-recurse-com-scout-click-t-b504af89e87b77920c9b60b2a1f6d5e8-spring-2013"><a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> (Spring 2013)</h3> <p>After eight years at Centaur, I wanted to do something besides microprocessors. I had enough friends at other hardware companies to know that I'd be downgrading in basically every dimension except name recognition if I switched to another hardware company, so I started applying to software jobs.</p> <p>While I was applying to jobs, I heard about <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>. It sounded great, maybe even too great: when I showed my friends what people were saying about it, they thought the comments were fake. It was a great experience, and I can see why so many people raved about it, to the point where real comments sound impossibly positive. It was transformative for a lot of people; I heard a lot of exclamations like &quot;I learned more in 3 months here than in N years of school&quot; or &quot;I was totally burnt out and this was the first time I've been productive in a year&quot;. It wasn't transformative for me, but it was as fun a 3 month period as I've ever had, and I even learned a thing or two.</p> <p>From a learning standpoint, the one major thing I got out of RC was feedback from <a href="https://github.com/majek">Marek</a>, whom I worked with for about two months. While the freedom and lack of oversight at Centaur were great for letting me develop my ability to work independently, I basically didn't get any feedback on my work<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">6</a></sup> since they didn't do code review while I was there, and I never really got any actionable feedback in performance reviews.</p> <p>Marek is really great at giving feedback while pair programming, and working with him broke me of a number of bad habits as well as teaching me some new approaches for solving problems. At a meta level, RC is more focused on pair programming than most places and it got me to pair program for the first time. I hadn't realized how effective pair programming with someone is in terms of learning how they operate and what makes them effective.
Since then, I've asked a number of super productive programmers to pair program and I've gotten something out of it every time.</p> <h3 id="second-real-job-2013-2014">Second real job (2013 - 2014)</h3> <p>I was in the right place at the right time to land on a project that was just transitioning from <a href="https://www.linkedin.com/in/andrew-phelps-31438b6">Andy Phelps' pet 20% time project</a> into what would later be called the <a href="https://www.google.com/patents/WO2016186801A1">Google</a> <a href="https://www.google.com/patents/US20160342889">TPU</a>.</p> <p>As far as I can tell, it was pure luck that I was the second engineer on the project as opposed to the fifth or the tenth. I got to see what it looks like to take a project from its conception and turn it into something real. There was a sense in which I got that at Centaur, but every project I worked on was either part of a CPU, or a tool whose goal was to make CPU development better. This was the first time I worked on a non-trivial project from its inception, where I wasn't just working on part of the project but the whole thing.</p> <p>That would have been educational regardless of the methodology used, but it was a particularly great learning experience because of how the design was done. We started with a lengthy discussion on what core algorithm we were going to use. After we figured out an algorithm that would give us acceptable performance, we wrote design docs for every major module before getting serious about implementation.</p> <p><a href="https://twitter.com/codinghorror/status/724541041827713024">Many people consider writing design docs to be a waste of time nowadays</a>, but going through this process, which took months, had a couple big advantages. The first is that working through a design collaboratively teaches everyone on the team everyone else's tricks. It's a lot like the kind of skill transfer you get with pair programming, but applied to design. This was great for me, because as someone with only a decade of experience, I was one of the least experienced people in the room.</p> <p>The second is that the iteration speed is much faster in the design phase, where throwing away a design just means erasing a whiteboard. Once you start coding, iterating on the design can mean throwing away code; for infrastructure projects, that can easily be person-years or even tens of person-years of work. Since working on the TPU project, I've seen a couple of teams on projects of similar scope insist on getting &quot;working&quot; code as soon as possible. In every single case, that resulted in massive delays as huge chunks of code had to be re-written, and in a few cases the project was fundamentally flawed in a way that required the team to start over from scratch.</p> <p>I get that on product-y projects, where you can't tell how much traction you're going to get from something, you might want to get an <a href="https://en.wikipedia.org/wiki/Minimum_viable_product">MVP</a> out the door and iterate, but for pure infrastructure, it's often possible to predict how useful something will be in the design phase.</p> <p>The other big thing I got out of the job was a better understanding of what's possible when a company makes a real effort to make engineers productive. Something I'd seen repeatedly at Centaur was that someone would come in, take a look around, find the tooling to be a huge productivity sink, and then make a bunch of improvements.
They'd then feel satisfied that they'd improved things a lot and then move on to other problems. Then the next new hire would come in, have the same reaction, and do the same thing. The result was tools that improved a lot while I was there, but not to the point where someone coming in would be satisfied with them. Google was the only place I'd worked where a lot of the tools seem like magic compared to what exists in the outside world<sup class="footnote-ref" id="fnref:magic"><a rel="footnote" href="#fn:magic">7</a></sup>. Sure, people complain that a lot of the tooling is falling over, that there isn't enough documentation, and that a lot of it is out of date. All true. But the situation is much better than it's been at any other company I've worked at. That doesn't seem to actually be a competitive advantage for Google's business, but it makes the development experience really pleasant.</p> <h3 id="third-real-job-2015-2017">Third real job (2015 - 2017)</h3> <p>This was a surprising experience. I think I'm too close to it to really know what I got out of the experience, so fully filling in this section is a TODO.</p> <p>One thing that was really interesting is that there are a lot of things I used to think of as &quot;table stakes&quot; for getting things done that it turns out one can do without. An example is version control. I was and still am strongly in favor of using version control, but the project I worked on with a TL who was strongly against version control was still basically successful. There was a lot of overhead until we started using version control, but dealing with the fallout of not having version control and having people not really sync changes only cost me a day or two a week of manually merging in changes in my private repo to get the build to consistently work. That's obviously far from ideal, but, across the entire team, not enough of a cost to make the difference between success and failure.</p> <h3 id="rc-https-www-recurse-com-scout-click-t-b504af89e87b77920c9b60b2a1f6d5e8-2017-present"><a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> (2017 - present)</h3> <p>I wanted a fun break after my last job, so I went back to RC to do fun programming-related stuff and recharge. I haven't written up most of what I've worked on (e.g., an analysis of 80k games on Terra Mystica, MTA (NYC) subway data analysis, etc.). I've written up a few things, like latency analysis of <a href="//danluu.com/input-lag/">computers</a>, <a href="//danluu.com/term-latency/">terminals</a>, <a href="//danluu.com/keyboard-latency/">keyboards</a>, and <a href="//danluu.com/web-bloat/">websites</a>, though.</p> <p>One thing my time at RC has got me thinking about is why it's so hard to get <a href="//twitter.com/danluu/status/931018170420400128">paid well</a> to write. There appears to be a lot of demand for &quot;good&quot; writing, but companies don't seem very willing to create roles for people who could program but want to write. Steve Klabnik has had a tremendous impact on Rust through his writing, probably more impact than the median programmer on most projects, but my impression is that he's taking a significant pay cut over what he could make as a programmer in order to do this really useful and important thing.</p> <p>I've tried pitching this kind of role at a few places and the response so far has mostly been a combination of:</p> <ul> <li>We value writing!
I don't think it makes sense to write full-time or even half-time, but you can join my team, where we support writing, and write as a 20%-time project or in your spare time!</li> <li>Uhhh, we could work something out, but why would anyone who can program want to write?</li> </ul> <p>Neither of these responses makes me think that writing would actually be as valued as programming on those teams, even if writing is more valued on those teams relative to most. There are some &quot;developer evangelist&quot; roles that involve writing, but when I read engineering blogs written by people with that title, most of the writing appears to be thinly disguised press releases (there are obviously exceptions to this, but even in the cases where blogs have interesting engineering output, the interesting output is often interleaved with pseudo press releases). In addition to being boring, that kind of thing seems pretty ineffective. At one company I worked for, I ran the traffic numbers for their developer evangelist blogs vs. my own blog, and there were a lot of months where my blog got more traffic than all of their hosted evangelist blogs combined. I don't think it's surprising to find that programmers would rather read explanations/analysis/history than PR, but it seems difficult to convince the right people of this, so I'll probably go back to a programming job after this. We'll see.</p> <p>BTW, this isn't to say that I don't enjoy programming or don't think that it's important. It's just that writing seems undervalued in a way that makes it relatively easy to have outsized impact through writing. But the same forces that make it easy to have outsized impact also make it difficult to get <a href="//danluu.com/startup-tradeoffs/">paid well</a>!</p> <h3 id="what-about-the-bad-stuff">What about the bad stuff?</h3> <p>When I think about my career, it seems to me that it's been one lucky event after the next. I've been unlucky a few times, but I don't really know what to take away from the times I've been unlucky.</p> <p>For example, I'd consider my upbringing to be mildly abusive. I remember having nights where I couldn't sleep because I'd have nightmares about my father every time I fell asleep. Being awake during the day wasn't a great experience, either. That's obviously not good and in retrospect it seems pretty directly related to the academic problems I had until I moved out, but I don't know that I could give useful advice to a younger version of myself. Don't be born into an abusive family? That's something people would already do if they had any control over the matter.</p> <p>Or to pick a more recent example, I once joined a team that scored a 1 on the <a href="http://www.joelonsoftware.com/articles/fog0000000043.html">Joel Test</a>. The Joel Test is now considered to be obsolete because it awards points for things like &quot;Do you have testers?&quot; and &quot;Do you fix bugs before writing new code?&quot;, which aren't considered best practices by most devs today. Of the items that aren't controversial, many seem so obvious that they're not worth asking about, things like:</p> <ul> <li>Do you use source control?</li> <li>Can you make a build in one step?</li> <li>Do you make (at least) daily builds?</li> <li>Do you have a bug database?</li> </ul> <p>For anyone who cares about this kind of thing, it's clearly not a great idea to join a team that does, at most, 1 item off of Joel's checklist (and the 1 wasn't any of the above).
Getting first-hand experience on a team that scored a 1 didn't give me any new information that would make me reconsider my opinion.</p> <p>You might say that I should have asked about those things. It's true! I should have, and I probably will in the future. However, when I was hired, the TL who was against version control and other forms of automation hadn't been hired yet, so I wouldn't have found out about this if I'd asked. Furthermore, even if he'd already been hired, I'm still not sure I would have found out about it — this is the only time I've joined a team and then found that most of the factual statements made during the recruiting process were untrue. I made sure to ask specific, concrete, questions about the state of the project, processes, experiments that had been run, etc., but it turned out the answers were outright falsehoods. When I was on that team, every day featured a running joke between team members about how false the recruiting pitch was!</p> <p>I could try to prevent similar problems in the future by asking for concrete evidence of factual claims (e.g., if someone claims the attrition rate is X, I could ask for access to the HR database to verify), but considering that I have a finite amount of time and the relatively low probability of being told outright falsehoods, I think I'm going to continue to prioritize finding out other information when I'm considering a job and just accept that there's a tiny probability I'll end up in a similar situation in the future.</p> <p>When I look at the bad career-related stuff I've experienced, almost all of it falls into one of two categories: something obviously bad that was basically unavoidable, or something obviously bad that I don't know how to reasonably avoid, given limited resources. I don't see much to learn from that. That's not to say that I haven't made and learned from mistakes. I've made a lot of mistakes and do a lot of things differently as a result of mistakes! But my worst experiences have come out of things that I don't know how to prevent in any reasonable way.</p> <p>This also seems to be true for most people I know. For example, something I've seen a lot is that a friend of mine will end up with a manager whose view is that managers are people who dole out rewards and punishments (as opposed to someone who believes that managers should make the team as effective as possible, or someone who believes that managers should help people grow). When you have a manager like that, a common failure mode is that you're given work that's a bad fit, and then maybe you don't do a great job because the work is a bad fit. If you ask for something that's a better fit, that's refused (why should you be rewarded with doing something you want when you're not doing good work, instead you should be punished by having to do more of this thing you don't like), which causes a spiral that ends in the person leaving or getting fired. In the most recent case I saw, the firing was a surprise to both the person getting fired and their closest co-workers: my friend had managed to find a role that was a good fit despite the best efforts of management; when management decided to fire my friend, they didn't bother to consult the co-workers on the new project, who thought that my friend was doing great and had been doing great for months!</p> <p>I hear a lot of stories like that, and I'm happy to listen because I like stories, but I don't know that there's anything actionable here. 
Avoid managers who prefer doling out punishments to helping their employees? Obvious but not actionable.</p> <h3 id="conclusion">Conclusion</h3> <p>The most common sort of career advice I see is &quot;you should do what I did because I'm successful&quot;. It's usually phrased differently, but that's the gist of it. That basically never works. When I compare notes with friends and acquaintances, it's pretty clear that my career has been unusual in a number of ways, but it's not really clear why.</p> <p>Just for example, I've almost always had a supportive manager who's willing to not only let me learn whatever I want on my own, but who's willing to expend substantial time and effort to help me improve as an engineer. Most folks I've talked to have never had that. Why the difference? I have no idea.</p> <p>One <a href="https://www.youtube.com/watch?v=RoEEDKwzNBw">story</a> might be: the two times I had unsupportive managers, I quickly found other positions, whereas a lot of friends of mine will stay in roles that are a bad fit for years. Maybe I could spin it to make it sound like the moral of the story is that you should leave roles sooner than you think, but both of the bad situations I ended up in, I only ended up in because I left a role sooner than I should have, so the advice can't be &quot;prefer to leave roles sooner than you think&quot;. Maybe the moral of the story should be &quot;leave bad roles more quickly and stay in good roles longer&quot;, but that's so obvious that it's not even worth stating. This is arguably non-obvious because people do, in fact, stay in roles where they're miserable, but when I think of people who do so, they fall into one of two categories. Either they're stuck for extrinsic reasons (e.g., need to wait out the visa clock) or they know that they should leave but can't bring themselves to do so. There's not much to do about the former case, and in the latter case, knowing that they should leave isn't the problem. Every strategy that I can think of is either incorrect in the general case, or so obvious there's no reason to talk about it.</p> <p>Another story might be: I've learned a lot of meta-skills that are valuable, so you should learn these skills. But you probably shouldn't. The particular set of meta-skills I've picked have been great for me because they're skills I could easily pick up in places I worked (often because I had a great mentor) and because they're things I really strongly believe in doing. Your circumstances and core beliefs are probably different from mine and you have to figure out for yourself what it makes sense to learn.</p> <p>Yet another story might be: while a lot of opportunities come from serendipity, I've had a lot of opportunities because I spend a lot of time generating possible opportunities. When I passed around the draft of this post to some friends, basically everyone told me that I emphasized luck too much in my narrative and that all of my lucky breaks came from a combination of hard work and trying to create opportunities. While there's a sense in which that's true, many of my opportunities also came out of making outright bad decisions.</p> <p>For example, I ended up at Centaur because I turned down the chance to work at IBM for a terrible reason! At the end of my internship, my manager made an attempt to convince me to stay on as a full-time employee, but I declined because I was going to grad school. 
But I was only going to grad school because I wanted to get a microprocessor logic design position, something I thought I couldn't get with just a bachelor's degree. But I could have gotten that position if I hadn't turned my manager down! I'd just forgotten the reason that I'd decided to go to grad school and incorrectly used the cached decision as a reason to turn down the job. By sheer luck, that happened to work out well and I got better opportunities than anyone I know from my intern cohort who decided to take a job at IBM. Have I &quot;mostly&quot; been lucky or prepared? Hard to say; maybe even impossible.</p> <p>Careers don't have the logging infrastructure you'd need to determine the impact of individual decisions. Careers in programming, anyway. Many sports now track play-by-play data in a way that makes it possible to try to determine how much of success in any particular game or any particular season was luck and how much was skill.</p> <p>Take baseball, which is one of the better understood sports. If we look at the statistical understanding we have of performance today, it's clear that almost no one had a good idea about what factors made players successful 20 years ago. One thing I find particularly interesting is that we now have a much better understanding of which factors are fundamental and which factors come down to luck, and it's not at all what almost anyone would have thought 20 years ago. We can now look at a pitcher and say something like &quot;they've gotten unlucky this season, but their foo, bar, and baz rates are all great so it appears to be bad luck on balls in play as opposed to any sort of decline in skill&quot;, and we can also make statements like &quot;they've done well this season but their fundamental stats haven't moved so it's likely that their future performance will be no better than their past performance before this season&quot;. We couldn't have made a statement like that 20 years ago. And this is a sport that's had play-by-play video available going back what seems like forever, where play-by-play stats have been kept for a century, etc.</p> <p>In this sport where everything is measured, it wasn't until relatively recently that we could disambiguate between fluctuations in performance due to luck and fluctuations due to changes in skill. And then there's programming, where it's generally believed to be impossible to measure people's performance and the state of the art in grading people's performance is that you ask five people for their comments on someone and then aggregate the comments. If we're only just now able to make comments on what's attributable to luck and what's attributable to skill in a sport where every last detail of someone's work is available, how could we possibly be anywhere close to making claims about what comes down to luck vs. other factors in something as nebulous as a programming career?</p> <p>In conclusion, life is messy and I don't have any advice.</p> <h3 id="appendix-a-meta-skills-i-d-like-to-learn">Appendix A: meta-skills I'd like to learn</h3> <h4 id="documentation">Documentation</h4> <p>I once worked with <a href="https://www.linkedin.com/in/jared-davis-a8b62611b">Jared Davis, a documentation wizard</a> whose documentation was so good that I'd go to him to understand how a module worked before I talked to the owner of the module.
As far as I could tell, he wrote documentation on things he was trying to understand to make life easier for himself, but his documentation was so good that it was a force multiplier for the entire company.</p> <p>Later, at Google, I noticed a curiously strong correlation between the quality of initial design docs and the success of projects. Since then, I've tried to write solid design docs and documentation for my projects, but I still have a ways to go.</p> <h4 id="fixing-totally-broken-danluu-com-wat-situations">Fixing <a href="//danluu.com/wat/">totally broken</a> situations</h4> <p>So far, I've only landed on teams where things are much better than average and on teams where things are much worse than average. You might think that, because there's so much low hanging fruit on teams that are much worse than average, it should be easier to improve things on teams that are terrible, but it's just the opposite. The places that have a lot of problems have problems because something makes it hard to fix the problems.</p> <p>When I joined the team that scored a 1 on the Joel Test, it took months of campaigning just to get everyone to use version control.</p> <p>I've never seen an environment go from &quot;bad&quot; to &quot;good&quot; and I'd be curious to know what that looks like and how it happens. Yossi Kreinen's thesis is that <a href="http://yosefk.com/blog/people-can-read-their-managers-mind.html">only management can fix broken situations</a>. That might be true, but I'm not quite ready to believe it just yet, even though I don't have any evidence to the contrary.</p> <h3 id="appendix-b-other-how-i-became-a-programmer-stories">Appendix B: other &quot;how I became a programmer&quot; stories</h3> <p><a href="https://www.mail-archive.com/kragen-tol@canonical.org/msg00184.html">Kragen</a>. Describes 27 years of learning to program. Heavy emphasis on conceptual phases of development (e.g., understanding how to use provided functions vs. understanding that you can write arbitrary functions)</p> <p><a href="http://jvns.ca/blog/2015/02/17/how-i-learned-to-program-in-10-years/">Julia Evans</a>. Started programming on a TI-83 in 2004. Dabbled in programming until college (2006-2011) and has been working as a professional programmer ever since. Some emphasis on the &quot;journey&quot; and how long it takes to improve.</p> <p><a href="http://pgbovine.net/how-i-learned-programming.htm">Philip Guo</a>. A non-traditional story of learning to program, which might be surprising if you know that Philip's career path was MIT -&gt; Stanford -&gt; Google.</p> <p><a href="https://github.com/tarmstrong/longcv/blob/master/bio.md">Tavish Armstrong</a>. 4th grade through college. Emphasis on particular technologies (e.g., LaTeX or Python).</p> <p><a href="https://caitiem.com/2013/03/30/origin-story-becoming-a-game-developer/">Caitie McCaffrey</a>. Started programming in AP computer science. Emphasis on how interests led to a career in programming.</p> <p><a href="http://mattdeboard.net/2011/11/23/how-i-became-a-programmer/">Matt DeBoard</a>. Spent 12 weeks learning Django with the help of a mentor. Emphasis on the fact that it's possible to become a programmer without programming background.</p> <p><a href="https://www.kchodorow.com/blog/2010/11/30/how-i-became-a-programmer/">Kristina Chodorow</a>. Started in college. Emphasis on alternatives (math, grad school).</p> <p><a href="http://michaelrbernste.in/2014/12/11/you-are-learning-haskell-right-now.html">Michael Bernstein</a>. 
Story of learning Haskell over the course of years. Emphasis on how long it took to become even minimally proficient.</p> <p><small> Thanks to Leah Hanson, Lindsey Kuper, Kelley Eskridge, Jeshua Smith, Tejas Sapre, Joe Wilder, Adrien Lamarque, Maggie Zhou, Lisa Neigut, Steve McCarthy, Darius Bacon, Kaylyn Gibilterra, Sarah Ransohoff, @HamsterRaging, Alex Allain, and &quot;biktian&quot; for comments/criticism/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">If you happen to have contact information for Mr. Swanson, I'd love to be able to send a note saying thanks. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:B">Wayne Dickey, Richard Brualdi, Andreas Seeger, and a visiting professor whose name escapes me. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:W">I strongly recommend Andy Weiner for any class, as well as the guy who taught mathematical physics when I sat in on it, but I don't remember who that was or if that's even the exact name of the class. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:G">with the exception of one government lab, which gave me an offer on the strength of a non-technical on-campus interview. I believe that was literally the first interview I did when I was looking for work, but they didn't get back to me until well after interview season was over and I'd already accepted an offer. I wonder if that's because they went down the list of candidates in some order and only got to me after N people turned them down or if they just had a six month latency on offers. <a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:E"><p>Because Intel sees no reason to keep its competitors informed about what it's doing, this results in a substantial latency when matching new features. They usually announce enough information that you can implement the basic functionality, but behavior on edge cases may vary. We once had a bug (noticed and fixed well before we shipped, but still problematic) where we bought an engineering sample off of eBay and implemented some new features based on the engineering sample. This resulted in an MWAIT bug that caused Windows to hang; Intel had changed the behavior of MWAIT between shipping the engineering sample and shipping the final version.</p> <p>I recently saw a post that claims that you can get great performance per dollar by buying some engineering samples off of eBay. Don't do this. Engineering samples regularly have bugs. Sometimes those bugs are actual bugs, and sometimes it's just that Intel changed their minds. Either way, you really don't want to run production systems off of engineering samples.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:F">I occasionally got feedback by taking a problem I'd solved to someone and asking them if they had any better ideas, but that's much less in depth than the kind of feedback I'm talking about here. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:magic"><p>To pick one arbitrary concrete example, <a href="http://moishelettvin.blogspot.ca/2006/11/windows-shutdown-crapfest.html">look at version control at Microsoft</a> from someone who worked on Windows Vista:</p> <blockquote> <p>In small programming projects, there's a central repository of code. Builds are produced, generally daily, from this central repository. 
Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.</p> <p>In Windows, this model breaks down simply because there are far too many developers to access one central repository. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.</p> </blockquote> <p>Google and Microsoft both maintained their own forks of perforce because that was the most scalable source control system available at the time. Google would go on to build piper, a distributed version control system (in the distributed systems sense, not in the git sense) that solved the scaling problem while having a dev experience that wasn't nearly as painful. But that option wasn't really on the table at Microsoft. In the comments to the post quoted above, a then-manager at Microsoft commented that the possible options were:</p> <blockquote> <ol> <li>federate out the source tree, and pay the forward and reverse integration taxes (primarily delay in finding build breaks), or...</li> <li>remove a large number of the unnecessary dependencies between the various parts of Windows, especially the circular dependencies.</li> <li>Both 1&amp;2</li> </ol> <p>#1 was the winning solution in large part because it could be executed by a small team over a defined period of time. #2 would have required herding all the Windows developers (and PMs, managers, UI designers...), and is potentially an unbounded problem.</p> </blockquote> <p>Someone else commented to me that they were on an offshoot team that got the one-way latency down from months to weeks. That's certainly an improvement, but why didn't anyone build a system like piper? I asked that question of people who were at Microsoft at the time, and I got answers like &quot;when we started using perforce, it was so much faster than what we'd previously had that it didn't occur to people that we could do much better&quot; and &quot;perforce was so much faster than xcopy that it seemed like magic&quot;.</p> <p>This general phenomenon, where people don't attempt to make a major improvement because the current system is already such a huge improvement over the previous system, is something I'd seen before and even something I'd done before. This example happens to use Microsoft and Google, but please don't read too much into that. There are systems where things are flipped around and the system at Google is curiously unwieldy compared to the same system at Microsoft.</p> <a class="footnote-return" href="#fnref:magic"><sup>[return]</sup></a></li> </ol> </div> Notes on concurrency bugs concurrency-bugs/ Thu, 04 Aug 2016 20:32:26 -0700 concurrency-bugs/ <p>Do concurrency bugs matter?
From the literature, we know that <a href="http://danluu.com/postmortem-lessons/">most reported bugs in distributed systems</a> have really simple causes and can be caught by trivial tests, even when we only look at bugs that cause really bad failures, like loss of a cluster or data corruption. The filesystem literature echoes this result -- <a href="http://danluu.com/file-consistency/">a simple checker that looks for totally unimplemented error handling can find hundreds of serious data corruption bugs</a>. Most bugs are simple, at least if you measure by bug count. But if you measure by debugging time, the story is a bit different.</p> <p>Just from personal experience, I've spent more time debugging complex non-deterministic failures than all other types of bugs combined. In fact, I've spent more time debugging some individual non-deterministic bugs (weeks or months) than on all other bug types combined. Non-deterministic bugs are rare, but they can be extremely hard to debug and they're a productivity killer. Bad non-deterministic bugs take so long to debug that relatively large investments in tools and prevention can be worth it<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Let's see what the academic literature has to say on non-deterministic bugs. There's a lot of literature out there, so let's narrow things down by looking at one relatively well studied area: concurrency bugs. We'll start with the literature on single-machine concurrency bugs and then look at distributed concurrency bugs.</p> <h3 id="fonseca-et-al-dsn-10-http-concurrency-mpi-sws-org-dsn2010-concurrencybugs-pdf"><a href="http://concurrency.mpi-sws.org/dsn2010-concurrencybugs.pdf">Fonseca et al. DSN '10</a></h3> <p>They studied MySQL concurrency bugs from 2003 to 2009 and found the following:</p> <h4 id="more-non-deadlock-bugs-63-than-deadlock-https-en-wikipedia-org-wiki-deadlock-bugs-40">More non-deadlock bugs (63%) than <a href="https://en.wikipedia.org/wiki/Deadlock">deadlock</a> bugs (40%)</h4> <p>Note that these numbers sum to more than 100% because some bugs are tagged with multiple causes. This is roughly in line with the Lu et al. ASPLOS '08 paper (which we'll look at later), which found that 30% of the bugs they examined were deadlock bugs.</p> <h4 id="15-of-examined-failures-were-semantic">15% of examined failures were semantic</h4> <p>The paper defines a semantic failure as one &quot;where the application provides the user with a result that violates the intended semantics of the application&quot;. The authors also find that &quot;the vast majority of semantic bugs (92%) generated subtle violations of application semantics&quot;. By their nature, these failures are likely to be undercounted -- it's pretty hard to miss a deadlock, but it's easy to miss subtle data corruption.</p> <h4 id="15-of-examined-failures-were-latent">15% of examined failures were latent</h4> <p>The paper defines latent failures as ones that &quot;do not become immediately visible to users.&quot; Unsurprisingly, the paper finds that latent failures are closely related to semantic failures; 92% of latent failures are semantic and vice versa. The 92% number makes this finding sound more precise than it really is -- it's just that 11 out of the 12 semantic failures are latent and vice versa.
That could have easily been 11 out of 11 (100%) or 10 out of 12 (83%).</p> <p>That's interesting, but it's hard to tell from that if the results generalize to projects that aren't databases, or even projects that aren't MySQL.</p> <h3 id="lu-et-al-asplos-08-http-web1-cs-columbia-edu-junfeng-10fa-e6998-papers-concurrency-bugs-pdf"><a href="http://web1.cs.columbia.edu/~junfeng/10fa-e6998/papers/concurrency-bugs.pdf">Lu et al. ASPLOS '08</a></h3> <p>They looked at concurrency bugs in MySQL, Firefox, OpenOffice, and Apache. Some of their findings are:</p> <h4 id="97-of-examined-non-deadlock-bugs-were-atomicity-violation-or-order-violation-bugs">97% of examined non-deadlock bugs were atomicity-violation or order-violation bugs</h4> <p>Of the 74 non-deadlock bugs studied, 51 were atomicity bugs, 24 were ordering bugs, and 2 were categorized as &quot;other&quot;.</p> <p>An example of an atomicity violation is this bug from MySQL:</p> <p>Thread 1:</p> <pre><code>if (thd-&gt;proc_info) fputs(thd-&gt;proc_info, ...) </code></pre> <p>Thread 2:</p> <pre><code>thd-&gt;proc_info = NULL; </code></pre> <p>For anyone who isn't used to C or C++, <code>thd</code> is a pointer, and <code>-&gt;</code> is the operator to access a field through a pointer. The first line in thread 1 checks if the field is null. The second line calls <code>fputs</code>, which writes the field. The intent is to only call <code>fputs</code> if and only if <code>proc_info</code> isn't <code>NULL</code>, but there's nothing preventing another thread from setting <code>proc_info</code> to <code>NULL</code> &quot;between&quot; the first and second lines of thread 1.</p> <p>Like most bugs, this bug is obvious in retrospect, but if we look at the original bug report, <a href="http://bugs.mysql.com/bug.php?id=3596">we can see that it wasn't obvious at the time</a>:</p> <blockquote> <p>Description: I've just noticed with the latest bk tree than MySQL regularly crashes in InnoDB code ... How to repeat: I've still no clues on why this crash occurs.</p> </blockquote> <p>As is common with large codebases, fixing the bug once it was diagnosed was more complicated than it first seemed. This bug was partially fixed in 2004, <a href="https://bugs.mysql.com/bug.php?id=38883">resurfaced again and was fixed in 2008</a>. <a href="https://bugs.mysql.com/bug.php?id=38816">A fix for another bug caused a regression in 2009</a>, which was also fixed in 2009. That fix introduced <a href="https://bugs.mysql.com/bug.php?id=60682">a deadlock that was found in 2011</a>.</p> <p>An example ordering bug is the following bug from Firefox:</p> <p>Thread 1:</p> <pre><code>mThread=PR_CreateThread(mMain, ...); </code></pre> <p>Thread 2:</p> <pre><code>void mMain(...) { mState = mThread-&gt;State; } </code></pre> <p><code>Thread 1</code> launches <code>Thread 2</code> with <code>PR_CreateThread</code>. <code>Thread 2</code> assumes that, because the line that launched it assigned to <code>mThread</code>, <code>mThread</code> is valid. But <code>Thread 2</code> can start executing before <code>Thread 1</code> has assigned to <code>mThread</code>! 
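</p> <p>To make the ordering hazard concrete, here's a minimal, hypothetical C++ sketch of the same pattern and one common way to repair it. This is not Firefox's actual code; the names (<code>Worker</code>, <code>self_ready</code>, etc.) are made up for illustration. The idea is that the child thread must not touch the handle until the parent has published it, which is enforced here with a mutex and a condition variable:</p> <pre><code>#include &lt;condition_variable&gt;
#include &lt;mutex&gt;
#include &lt;thread&gt;

struct Worker {
    std::thread self;                 // assigned by the parent after launch
    std::mutex m;
    std::condition_variable cv;
    bool self_ready = false;          // set once the parent has assigned `self`
    std::thread::id observed_id;

    void run() {
        // Without the wait below, this thread could read `self` before the
        // parent's assignment to `self` has completed -- the same ordering
        // hazard as the Firefox bug above.
        std::unique_lock&lt;std::mutex&gt; lock(m);
        cv.wait(lock, [this] { return self_ready; });
        observed_id = self.get_id();  // safe: the assignment happened-before
    }
};

int main() {
    Worker w;
    w.self = std::thread(&amp;Worker::run, &amp;w);  // parent assigns the handle...
    {
        std::lock_guard&lt;std::mutex&gt; lock(w.m);
        w.self_ready = true;                 // ...and only then flips the flag
    }
    w.cv.notify_one();
    w.self.join();
}
</code></pre> <p>An often simpler fix with the same effect is to not share the handle at all and instead hand the child thread everything it needs as arguments when it's created.</p> <p>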
The authors note that they call this an ordering bug and not an atomicity bug even though the bug could have been prevented if the line in thread 1 were atomic because their &quot;bug pattern categorization is based on root cause, regardless of possible fix strategies&quot;.</p> <p>An example of an &quot;other&quot; bug, one of only two studied, is this bug in MySQL:</p> <p>Threads 1...n:</p> <pre><code>rw_lock(&amp;lock); </code></pre> <p>Watchdog thread:</p> <pre><code>if (lock_wait_time[i] &gt; fatal_timeout)
  assert(0);
</code></pre> <p>This can cause a spurious crash when there's more than the expected amount of work. Note that the study doesn't look at performance bugs, so a bug where lock contention causes things to slow to a crawl but a watchdog doesn't kill the program wouldn't be considered.</p> <p>An aside that's probably a topic for another post is that hardware often has deadlock or livelock detection built in, and that when a deadlock or livelock is detected, hardware will often try to push things into a state where normal execution can continue. After detecting and breaking the deadlock/livelock, an error will typically be logged in a way that will be noticed if the issue is caught in the lab, but that external customers won't see. For some reason, that strategy seems rare in the software world, although it seems like it should be easier in software than in hardware.</p> <p><a href="https://en.wikipedia.org/wiki/Deadlock">Deadlock occurs if and only if the following four conditions are true</a>:</p> <ol> <li>Mutual exclusion: at least one resource must be held in a non-shareable mode. Only one process can use the resource at any given instant of time.</li> <li>Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.</li> <li>No preemption: a resource can be released only voluntarily by the process holding it.</li> <li>Circular wait: a process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource.</li> </ol> <p>There's nothing about these conditions that's unique to either hardware or software, and it's easier to build mechanisms that can back off and replay to relax (2) in software than in hardware. Anyway, back to the study findings.</p> <h4 id="96-of-examined-concurrency-bugs-could-be-reproduced-by-fixing-the-relative-order-of-2-specific-threads">96% of examined concurrency bugs could be reproduced by fixing the relative order of 2 specific threads</h4> <p>This sounds like great news for testing. Testing only orderings between thread pairs is much more tractable than testing all orderings between all threads. Similarly, 92% of examined bugs could be reproduced by fixing the order of four (or fewer) memory accesses. However, there's a kind of sampling bias here -- only bugs that could be reproduced could be analyzed for a root cause, and bugs that only require ordering between two threads or only a few memory accesses are easier to reproduce.</p> <h4 id="97-of-examined-deadlock-bugs-were-caused-by-two-threads-waiting-for-at-most-two-resources">97% of examined deadlock bugs were caused by two threads waiting for at most two resources</h4> <p>Moreover, 22% of examined deadlock bugs were caused by a thread acquiring a resource held by the thread itself. 
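</p> <p>To make that concrete, here's a minimal, hypothetical example (not one of the bugs from the study) of the self-deadlock pattern: a public method takes a non-recursive lock and then calls an internal helper that tries to take the same lock again. The usual fix is to keep an internal variant of the helper that assumes the lock is already held (a recursive mutex also works, but tends to paper over the design problem).</p> <pre><code>// self_deadlock.cc -- hypothetical example of a thread deadlocking on a
// lock it already holds.
#include &lt;cstddef&gt;
#include &lt;mutex&gt;
#include &lt;vector&gt;

class Cache {
 public:
  void Insert(int key) {
    std::lock_guard&lt;std::mutex&gt; lock(m_);
    keys_.push_back(key);
    EvictIfNeeded();  // bug: called with m_ already held by this thread
  }

 private:
  void EvictIfNeeded() {
    // Second acquisition of a non-recursive std::mutex on the same thread:
    // undefined behavior, and in practice the thread hangs here forever.
    std::lock_guard&lt;std::mutex&gt; lock(m_);
    if (keys_.size() &gt; kMaxKeys) keys_.erase(keys_.begin());
  }

  static constexpr std::size_t kMaxKeys = 128;
  std::mutex m_;
  std::vector&lt;int&gt; keys_;
};

int main() {
  Cache c;
  c.Insert(42);  // never returns
}
</code></pre> <p>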
The authors state that pairwise testing of acquisition and release sequences should be able to catch most deadlock bugs, and that pairwise testing of thread orderings should be able to catch most non-deadlock bugs. The claim seems plausibly true when read as written; the implication seems to be that virtually all bugs can be caught through some kind of pairwise testing, but I'm a bit skeptical of that due to the sample bias of the bugs studied.</p> <p>I've seen bugs with many moving parts take months to track down. The worst bug I've seen consumed nearly a person-year's worth of time. Bugs like that mostly don't make it into studies like this because it's rare that a job allows someone the time to chase bugs that elusive. How many bugs like that are out there is still an open question.</p> <h4 id="caveats">Caveats</h4> <p>Note that all of the programs studied were written in C or C++, and that this study predates C++11. Moving to C++11 and using atomics and scoped locks would probably change the numbers substantially, not to mention moving to an entirely different concurrency model. There's some academic work on how different concurrency models affect bug rates, but it's not really clear how that work generalizes to codebases as large and mature as the ones studied, and by their nature, large and mature codebases are hard to do randomized trials on when the trial involves changing the fundamental primitives used. The authors note that 39% of examined bugs could have been prevented by using transactional memory, but it's not clear how many other bugs might have been introduced if transactional memory were used.</p> <h3 id="tools">Tools</h3> <p>There are other papers on characterizing single-machine concurrency bugs, but in the interest of space, I'm going to skip those. There are also papers on distributed concurrency bugs, but before we get to that, let's look at some of the tooling for finding single-machine concurrency bugs that's in the literature. I find the papers to be pretty interesting, especially the model checking work, but realistically, I'm probably not going to build a tool from scratch if something is available, so let's look at what's out there.</p> <h4 id="hapset">HapSet</h4> <p>Uses run-time coverage to generate interleavings that haven't been covered yet. This is out of NEC labs; googling <code>NEC labs HapSet</code> returns the paper, <a href="https://www.google.com/patents/US20120089873">some patent listings</a>, but no obvious download for the tool.</p> <h4 id="chess-https-chesstool-codeplex-com"><a href="https://chesstool.codeplex.com/">CHESS</a></h4> <p>Generates unique interleavings of threads for each run. They claim that, by not tracking state, the checker is much simpler than it would otherwise be, and that they're able to avoid many of the disadvantages of tracking state via a detail that can't properly be described in this tiny little paragraph; read the paper if you're interested! Supports C# and C++. The page claims that it requires Visual Studio 2010 and that it's only been tested with 32-bit code. I haven't tried to run this on a modern *nix compiler, but IME requiring Visual Studio 2010 means that it would be a moderate effort to get it running on a modern version of Visual Studio, and a substantial effort to get it running on a modern version of gcc or clang. 
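</p> <p>Availability aside, the basic idea behind this class of tools is easy to sketch. The toy program below (plain sequential C++; this is not how CHESS itself is implemented, it just shows the flavor of the approach) models two threads as numbered steps over shared state borrowed from the MySQL example earlier, enumerates every interleaving, and reports which schedules trip the bug.</p> <pre><code>// toy_explore.cc -- toy sketch of systematic interleaving exploration.
#include &lt;cstdio&gt;
#include &lt;vector&gt;

struct State {
  bool proc_info_set = true;   // thd-&gt;proc_info != NULL
  bool saw_non_null = false;   // result of thread 0's check
  bool crashed = false;        // fputs() would be called on a NULL pointer
};

// Run one step of one thread. Thread 0 is the check-then-use thread,
// thread 1 is the thread that clears proc_info.
void RunStep(State&amp; s, int thread, int step) {
  if (thread == 0 &amp;&amp; step == 0) s.saw_non_null = s.proc_info_set;
  if (thread == 0 &amp;&amp; step == 1 &amp;&amp; s.saw_non_null &amp;&amp; !s.proc_info_set) s.crashed = true;
  if (thread == 1 &amp;&amp; step == 0) s.proc_info_set = false;
}

// Enumerate every interleaving of the two threads' remaining steps and run
// each complete schedule against a fresh copy of the shared state.
void Explore(std::vector&lt;int&gt;&amp; schedule, int steps_left[2]) {
  if (steps_left[0] == 0 &amp;&amp; steps_left[1] == 0) {
    State s;
    int step_index[2] = {0, 0};
    for (int thread : schedule) RunStep(s, thread, step_index[thread]++);
    std::printf(&quot;schedule&quot;);
    for (int thread : schedule) std::printf(&quot; T%d&quot;, thread);
    std::printf(&quot;: %s\n&quot;, s.crashed ? &quot;CRASH&quot; : &quot;ok&quot;);
    return;
  }
  for (int thread = 0; thread &lt; 2; thread++) {
    if (steps_left[thread] == 0) continue;
    schedule.push_back(thread);
    steps_left[thread]--;
    Explore(schedule, steps_left);
    steps_left[thread]++;
    schedule.pop_back();
  }
}

int main() {
  std::vector&lt;int&gt; schedule;
  int steps_left[2] = {2, 1};  // thread 0 has two steps, thread 1 has one
  Explore(schedule, steps_left);
}
</code></pre> <p>Running it prints three schedules, exactly one of which (check, clear, use) hits the crash; real tools have to do this against actual schedulers and prune or bound the search, which is where all of the interesting engineering is.</p> <p>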
A quick Google search indicates that CHESS might be patent encumbered<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup>.</p> <h4 id="maple-https-github-com-jieyu-maple"><a href="https://github.com/jieyu/maple">Maple</a></h4> <p>Uses coverage to generate interleavings that haven't been covered yet. Instruments pthreads. The source is up on GitHub. It's possible this tool is still usable, and I'll probably give it a shot at some point, but it depends on at least one old, apparently unmaintained tool (PIN, a binary instrumentation tool from Intel). Googling (Binging?) for either Maple or PIN gives a number of results where people can't even get the tool to compile, let alone use the tool.</p> <h4 id="pacer">PACER</h4> <p>Samples using the FastTrack algorithm in order to keep overhead low enough &quot;to consider in production software&quot;. Ironically, this was implemented on top of the Jikes RVM, which is unlikely to be used in actual production software. The only reference I could find for an actually downloadable tool is <a href="https://github.com/pangloss/pacer">a completely different pacer</a>.</p> <h4 id="conlock-magiclock-magicfuzzer">ConLock / MagicLock / MagicFuzzer</h4> <p>There's a series of tools from one group that claims to get good results using various techniques, but AFAICT the source isn't available for any of the tools. There's a page that claims there's a version of MagicFuzzer available, but it's a link to a binary that doesn't specify what platform the binary is for and the link 404s.</p> <h4 id="omen-wolf">OMEN / WOLF</h4> <p>I couldn't find a page for these tools (other than their papers), let alone a download link.</p> <h4 id="sherlock-atomchase-racageddon">SherLock / AtomChase / Racageddon</h4> <p>Another series of tools that aren't obviously available.</p> <h3 id="tools-you-can-actually-easily-use">Tools you can actually easily use</h3> <h4 id="valgrind-drd-helgrind">Valgrind / DRD / Helgrind</h4> <p>Instruments pthreads and easy to use -- just run valgrind with the appropriate options (<code>--tool=drd</code> or <code>--tool=helgrind</code>) on the binary. May require <a href="http://valgrind.org/docs/manual/drd-manual.html#drd-manual.C++11">a couple tweaks</a> if using C++11 threading.</p> <h4 id="clang-thread-sanitizer-http-clang-llvm-org-docs-threadsanitizer-html-tsan"><a href="http://clang.llvm.org/docs/ThreadSanitizer.html">clang thread sanitizer</a> (TSan)</h4> <p>Can find data races. Flags when <a href="http://stackoverflow.com/questions/11970428/how-to-understand-happens-before-consistent">happens-before</a> is violated. Works with pthreads and C++11 threads. Easy to use (just pass <code>-fsanitize=thread</code> to clang).</p> <p>A side effect of being so easy to use and actually available is that <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43217.pdf">tsan has had a very large impact in the real world</a>:</p> <blockquote> <p>One interesting incident occurred in the open source Chrome browser. Up to 15% of known crashes were attributed to just one bug [5], which proved difficult to understand - the Chrome engineers spent over 6 months tracking this bug without success. On the other hand, the TSAN V1 team found the reason for this bug in a 30 minute run, without even knowing about these crashes. The crashes were caused by data races on a couple of reference counters. 
Once this reason was found, a relatively trivial fix was quickly made and patched in, and subsequently the bug was closed.</p> </blockquote> <h4 id="clang-wthread-safety">clang <code>-Wthread-safety</code></h4> <p>Static analysis that uses annotations on shared state to determine if state wasn't correctly guarded.</p> <h4 id="findbugs-https-en-wikipedia-org-wiki-findbugs"><a href="https://en.wikipedia.org/wiki/FindBugs">FindBugs</a></h4> <p>General static analysis for Java with many features. Has <code>@GuardedBy</code> annotations, similar to <code>-Wthread-safety</code>.</p> <h4 id="checkerframework-http-types-cs-washington-edu-checker-framework"><a href="http://types.cs.washington.edu/checker-framework/">CheckerFramework</a></h4> <p>Java framework for writing checkers. Has many different checkers. For concurrency in particular, uses <code>@GuardedBy</code>, like FindBugs.</p> <h4 id="rr-http-rr-project-org"><a href="http://rr-project.org/">rr</a></h4> <p>Deterministic replay for debugging. Easy to get and use, and appears to be actively maintained. Adds support for time-travel debugging in gdb.</p> <h4 id="drdebug-https-software-intel-com-en-us-articles-pintool-drdebug-pinplay-https-sites-google-com-site-pinplaypldi2016tutorial"><a href="https://software.intel.com/en-us/articles/pintool-drdebug">DrDebug</a>/<a href="https://sites.google.com/site/pinplaypldi2016tutorial/">PinPlay</a></h4> <p>General toolkit that can give you deterministic replay for debugging. Also gives you &quot;dynamic slicing&quot;, which is watchpoint-like: it can tell you what statements affected a variable, as well as what statements are affected by a variable. Currently Linux only; claims Windows and Android support coming soon.</p> <h3 id="other-tools">Other tools</h3> <p>This isn't an exhaustive list -- there's a ton of literature on this, and this is an area where, frankly, I'm pretty unlikely to have the time to implement a tool myself, so there's not much value for me in reading more papers to find out about techniques that I'd have to implement myself<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">3</a></sup>. However, I'd be interested in hearing about other tools that are usable.</p> <p>One thing I find interesting about this is that almost all of the papers for the academic tools claim to do something novel that lets them find bugs not found by other tools. They then run their tool on some codebase and show that the tool is capable of finding new bugs. But since almost no one goes and runs the older tools on any codebase, you'd never know if one of the newer tools only found a subset of the bugs that one of the older tools could catch.</p> <p>Furthermore, you see cycles (livelock?) in how papers claim to be novel. Paper I will claim that it does X. Paper II will claim that it's novel because it doesn't need to do X, unlike Paper I. Then Paper III will claim that it's novel because, unlike Paper II, it does X.</p> <h3 id="distributed-systems">Distributed systems</h3> <p>Now that we've looked at some of the literature on single-machine concurrency bugs, what about distributed concurrency bugs?</p> <h3 id="leesatapornwongsa-et-al-asplos-2016-http-ucare-cs-uchicago-edu-pdf-asplos16-taxdc-pdf"><a href="http://ucare.cs.uchicago.edu/pdf/asplos16-TaxDC.pdf">Leesatapornwongsa et al. ASPLOS 2016</a></h3> <p>They looked at 104 bugs in Cassandra, MapReduce, HBase, and Zookeeper. 
Let's look at some example bugs, which will clarify the terminology used in the study and make it easier to understand the main findings.</p> <h4 id="message-message-race">Message-message race</h4> <p><a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">This diagram</a> is just for reference, so that we have a high-level idea of how different parts fit together in MapReduce:</p> <p><img src="images/concurrency-bugs/mapreduce_block_diagram.gif" alt="Block diagram of MapReduce"></p> <p><a href="https://issues.apache.org/jira/browse/MAPREDUCE-3274">In MapReduce bug #3274</a>, a resource manager sends a task-init message to a node manager. Shortly afterwards, an application master sends a task-kill preemption to the same node manager. The intent is for the task-kill message to kill the task that was started with the task-init message, but the task-kill can win the race and arrive before the task-init. This example happens to be a case where two messages from different nodes are racing to get to a single node.</p> <p>Another example is <a href="https://issues.apache.org/jira/browse/MAPREDUCE-5358">MapReduce bug #5358</a>, where an application master sends a kill message to a node manager running a speculative task because another copy of the task finished. However, before the message is received by the node manager, the node manager's task completes, causing a complete message to be sent to the application master, which results in an exception because a <code>complete</code> message was received after the task had already completed.</p> <h4 id="message-compute-race">Message-compute race</h4> <p>One example is <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4157">MapReduce bug #4157</a>, where the application master unregisters with the resource manager. The application master then cleans up, but that clean-up races against the resource manager sending kill messages to the application's containers via node managers, causing the application master to get killed. Note that this is classified as a race and not an atomicity bug, which we'll get to shortly.</p> <p>Compute-compute races can happen, but they're outside the scope of this study since this study only looks at distributed concurrency bugs.</p> <h4 id="atomicity-violation">Atomicity violation</h4> <p>For the purposes of this study, atomicity bugs are defined as &quot;whenever a message comes in the middle of a set of events, which is a local computation or global communication, but not when the message comes either before or after the events&quot;. According to this definition, the message-compute race we looked at above isn't an atomicity bug because it would still be a bug if the message came in before the &quot;computation&quot; started. This definition also means that hardware failures that occur inside a block that must be atomic are not considered atomicity bugs.</p> <p>I can see why you'd want to define those as separate types of bugs, but I find this a bit counterintuitive, since I consider all of these to be different kinds of atomicity bugs -- they're all caused by breaking up something that needs to be atomic.</p> <p>In any case, by the definition of this study, <a href="https://issues.apache.org/jira/browse/MAPREDUCE-5009">MapReduce bug #5009</a> is an atomicity bug. A node manager is in the process of committing data to HDFS. The resource manager kills the task, which doesn't cause the commit state to change. 
Any time the node tries to rerun the commit task, the task is killed by the application master because a commit is believed to already be in progress.</p> <h4 id="fault-timing">Fault timing</h4> <p>A fault is defined to be a &quot;component failure&quot;, such as a crash, timeout, or unexpected latency. At one point, the paper refers to &quot;hardware faults such as machine crashes&quot;, which seems to indicate that some faults that could be considered software faults are defined as hardware faults for the purposes of this study.</p> <p>Anyway, for the purposes of this study, an example of a fault-timing issue is <a href="https://issues.apache.org/jira/browse/MAPREDUCE-3858">MapReduce bug #3858</a>. A node manager crashes while committing results. When the task is re-run, later attempts to commit all fail.</p> <h4 id="reboot-timing">Reboot timing</h4> <p>In this study, reboots are classified separately from other faults. <a href="https://issues.apache.org/jira/browse/MAPREDUCE-3186">MapReduce bug #3186</a> illustrates a reboot bug.</p> <p>A resource manager sends a job to an application master. If the resource manager is rebooted before the application master sends a commit message back to the resource manager, the resource manager loses its state and throws an exception because it's getting an unexpected complete message.</p> <p>Some of their main findings are:</p> <h4 id="47-of-examined-bugs-led-to-latent-failures">47% of examined bugs led to latent failures</h4> <p>That's a pretty large difference when compared to the DSN '10 paper, which found that 15% of examined multithreading bugs were latent failures. It's plausible that this is a real difference and not just something due to a confounding variable, but it's hard to tell from the data.</p> <h4 id="63-of-examined-bugs-were-related-to-hardware-faults">63% of examined bugs were related to hardware faults</h4> <p>This is a large difference from what studies on &quot;local&quot; concurrency bugs found. I wonder how much of that is just because <a href="http://danluu.com/file-consistency/#error-frequency">people mostly don't even bother filing and fixing bugs on hardware faults in non-distributed software</a>.</p> <h4 id="64-of-examined-bugs-were-triggered-by-a-single-message-s-timing">64% of examined bugs were triggered by a single message's timing</h4> <p>44% were ordering violations, and 20% were atomicity violations. Furthermore, over 90% of bugs involved three or fewer messages.</p> <p>32% of examined bugs were due to fault or reboot timing. Note that, for the purposes of the study, a hardware fault or a reboot that breaks up a block that needed to be atomic isn't considered an atomicity bug -- here, atomicity bugs are bugs where a message arrives in the middle of a computation that needs to be atomic.</p> <h4 id="70-of-bugs-had-simple-fixes">70% of bugs had simple fixes</h4> <p>30% were fixed by ignoring the badly timed message and 40% were fixed by delaying or ignoring the message.</p> <h4 id="bug-causes">Bug causes?</h4> <p>After reviewing the bugs, the authors propose common fallacies that lead to bugs:</p> <ol> <li>One hop is faster than two hops</li> <li>Zero hops are faster than one hop</li> <li>Atomic blocks can't be broken</li> </ol> <p>On (3), the authors note that it's not just hardware faults or reboots that break up atomic blocks -- systems can send kill or pre-emption messages that break up an atomic block. 
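</p> <p>To make the &quot;simple fixes&quot; point above concrete, here's a hypothetical sketch (not actual Hadoop code, and not necessarily how any of these bugs were really fixed) of the &quot;ignore the badly timed message&quot; style of fix for a race like the speculative-task bug earlier: instead of treating an unexpected <code>complete</code> message as an invariant violation and throwing, the handler checks the task's state and drops messages that no longer apply.</p> <pre><code>// stale_message.cc -- hypothetical sketch of an &quot;ignore the badly timed
// message&quot; fix; none of these names come from the real systems.
#include &lt;cstdio&gt;
#include &lt;unordered_map&gt;

enum class TaskState { kRunning, kSucceeded, kKilled };

struct AppMaster {
  std::unordered_map&lt;int, TaskState&gt; attempts;  // attempt id -&gt; state

  void StartAttempt(int attempt_id) { attempts[attempt_id] = TaskState::kRunning; }

  void KillSpeculative(int attempt_id) {
    // Send a kill message (not shown); the kill can lose the race against
    // the attempt's own completion message.
    attempts[attempt_id] = TaskState::kKilled;
  }

  void OnTaskComplete(int attempt_id) {
    auto it = attempts.find(attempt_id);
    if (it == attempts.end() || it-&gt;second != TaskState::kRunning) {
      // A completion that raced with a kill (or refers to an attempt we
      // don't know about) is dropped instead of raising an exception.
      std::printf(&quot;ignoring stale completion for attempt %d\n&quot;, attempt_id);
      return;
    }
    it-&gt;second = TaskState::kSucceeded;
  }
};

int main() {
  AppMaster am;
  am.StartAttempt(3);
  am.KillSpeculative(3);  // we decided to kill the speculative attempt...
  am.OnTaskComplete(3);   // ...but its completion arrives anyway: dropped
}
</code></pre> <p>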
A fallacy that I've commonly seen in post-mortems but that isn't listed here goes something like &quot;<a href="https://en.wikipedia.org/wiki/Byzantine_fault_tolerance">bad nodes are obviously bad</a>&quot;. A classic example of this is when a system starts &quot;handling&quot; queries by dropping them quickly, causing a load balancer to shift traffic to the bad node because it's handling traffic so quickly.</p> <p>One of my favorite bugs in this class from an actual system was in a ring-based storage system where nodes could do health checks on their neighbors and declare that their neighbors should be dropped if the health check fails. One node went bad, dropped all of its storage, and started reporting its neighbors as bad nodes. Its neighbors noticed that the bad node was bad, but because the bad node had dropped all of its storage, it was super fast and was able to report its good neighbors before the good neighbors could report the bad node. After ejecting its immediate neighbors, the bad node got new neighbors and raced the new neighbors, winning again for the same reason. This was repeated until the entire cluster died.</p> <h3 id="tools-1">Tools</h3> <h4 id="mace">Mace</h4> <p>A set of language extensions (on C++) that helps you build distributed systems. Mace has a model checker that can check all possible event orderings of messages, interleaved with crashes, reboots, and timeouts. The Mace model checker is actually available, but AFAICT it requires using the Mace framework, and most distributed systems aren't written in Mace.</p> <h4 id="modist">Modist</h4> <p>Another model checker that checks different orderings. Runs only one interleaving of independent actions (partial order reduction) to avoid checking redundant states. Also interleaves timeouts. Unlike Mace, doesn't inject reboots. Doesn't appear to be available.</p> <h4 id="demeter">Demeter</h4> <p>Like Modist, in that it's a model checker that injects the same types of faults. Uses a different technique to reduce the state space, which I don't know how to summarize succinctly. See paper for details. Doesn't appear to be available. Googling for Demeter returns some software used to model X-ray absorption?</p> <h4 id="samc">SAMC</h4> <p>Another model checker. Can inject multiple crashes and reboots. Uses some understanding of the system to avoid redundant re-orderings (e.g., if a series of messages is invariant to when a reboot is injected, the system tries to avoid injecting the reboot between each message). Doesn't appear to be available.</p> <h4 id="jepsen">Jepsen</h4> <p>As was the case for non-distributed concurrency bugs, there's a vast literature on academic tools, most of which appear to be grad-student code that hasn't been made available.</p> <p>And of course there's Jepsen, which doesn't have any attached academic papers, but has probably had more real-world impact than any of the other tools because <a href="https://github.com/aphyr/jepsen">it's actually available and maintained</a>. There's also <a href="https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey">chaos monkey</a>, but if I'm understanding it correctly, unlike the other tools listed, it doesn't attempt to create reproducible failures.</p> <h3 id="conclusion">Conclusion</h3> <p>Is this where you're supposed to have a conclusion? I don't have a conclusion. We've looked at some literature and found out some information about bugs that's interesting, but not necessarily actionable. We've read about tools that are interesting, but not actually available. 
And then there are some tools based on old techniques that are available and useful.</p> <p>For example, the idea inside clang's TSan, using &quot;happens-before&quot; to find data races, goes back ages. There's <a href="http://www.cs.columbia.edu/~junfeng/10fa-e6998/papers/hybrid.pdf">a 2003 paper</a> that discusses &quot;combining two previously known race detection techniques -- lockset-based detection and happens-before-based detection -- to obtain fewer false positives than lockset-based detection alone&quot;. That's actually what TSan v1 did, but with TSan v2 they realized the tool would be more impactful if they only used happens-before because that avoids false positives, which means that people will actually use the tool. That's not something that's likely to turn into a paper that gets cited zillions of times, though. For anyone who's looked at how <a href="http://lcamtuf.coredump.cx/afl/README.txt">afl</a> works, this story should sound familiar. AFL is eminently practical and has had a very large impact in the real world, mostly by eschewing fancy techniques from the recent literature.</p> <p>If you must have a conclusion, maybe the conclusion is that individuals like Kyle Kingsbury or Michal Zalewski have had an outsized impact on industry, and that you too can probably pick an underserved area in testing and have a curiously large impact on an entire industry.</p> <h4 id="unrelated-miscellania">Unrelated miscellania</h4> <p><a href="https://github.com/rose">Rose Ames</a> asked me to tell more &quot;big company&quot; stories, so here's a set of stories that explains why I haven't put a blog post up for a while. The proximal cause is that my VP has been getting negative comments about my writing. But the reasons for that are a bit of a long story. Part of it is the usual thing, where the comments I receive personally skew very heavily positive, but the comments my manager gets run the other way because it's weird to email someone's manager because you like their writing, but you might send an email if their writing really strikes a nerve.</p> <p>That explains why someone in my management chain was getting emailed about my writing, but it doesn't explain why the emails went to my VP. That's because I switched teams a few months ago, and the org that I was going to switch into overhired and didn't have any headcount. I've heard conflicting numbers about how much they overhired, from 10 or 20 people to 10% or 20% (the org is quite large, and 10% would be much more than 20 people), as well as conflicting stories about why it happened (honest mistake vs. some group realizing that there was a hiring crunch coming and hiring as much as possible to take all of the reqs from the rest of the org). Anyway, for some reason, the org I would have worked in hired more than it was allowed to by at least one person and instituted a hiring freeze. Since my new manager couldn't hire me into that org, he transferred into an org that had spare headcount and hired me into the new org. The new org happens to be a sales org, which means that I technically work in sales now; this has some impact on my day-to-day life since there are some resources and tech talks that are only accessible by people in product groups, but that's another story. 
Anyway, for reasons that I don't fully understand, I got hired into the org before my new manager, and during the months it took for the org chart to get updated I was shown as being parked under my VP, which meant that anyone who wanted to fire off an email to my manager would look me up in the directory and accidentally email my VP instead.</p> <p>It didn't seem like any individual email was a big deal, but since I don't have much interaction with my VP and I don't want to only be known as that guy who writes stuff which generates pushback from inside the company, I paused blogging for a while. I don't exactly want to be known that way to my manager either, but I interact with my manager frequently enough that at least I won't only be known for that.</p> <p>I also wonder if these emails to my manager/VP are more likely at my current employer than at previous employers. I've never had this happen (that I know of) at another employer, but the total number of times it's happened here is low enough that it might just be coincidence.</p> <p>Then again, I was just reading the archives of a really insightful internal blog and ran across a note that mentioned that the series of blog posts was being published internally because the author got static from <a href="https://news.ycombinator.com/item?id=11761437">Sinofsky</a> about publishing posts that contradicted the party line, which eventually resulted in the author agreeing to email Sinofsky comments related to anything under Sinofsky's purview instead of publishing the comments publicly. But now that Sinofsky has moved on, the author wanted to share emails that would have otherwise been posts internally.</p> <p>That kind of thing doesn't seem to be a freak occurrence around here. At the same time I saw that thing about Sinofsky, I ran across a discussion on whether or not a PM was within their rights to tell someone to take down a negative review from the app store. Apparently, a PM found out that someone had written a negative rating on the PM's product in some app store and emailed the rater, telling them that they had to take the review down. It's not clear how the PM found out that the rater worked for us (do they search the internal directory for every negative rating they find?), but they somehow found out and then issued their demand. Most people thought that the PM was out of line, but there were a non-zero number of people (in addition to the PM) who thought that employees should not say anything that could be construed as negative about the company in public.</p> <p>I feel like I see more of this kind of thing now than I have at other companies, but the company's really too big to tell if anyone's personal experience generalizes. Anyway, I'll probably start blogging again now that the org chart shows that I report to my actual manager, and maybe my manager will get some emails about that. Or maybe not.</p> <p><em>Thanks to Leah Hanson, David Turner, Justin Mason, Joe Wilder, Matt Dziubinski, Alex Blewitt, Bruno Kim Medeiros Cesar, Luke Gilliam, Ben Karas, Julia Evans, Michael Ernst, and Stephen Tu for comments/corrections.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">If you're going to debug bugs. I know some folks at startups who give up on bugs that look like they'll take more than a few hours to debug because their todo list is long enough that they can't afford the time. That might be the right decision given the tradeoffs they have, but it's not the right decision for everyone. 
<a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:P">Funny thing about US patent law: you owe treble damages for willfully infringing on a patent. A direct effect of this is that two out of three of my full-time employers have very strongly recommended that I don't read patents, so I avoid reading patents that aren't obviously frivolous. And by frivolous, I don't mean patents for obvious things that any programmer might independently discover, because patents like that are often upheld as valid. I mean patents for things like <a href="https://www.google.com/patents/US6368227">how to swing on a swing</a>. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:I">I get the incentives that lead to this, and I don't begrudge researchers for pursuing career success by responding to those incentives, but as <a href="http://sei.pku.edu.cn/~yaoguo/PhDReading07/parnas-review.pdf">a lowly practitioner</a>, it sure would be nice if the incentives were different. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> </ol> </div> Some programming blogs to consider reading programming-blogs/ Mon, 18 Apr 2016 00:06:34 -0700 programming-blogs/ <p>This is one of those “N technical things every programmer must read” lists, except that “programmer” is way too broad a term and the styles of writing people find helpful for them are too different for any such list to contain a non-zero number of items (if you want the entire list to be helpful to everyone). So here's a list of some things you might want to read, and why you might (or might not) want to read them.</p> <h4 id="aleksey-shipilev-https-shipilev-net"><a href="https://shipilev.net/">Aleksey Shipilev</a></h4> <p>If you want to understand how the JVM really works, this is one of the best resources on the internet.</p> <h4 id="bruce-dawson-https-randomascii-wordpress-com"><a href="https://randomascii.wordpress.com/">Bruce Dawson</a></h4> <p>Performance explorations of a Windows programmer. Often implicitly has nice demonstrations of tooling that has no publicly available peer on Linux.</p> <h4 id="chip-huyen-https-huyenchip-com-blog"><a href="https://huyenchip.com/blog/">Chip Huyen</a></h4> <p>A mix of summaries of ML conferences, data analyses (e.g., <a href="https://huyenchip.com/2019/08/21/glassdoor-interview-reviews-tech-hiring-cultures.html">on interview data posted to glassdoor</a> or <a href="https://huyenchip.com/2020/01/18/tech-workers-19k-compensation-details.html">compensation data posted to levels.fyi</a>), and generaly commentary on the industry.</p> <p>One of the rare blogs that has data-driven position pieces about the industry.</p> <h4 id="chris-fenton-http-www-chrisfenton-com"><a href="http://www.chrisfenton.com/">Chris Fenton</a></h4> <p>Computer related projects, by which I mean things like <a href="http://www.chrisfenton.com/homebrew-cray-1a/">reconstructing the Cray-1A</a> and <a href="http://www.chrisfenton.com/the-numbotron/">building mechanical computers</a>. 
Rarely updated, presumably due to the amount of work that goes into the creations, but almost always interesting.</p> <p>The blog posts tend to be high-level, more like pitch decks than design docs, but there's often source code available if you want more detail.</p> <h4 id="cindy-sridharan-https-twitter-com-copyconstruct"><a href="https://twitter.com/copyconstruct">Cindy Sridharan</a></h4> <p>More active on Twitter than on her blog, but has posts that review papers as well as some on &quot;big&quot; topics, like <a href="https://medium.com/@copyconstruct/distributed-tracing-weve-been-doing-it-wrong-39fc92a857df">distributed tracing</a> and <a href="https://medium.com/@copyconstruct/testing-in-production-the-hard-parts-3f06cefaf592">testing in production</a>.</p> <h4 id="dan-mckinley-http-mcfunley-com"><a href="http://mcfunley.com/">Dan McKinley</a></h4> <p>A lot of great material on how engineering companies should be run. He has a lot of ideas that sound like common sense, e.g., <a href="http://mcfunley.com/choose-boring-technology">choose boring technology</a>, until you realize that it's actually uncommon to find opinions that are so sensible.</p> <p>Mostly distilled wisdom (as opposed to, say, detailed explanations of code).</p> <h4 id="eli-bendersky-http-eli-thegreenplace-net-archives-all"><a href="http://eli.thegreenplace.net/archives/all">Eli Bendersky</a></h4> <p>I think of this as “<a href="http://eli.thegreenplace.net/tag/c-c">the C++ blog</a>”, but it's much wider ranging than that. It's too wide ranging for me to sum up, but if I had to commit to a description I might say that it's a collection of deep dives into various topics, often (but not always) relatively low-level, along with short blurbs about books, often (but not always) technical.</p> <p>The book reviews tend to be easy reading, but the programming blog posts are often a mix of code and exposition that really demands your attention; usually not a light read.</p> <h4 id="erik-sink-https-ericsink-com-index-html"><a href="https://ericsink.com/index.html">Erik Sink</a></h4> <p>I think Erik has been <a href="https://twitter.com/danluu/status/1340059909090549761">the most consistently insightful writer about tech culture over the past 20 years</a>. If you look at people who were blogging back when he started blogging, much of <a href="https://twitter.com/danluu/status/1336499111789436931">Steve Yegge's</a> writing holds up as well as Erik's, but Steve hasn't continued writing consistently.</p> <p>If you look at popular writers from that era, I think they generally tend to <a href="https://twitter.com/danluu/status/784929052519837696">not</a> <a href="https://web.archive.org/web/20210912095555/https://twitter.com/rakyll/status/1413996396814823431">really</a> <a href="https://twitter.com/danluu/status/1172593136503160832">hold</a> up very well.</p> <h4 id="fabian-giesen-https-fgiesen-wordpress-com"><a href="https://fgiesen.wordpress.com/">Fabian Giesen</a></h4> <p>Covers a wide variety of technical topics. 
Emphasis on computer architecture, compression, graphics, and signal processing, but you'll find many other topics as well.</p> <p>Posts tend towards being technically intense and not light reading, and they usually explain concepts or ideas (as opposed to taking sides and writing opinion pieces).</p> <h4 id="fabien-sanglard-http-fabiensanglard-net"><a href="http://fabiensanglard.net/">Fabien Sanglard</a></h4> <p>In-depth technical dives on game-related topics, such as <a href="http://fabiensanglard.net/doomIphone/doomClassicRenderer.php">this readthrough of the Doom source code</a>, <a href="http://fabiensanglard.net/cuda/index.html">this history of Nvidia GPU architecture</a>, or <a href="http://fabiensanglard.net/rayTracing_back_of_business_card/index.php">this read of a business card raytracer</a>.</p> <h4 id="fabrice-bellard-http-bellard-org"><a href="http://bellard.org/">Fabrice Bellard</a></h4> <p>Not exactly a blog, but every time a new project appears on the front page, it's amazing. Some examples are QEMU, FFMPEG, a 4G LTE base station that runs on a PC, <a href="http://bellard.org/jslinux/">a JavaScript PC emulator that can boot Linux</a>, etc.</p> <h4 id="fred-akalin-https-www-akalin-com"><a href="https://www.akalin.com/">Fred Akalin</a></h4> <p>Explanations of CS-related math topics (with a few that aren't directly CS related).</p> <h4 id="gary-bernhardt-https-twitter-com-garybernhardt"><a href="https://twitter.com/garybernhardt">Gary Bernhardt</a></h4> <p>Another “not exactly a blog”, but it's more informative than most blogs, not to mention more entertaining. This is the best “blog” on <a href="https://twitter.com/garybernhardt/status/676560697262604288">the pervasive brokenness of modern software</a> that I know of.</p> <h4 id="jaana-dogan-https-rakyll-org"><a href="https://rakyll.org/">Jaana Dogan</a></h4> <p><a href="https://rakyll.org/">rakyll.org</a> has posts on Go, some of which are quite in depth, e.g., <a href="https://rakyll.org/generics-proposal/">this set of notes on the Go generics proposal</a>, and <a href="https://medium.com/@rakyll">Jaana's medium blog</a> has some posts on Go as well as posts on various topics in distributed systems.</p> <p>Also, <a href="https://twitter.com/rakyll">Jaana's Twitter</a> has what I think of as &quot;intellectually honest critiques of the industry&quot;, which I think is unusual for critiques of the industry on Twitter. It's more typical to see people scoring points at the expense of nuance or even being vaguely in the vicinity of correctness, which is why I think it's worth calling out these honest critiques.</p> <h4 id="jamie-brandon-https-scattered-thoughts-net"><a href="https://scattered-thoughts.net/">Jamie Brandon</a></h4> <p>I'm so happy that I managed to convince Jamie that, given his preferences, it would make sense to take a crack at blogging full-time to support himself. 
From when <a href="https://twitter.com/danluu/status/1354239729483407360">Jamie started taking donations</a> until today, this blog has been an absolute powerhouse with posts like <a href="https://scattered-thoughts.net/writing/select-wat-from-sql/">this</a> <a href="https://scattered-thoughts.net/writing/against-sql">series</a> on problems with SQL, <a href="https://scattered-thoughts.net/writing/an-opinionated-map-of-incremental-and-streaming-systems/">this</a> <a href="https://scattered-thoughts.net/writing/internal-consistency-in-streaming-systems/">series</a> <a href="https://scattered-thoughts.net/writing/why-query-planning-for-streaming-systems-is-hard">on</a> streaming systems, great work on technical projects like <a href="https://github.com/jamii/dida">dida</a> and <a href="https://github.com/jamii/imp">imp</a>, etc.</p> <p>It remains to be seen whether or not Jamie will be able to convince me to try blogging as a full-time job.</p> <h4 id="janet-davis-http-blogs-whitman-edu-countingfromzero"><a href="http://blogs.whitman.edu/countingfromzero/">Janet Davis</a></h4> <p>This is the story of how a professor moved from Grinnell to Whitman and <a href="http://blogs.whitman.edu/countingfromzero/2015/07/22/how-i-got-here/">started a CS program from scratch</a>. The archives are great reading if you're interested in how organizations form or CS education.</p> <h4 id="jeff-preshing-https-preshing-com"><a href="https://preshing.com/">Jeff Preshing</a></h4> <p>Mostly technical content relating to C++ and Python, but also includes topics that are generally useful for programmers, such as <a href="https://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/">read-modify-write operations</a>, <a href="https://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math/">fixed-point math</a>, and <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">memory models</a>.</p> <h4 id="jessica-kerr-http-blog-jessitron-com"><a href="http://blog.jessitron.com/">Jessica Kerr</a></h4> <p>Jessica is probably better known for <a href="http://jessitron.com/talks.html">her talks</a> than her blog? Her talks are great! My favorite is probably <a href="https://www.youtube.com/watch?v=yhguOt863nw">this talk, which explains different concurrency models in an easy-to-understand way</a>, but <a href="http://blog.jessitron.com/2015/01/systems-thinking-about-wit.html">the blog also has a lot of material I like</a>.</p> <p>As is the case with her talks, the diagrams often take a concept and clarify it, making something that wasn't obvious seem very obvious in retrospect.</p> <h4 id="john-regehr-http-blog-regehr-org"><a href="http://blog.regehr.org/">John Regehr</a></h4> <p>I think of this as the <a href="http://blog.regehr.org/archives/213">“C is harder than you think, even if you think C is really hard” blog</a>, although the blog actually covers a lot more than that. Some commonly covered topics are fuzzing, compiler optimization, and testing in general.</p> <p>Posts tend to be conceptual. 
When there are code examples, they're often pretty easy to read, but there are also examples of bizarro behavior that won't be easy to skim unless you're someone who knows the C standard by heart.</p> <h4 id="juho-snellman-https-www-snellman-net-blog"><a href="https://www.snellman.net/blog/">Juho Snellman</a></h4> <p>A lot of <a href="https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/">posts about networking</a>, generally <a href="https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/">written so that they make sense even with minimal networking background</a>. I wish more people with this kind of knowledge (in depth knowledge of systems, not just networking knowledge in particular) would write up explanations for a general audience. Also has interesting non-networking content, like <a href="https://www.snellman.net/blog/archive/2015-05-11-okcupid-for-voting-the-finnish-election-engines/">this post on Finnish elections</a>.</p> <h4 id="julia-evans-http-jvns-ca"><a href="http://jvns.ca/">Julia Evans</a></h4> <p>AFAICT, the theme is “things Julia has learned recently”, which can be anything from <a href="http://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/">Huffman coding</a> to <a href="http://jvns.ca/blog/2014/10/22/working-remote-8-months-in/">how to be happy when working in a remote job</a>. When the posts are on a topic I don't already know, I learn something new. When they're on a topic I know, they remind me that the topic is exciting and contains a lot of wonder and mystery.</p> <p>Many posts have more questions than answers, and are more of a live-blogged exploration of a topic than an explanation of the topic.</p> <h4 id="karla-burnett-https-karla-io"><a href="https://karla.io/">Karla Burnett</a></h4> <p>A mix of security-related topics and explanations of practical programming knowledge. <a href="https://karla.io/files/ichthyology-wp.pdf">This article on phishing</a>, which includes a set of fun case studies on how effective phishing can be, even after people take anti-phishing training, is an example of a security post. <a href="https://karla.io/2016/06/13/dont-panic.html">This post on printing out text via tracert</a> is another. This <a href="https://karla.io/2016/04/30/ssh-for-fun-and-profit.html">post on writing an SSH client</a> and <a href="https://karla.io/2015/06/28/brain-teaser-2.html">this post</a> on <a href="https://karla.io/2015/06/28/brain-teaser-2.html">some coreutils puzzles</a> are examples of practical programming explanations.</p> <p>Although the blog is security oriented, posts are written for a general audience and don't assume specific expertise in security.</p> <h4 id="kate-murphy-https-kate-io"><a href="https://kate.io/">Kate Murphy</a></h4> <p>Mostly small, self-contained explorations like <a href="https://kate.io/blog/2017/08/22/weird-python-integers/">what's up with this Python integer behavior</a>, <a href="https://kate.io/blog/making-your-own-exploding-git-repos/">how do you make git blow up with a simple repo</a>, or <a href="https://kate.io/blog/simple-hash-collisions-in-lua/">how do you generate hash collisions in Lua</a>?</p> <h4 id="kavya-joshi">Kavya Joshi</h4> <p>I generally prefer technical explanations in text over video, but her exposition is so clear that I'm putting these talks in this list of blogs. 
Some examples include <a href="https://www.youtube.com/watch?v=5erqWdlhQLA">an explanation of the go race detector</a>, <a href="https://www.youtube.com/watch?v=Ivss-VtmbDY">simple math that's handy for performance modeling</a>, and <a href="https://www.youtube.com/watch?v=BRvj8PykSc4">time</a>.</p> <h4 id="kyle-kingsbury-https-aphyr-com"><a href="https://aphyr.com/">Kyle Kingsbury</a></h4> <p>90% of Kyle's posts are explanations of <a href="https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads">distributed systems testing, which expose bugs in real systems that most of us rely on</a>. The other 10% are <a href="https://aphyr.com/posts/319-clojure-from-the-ground-up-debugging">musings on programming that are as rigorous as Kyle's posts on distributed systems</a>. Possibly the most educational programming blog of all time.</p> <p>For those of us without a distributed systems background, understanding posts often requires a bit of Googling, despite the extensive explanations in the posts. Most new posts are now at <a href="http://jepsen.io/">jepsen.io</a>.</p> <h4 id="laura-lindzey-https-lindzey-github-io-index-html"><a href="https://lindzey.github.io/index.html">Laura Lindzey</a></h4> <p>Very infrequently updated (on the order of once a year) with explanations of things Laura has been working on, from <a href="https://lindzey.github.io/blog/2019/11/26/origami-butterfly/">Origami PCB</a> to <a href="https://lindzey.github.io/blog/2015/07/27/a-brief-introduction-to-ice-penetrating-radar/">Ice-Penetrating Radar</a>.</p> <h4 id="laurie-tratt-https-tratt-net-laurie-blog"><a href="https://tratt.net/laurie/blog/">Laurie Tratt</a></h4> <p>This blog has been going since 2004 and it's changed over the years. Recently, it's had some of the best posts on benchmarking around:</p> <ul> <li><a href="https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html">VM performance, part 1</a> <ul> <li>Thoroughly refutes the idea that you can run a language VM for some warmup period and then take some numbers when they become stable</li> </ul></li> <li><a href="https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_2.html">VM performance, part 2</a></li> <li><a href="https://tratt.net/laurie/blog/entries/minimum_times_tend_to_mislead_when_benchmarking.html">Why not use minimum times when benchmarking</a> <ul> <li>&quot;Everyone&quot; who's serious about performance knows this and it's generally considered too obvious to write up, but this is still a widely used technique in benchmarking even though it's only appropriate in limited circumstances</li> </ul></li> </ul> <p>The blog isn't purely technical; <a href="https://tratt.net/laurie/blog/entries/alternative_sources_of_advice.html">this blog post on advice is also stellar</a>. If those posts don't sound interesting to you, it's worth <a href="https://tratt.net/laurie/blog/archive.html">checking out the archives</a> to see if some of the topics Laurie used to write about more frequently are to your taste.</p> <h4 id="marc-brooker-https-brooker-co-za-blog"><a href="https://brooker.co.za/blog/">Marc Brooker</a></h4> <p>A mix of <a href="https://brooker.co.za/blog/2012/01/17/two-random.html">theory</a> and <a href="https://brooker.co.za/blog/2016/01/03/correlation.html">wisdom</a> from a distributed systems engineer on EBS at Amazon. 
The theory posts tend to be relatively short and easy to swallow; not at all intimidating, as theory sometimes is.</p> <h4 id="marek-majkowski-https-idea-popcount-org"><a href="https://idea.popcount.org/">Marek Majkowski</a></h4> <p>This used to be a blog about random experiments Marek was doing, <a href="https://idea.popcount.org/2013-01-30-bitsliced-siphash/">like this post on bitsliced SipHash</a>. Since Marek joined Cloudflare, this has turned into a list of things Marek has learned while working in Cloudflare's networking stack, like <a href="https://blog.cloudflare.com/the-curious-case-of-slow-downloads/">this story about debugging slow downloads</a>.</p> <p>Posts tend to be relatively short, but with enough technical specifics that they're not light reads.</p> <h4 id="nicole-express-https-nicole-express"><a href="https://nicole.express/">Nicole Express</a></h4> <p>Explorations of old systems, often gaming related. Some examples are <a href="https://nicole.express/2021/alf-2-alf-harder.html">this post on collision detection in Alf for the Sega Master System</a>, <a href="https://nicole.express/2021/composite-conflict-completed.html">this post on getting decent quality output from composite video</a>, and <a href="https://nicole.express/2020/neo-geo-cd-zee.html">this post on the Neo Geo CDZ</a>.</p> <h4 id="nikita-prokopov-https-tonsky-me"><a href="https://tonsky.me/">Nikita Prokopov</a></h4> <p>Nikita has two blogs, both on related topics. <a href="https://tonsky.me/">The main blog</a> has long-form articles, often about how modern software is terrible. Then there's <a href="https://grumpy.website/">grumpy.website</a>, which gives examples of software being terrible.</p> <h4 id="nitsan-wakart-http-psy-lob-saw-blogspot-com"><a href="http://psy-lob-saw.blogspot.com/">Nitsan Wakart</a></h4> <p>More than you ever wanted to know about writing fast code for the JVM, from <a href="http://psy-lob-saw.blogspot.com/2016/03/gc-nepotism-and-linked-queues.html">how GC affects data structures</a> to <a href="http://psy-lob-saw.blogspot.com/2014/08/the-volatile-read-suprise.html">the subtleties of volatile reads</a>.</p> <p>Posts tend to involve lots of Java code, but the takeaways are often language agnostic.</p> <h4 id="oona-raisanen-http-www-windytan-com"><a href="http://www.windytan.com/">Oona Raisanen</a></h4> <p>Adventures in signal processing. Everything from <a href="http://www.windytan.com/2016/02/barcode-recovery-using-priori.html">deblurring barcodes</a> to figuring out <a href="http://www.windytan.com/2014/02/mystery-signal-from-helicopter.html">what those signals from helicopters mean</a>. 
If I'd known that signals and systems could be this interesting, I would have paid more attention in class.</p> <h4 id="paul-khuong-http-www-pvk-ca"><a href="http://www.pvk.ca/">Paul Khuong</a></h4> <p><a href="http://www.pvk.ca/Blog/2013/11/22/the-weaknesses-of-sbcls-type-propagation/">Some content on Lisp</a>, and <a href="http://www.pvk.ca/Blog/2014/04/13/k-ary-heapsort/">some on low-level optimizations</a>, with <a href="http://www.pvk.ca/Blog/2015/06/27/linear-log-bucketing-fast-versatile-simple/">a trend towards low-level optimizations</a>.</p> <p>Posts are usually relatively long and self-contained explanations of technical ideas with very little fluff.</p> <h4 id="rachel-kroll-http-rachelbythebay-com-w"><a href="http://rachelbythebay.com/w/">Rachel Kroll</a></h4> <p>Years of <a href="http://rachelbythebay.com/w/2015/09/07/noleap/">debugging stories</a> from a long-time SRE, along with stories about <a href="https://rachelbythebay.com/w/2012/02/29/derailing/">big company nonsense</a>. Many of the stories come from Lyft, Facebook, and Google. They're anonymized, but if you know about the companies, you can tell which ones are which.</p> <p>The degree of anonymization often means that the stories won't really make sense unless you're familiar with the operation of systems similar to the ones in the stories.</p> <h4 id="sophie-haskins-http-blog-pizzabox-computer"><a href="http://blog.pizzabox.computer/">Sophie Haskins</a></h4> <p>A blog about restoring old &quot;pizza box&quot; computers, with posts that generally describe the work that goes into getting these machines working again.</p> <p>An example is the HP 712 (a &quot;low cost&quot; PA-RISC workstation that went for roughly $5k to $15k in 1994 dollars and ended up doomed due to the Intel workstation onslaught that started with the Pentium Pro in 1995), <a href="https://blog.pizzabox.computer/posts/hp712-nextstep-part-1/">where the restoration process is described here in part 1</a> and <a href="https://blog.pizzabox.computer/posts/hp712-nextstep-part-2/">then here in part 2</a>.</p> <h4 id="vyacheslav-egorov-http-mrale-ph"><a href="http://mrale.ph/">Vyacheslav Egorov</a></h4> <p><a href="http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html">In-depth explanations</a> on how V8 works and <a href="http://mrale.ph/blog/2014/07/30/constructor-vs-objectcreate.html">how various constructs get optimized</a> by a compiler dev on the V8 team. If I'd known compilers were this interesting, I would have taken a compilers class back when I was in college.</p> <p>Often takes topics that are considered hard and explains them in a way that makes them seem easy. Lots of diagrams, where appropriate, and detailed exposition on all the tricky bits.</p> <h4 id="whitequark-https-whitequark-org"><a href="https://whitequark.org/">whitequark</a></h4> <p>Her main site links to a variety of interesting tools she's made or worked on, many of which are FPGA or open hardware related, but some of which are completely different. 
<a href="https://lab.whitequark.org/">Whitequark's lab notebook</a> has a really wide variety of different results, from things like undocumented hardware quirks, to fairly serious home chemistry experiments, to various tidbits about programming and hardware development (usually low level, but not always).</p> <p>She's also fairly active <a href="https://twitter.com/whitequark/">on twitter</a>, with some commentary on hardware/firmware/low-level programming combined with a set of diverse topics that's too broad to easily summarize.</p> <h4 id="yossi-kreinin-http-yosefk-com-blog"><a href="http://yosefk.com/blog/">Yossi Kreinin</a></h4> <p>Mostly dormant since <a href="http://yosefk.com/blog/a-better-future-animated-post.html">the author started doing art</a>, but the archives have a lot of great content about hardware, low-level software, and general programming-related topics that aren't strictly programming.</p> <p>90% of the time, when I get the desire to write a post about a common misconception software folks have about hardware, <a href="http://yosefk.com/blog/a-better-future-animated-post.html">Yossi has already written the post</a> and <a href="http://yosefk.com/blog/high-level-cpu-follow-up.html">taken a lot of flak for it</a> so I don't have to :-).</p> <p>I also really like Yossi's career advice, like <a href="http://yosefk.com/blog/do-call-yourself-a-programmer-and-other-career-advice.html">this response to Patrick McKenzie</a> and <a href="http://yosefk.com/blog/people-can-read-their-managers-mind.html">this post on how managers get what they want and not what they ask for</a>.</p> <p>He's <a href="https://twitter.com/yossikreinin">active on Twitter</a>, where he posts extremely cynical and snarky takes on management and the industry.</p> <h4 id="this-blog">This blog?</h4> <p>Common themes include:</p> <ul> <li><a href="//danluu.com/file-consistency/">This thing that's often considered easy is harder than you might think</a></li> <li><a href="//danluu.com/edit-binary/">This thing that's often considered hard is easier than you might think</a></li> <li><a href="//danluu.com/dunning-kruger/">This obvious fact</a> <a href="//danluu.com/tech-discrimination/">is not obvious</a> <a href="//danluu.com/integer-overflow/">at all</a></li> <li><a href="//danluu.com/postmortem-lessons/">Humans are human</a>: humans make mistakes, <a href="//danluu.com/google-sre-book/">and systems must be designed to account for that</a></li> <li><a href="//danluu.com/postmortem-lessons/">Computers will have faults</a>, <a href="//danluu.com/limplock/">and systems must be designed to account for that</a></li> <li><a href="//danluu.com/broken-builds/">Is it just me</a>, <a href="//danluu.com/everything-is-broken/">or is stuff really broken</a>?</li> <li><a href="//danluu.com/infinite-disk/">Hey</a>! <a href="//danluu.com/intel-cat/">Look at this set of papers</a>! <a href="//danluu.com/clwb-pcommit/">Let's talk about their context and why they're interesting</a></li> </ul> <h4 id="the-end">The end</h4> <p>This list also doesn't include blogs that mostly aren't about programming, so it doesn't include, for example, <a href="http://www.benkuhn.net/">Ben Kuhn's excellent blog</a>.</p> <p>Anyway, that's all for now, but this list is pretty much off the top of my head, so I'll add more as more blogs come to mind. I'll also keep this list updated with what I'm reading as I find new blogs. 
Please please please <a href="https://twitter.com/danluu">suggest other blogs I might like</a>, and don't assume that I already know about a blog because it's popular. Just for example, I had no idea who either Jeff Atwood or Zed Shaw were until a few years ago, and they were probably two of the most well known programming bloggers in existence. Even with centralized link aggregators like HN and reddit, blog discovery has become haphazard and random with the decline of blogrolls and blogging as a dialogue, as opposed to the current practice of blogging as a monologue. Also, please don't assume that I don't want to read something just because it's different from the kind of blog I normally read. I'd love to read more from UX or front-end folks; I just don't know where to find that kind of thing!</p> <p><em>Last update: 2021-07</em></p> <h4 id="archive">Archive</h4> <p>Here are some blogs I've put into an archive section because they rarely or never update.</p> <h4 id="alex-clemmer-http-blog-nullspace-io"><a href="http://blog.nullspace.io/">Alex Clemmer</a></h4> <p><a href="http://blog.nullspace.io/building-search-engines.html">This post on why making a competitor to Google search is harder than it might seem is a post in classic Alex Clemmer style</a>. The post looks at a position that's commonly believed (web search isn't all that hard and someone should come up with a better Google) and explains why that's not an obviously correct position. That's also a common theme of his comments elsewhere, such as these comments on <a href="https://news.ycombinator.com/item?id=9598224">stack ranking at MS</a>, <a href="https://news.ycombinator.com/item?id=10416015">implementing POSIX on Windows</a>, <a href="https://news.ycombinator.com/item?id=10229434">the size of the Windows codebase</a>, <a href="https://news.ycombinator.com/item?id=8867387">Bond</a>, and <a href="https://news.ycombinator.com/item?id=9735791">Bing</a>.</p> <p>He's sort of a modern mini-MSFT, in that it's incisive commentary on MS and MS related ventures.</p> <h4 id="allison-kaptur-http-akaptur-com"><a href="http://akaptur.com/">Allison Kaptur</a></h4> <p>Explorations of various areas, often Python related, such as <a href="http://akaptur.com/blog/2013/11/15/introduction-to-the-python-interpreter/">this series on the Python interpreter</a> and <a href="http://akaptur.com/blog/2014/08/02/the-cpython-peephole-optimizer-and-you/">this series on the CPython peephole optimizer</a>. Also, thoughts on broader topics like <a href="http://akaptur.com/blog/2013/07/24/systematic-debugging/">debugging</a> and <a href="http://akaptur.com/blog/2015/10/10/effective-learning-strategies-for-programmers/">learning</a>.</p> <p>Often detailed, with inline code that's meant to be read and understood (with the help of exposition that's generally quite clear).</p> <h4 id="david-dalrymple-http-davidad-github-io"><a href="http://davidad.github.io/">David Dalrymple</a></h4> <p>A mix of things from <a href="http://davidad.github.io/blog/2014/02/18/kernel-from-scratch/">writing a 64-bit kernel from scratch shortly after learning assembly</a> to a <a href="http://davidad.github.io/blog/2014/03/12/the-operating-system-is-out-of-date/">high-level overview of computer systems</a>. Rarely updated, with few posts, but each post has a lot to think about.</p> <h4 id="epita-systems-lab-https-blog-lse-epita-fr"><a href="https://blog.lse.epita.fr/">EPITA Systems Lab</a></h4> <p>Low-level.
A good example of a relatively high-level post from this blog is <a href="https://blog.lse.epita.fr/articles/74-getting-back-determinism-in-the-lfh.html">this post on the low fragmentation heap in Windows</a>. Posts like <a href="https://blog.lse.epita.fr/articles/75-sstpinball.html">how to hack a pinball machine</a> and <a href="https://blog.lse.epita.fr/articles/77-lsepc-intro.html">how to design a 386 compatible dev board</a> are typical.</p> <p>Posts are often quite detailed, with schematic/circuit diagrams. This is relatively heavy reading and I try to have pen and paper handy when I'm reading this blog.</p> <h4 id="greg-wilson-http-neverworkintheory-org"><a href="http://neverworkintheory.org/">Greg Wilson</a></h4> <p>Write-ups of papers that (should) have an impact on how people write software, like <a href="http://neverworkintheory.org/2014/10/08/simple-testing-can-prevent-most-critical-failures.html">this paper on what causes failures in distributed systems</a> or <a href="http://neverworkintheory.org/2015/08/23/perception-productivity.html">this paper on what makes people feel productive</a>. Not updated much, but <a href="http://third-bit.com/index.html">Greg still blogs on his personal site</a>.</p> <p>The posts tend to be extended abstracts that tease you into reading the paper, rather than detailed explanations of the methodology and results.</p> <h4 id="gustavo-duarte-http-duartes-org-gustavo-blog"><a href="http://duartes.org/gustavo/blog/">Gustavo Duarte</a></h4> <p>Explanations of how <a href="http://duartes.org/gustavo/blog/post/what-does-an-idle-cpu-do/">Linux works, as well as other low-level topics</a>. This particular blog seems to be on hiatus, but &quot;0xAX&quot; seems to have picked up the slack with the <a href="https://0xax.gitbooks.io/linux-insides/content/index.html">linux-insides</a> project.</p> <p>If you've read Love's book on Linux, Duarte's explanations are similar, but tend to be more about the idea and less about the implementation. They're also heavier on providing diagrams and context. &quot;0xAX&quot; is a lot more focused on walking through the code than either Love or Duarte.</p> <h4 id="huon-wilson-http-huonw-github-io"><a href="http://huonw.github.io/">Huon Wilson</a></h4> <p>Explanations of various Rust-y things, from back when Huon was working on Rust. Not updated much anymore, but the content is still great for someone who's interested in technical tidbits related to Rust.</p> <h4 id="kamal-marhubi-http-kamalmarhubi-com"><a href="http://kamalmarhubi.com/">Kamal Marhubi</a></h4> <p>Technical explorations of various topics, with a systems-y bent. <a href="http://kamalmarhubi.com/blog/2015/11/17/kubernetes-from-the-ground-up-the-scheduler/">Kubernetes</a>. <a href="http://kamalmarhubi.com/blog/2015/11/21/using-strace-to-figure-out-how-git-push-over-ssh-works/">Git push</a>. <a href="http://kamalmarhubi.com/blog/2016/04/13/rust-nix-easier-unix-systems-programming-3/">Syscalls in Rust</a>. 
Also, <a href="http://kamalmarhubi.com/blog/2015/05/28/the-excellence-uniform-build-and-test-infrastructure/">some musings on programming in general</a>.</p> <p>The technical explorations often get into enough nitty gritty detail that this is something you probably want to sit down to read, as opposed to skim on your phone.</p> <h4 id="mary-rose-cook-http-maryrosecook-com-blog"><a href="http://maryrosecook.com/blog/">Mary Rose Cook</a></h4> <p><a href="http://maryrosecook.com/blog/post/git-from-the-inside-out">Lengthy and very-detailed explanations of technical topics</a>, <a href="http://maryrosecook.com/blog/post/little-lisp-interpreter">mixed in</a> with <a href="http://maryrosecook.com/blog/post/the-fibonacci-heap-ruins-my-life">a wide variety of other posts</a>.</p> <p>The selection of topics is eclectic, and explained at a level of detail such that you'll come away with a solid understanding of the topic. The explanations are usually fine grained enough that it's hard to miss what's going on, even if you're a beginner programmer.</p> <h4 id="rebecca-frankel">Rebecca Frankel</h4> <p>As far as I know, Rebecca doesn't have a programming blog, but if you look at her apparently off-the-cuff comments on other people's posts as a blog, it's one of the best written programming blogs out there. She used to be prolific on <a href="//danluu.com/open-social-networks/">Piaw's</a> <a href="//danluu.com/mit-stanford/">Buzz</a> (and probably elsewhere, although I don't know where), and you occasionally see comments elsewhere, like on <a href="http://steve-yegge.blogspot.com/2008/06/done-and-gets-things-smart.html">this Steve Yegge blog post about brilliant engineers</a><sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>. I wish I could write like that.</p> <h4 id="russell-smith-http-qqrs-github-io"><a href="http://qqrs.github.io/">Russell Smith</a></h4> <p>Homemade electronics projects from <a href="http://qqrs.github.io/blog/2013/05/03/vim-on-a-mechanical-typewriter/">vim on a mechanical typewriter</a> to <a href="http://qqrs.github.io/blog/2015/08/21/proofing-spirits-with-homemade-electrobalance/">building an electrobalance to proof spirits</a>.</p> <p>Posts tend to have a fair bit of detail, down to diagrams explaining parts of circuits, but the posts aren't as detailed as specs. But there are usually links to resources that will teach you enough to reproduce the project, if you want.</p> <h4 id="rwt-http-www-realworldtech-com"><a href="http://www.realworldtech.com/">RWT</a></h4> <p>I find the archives to be fun reading for insight into <a href="http://www.realworldtech.com/alpha-ev8-wider/">what people were thinking about microprocessors and computer architecture</a> over the past two decades. It can be a bit depressing to see that <a href="http://www.realworldtech.com/benchmark-examiner/">the same benchmarking controversies we had 15 years ago</a> are being repeated today, <a href="http://www.realworldtech.com/fair-benchmarks/">sometimes with the same players</a>. 
If anything, I'd say that the <a href="https://twitter.com/shipilev/status/709982588673388544">average benchmark you see passed around today</a> is worse than what you would have seen 15 years ago, even though the industry as a whole has learned a lot about benchmarking since then.</p> <h4 id="walpurgusriot-https-web-archive-org-web-20140118064249-http-walpurgisriot-github-io-blog"><a href="https://web.archive.org/web/20140118064249/http://walpurgisriot.github.io/blog">walpurgisriot</a></h4> <p>The author of walpurgisriot seems to have abandoned the github account and moved on to another user name (and a squatter appears to have picked up her old account name), but this used to be a semi-frequently updated blog with a combination of short explorations on programming and thoughts on the industry. On pure quality of prose, this is one of the best tech blogs I've ever read; the technical content and thoughts on the industry are great as well.</p> <p><em>This post was inspired by the two posts Julia Evans has on blogs she reads and by <a href="https://www.ocf.berkeley.edu/~abhishek/chicmath.htm">the Chicago undergraduate mathematics bibliography</a>, which I've found to be the most useful set of book reviews I've ever encountered.</em></p> <p><em>Thanks to Bartłomiej Filipek, Sean Barrett, Michel Schniz, Neil Henning, and Lindsey Kuper for comments/discussion/corrections.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R"><p>Quote follows below, since I can see from my analytics data that relatively few people click any individual link, and people seem especially unlikely to click a link to read a comment on a blog, even if the comment is great:</p> <blockquote> <p>The key here is &quot;principally,&quot; and that I am describing motivation, not self-evaluation. The question is, what's driving you? What gets you working? If its just trying to show that you're good, then you won't be. It has to be something else too, or it won't get you through the concentrated decade of training it takes to get to that level.</p> <p>Look at the history of the person we're all presuming Steve Yegge is talking about. He graduated (with honors) in 1990 and started at Google in 1999. So he worked a long time before he got to the level of Google's star. When I was at Google I hung out on Sunday afternoons with a similar superstar. Nobody else was reliably there on Sunday; but he always was, so I could count on having someone to talk to. On some Sundays he came to work even when he had unquestionably legitimate reasons for not feeling well, but he still came to work. Why didn't he go home like any normal person would? It wasn't that he was trying to prove himself; he'd done that long ago. What was driving him?</p> <p>The only way I can describe it is one word: fury. What was he doing every Sunday? He was reviewing various APIs that were being proposed as standards by more junior programmers, and he was always finding things wrong with them. What he would talk about, or rather, rage about, on these Sunday afternoons was always about some idiocy or another that someone was trying make standard, and what was wrong with it, how it had to be fixed up, etc, etc. He was always in a high dudgeon over it all.</p> <p>What made him come to work when he was feeling sick and dizzy and nobody, not even Larry and Sergey with their legendary impatience, not even them, I mean nobody would have thought less of him if he had just gone home &amp; gone to sleep?
He seemed to be driven, not by ambition, but by fear that if he stopped paying attention, something idiotically wrong (in his eyes) might get past him, and become the standard, and that was just unbearable, the thought made him so incoherently angry at the sheer wrongness of it, that he had to stay awake and prevent it from happening no matter how legitimately bad he was feeling at the time.</p> <p>It made me think of Paul Graham's comment: &quot;What do I mean by good people? One of the best tricks I learned during our startup was a rule for deciding who to hire. Could you describe the person as an animal?... I mean someone who takes their work a little too seriously; someone who does what they do so well that they pass right through professional and cross over into obsessive.</p> <p>What it means specifically depends on the job: a salesperson who just won't take no for an answer; a hacker who will stay up till 4:00 AM rather than go to bed leaving code with a bug in it; a PR person who will cold-call New York Times reporters on their cell phones; a graphic designer who feels physical pain when something is two millimeters out of place.&quot;</p> <p>I think a corollary of this characterization is that if you really want to be &quot;an animal,&quot; what you have cultivate in yourself is partly ambition, but it is partly also self-knowledge. As Paul Graham says, there are different kinds of animals. The obsessive graphic designer might be unconcerned about an API that is less than it could be, while the programming superstar might pass by, or create, a terrible graphic design without the slightest twinge of misgiving.</p> <p>Therefore, key question is: are you working on the thing you care about most? If its wrong, is it unbearable to you? Nothing but deep seated fury will propel you to the level of a superstar. Getting there hurts too much; mere desire to be good is not enough. If its not in you, its not in you. You have to be propelled by elemental wrath. Nothing less will do.</p> <p>Or it might be in you, but just not in this domain. You have to find what you care about, and not just what you care about, but what you care about violently: you can't fake it.</p> <p>(Also, if you do have it in you, you still have to choose your boss carefully. No matter how good you are, it may not be trivial to find someone you can work for. There's more to say here; but I'll have to leave it for another comment.)</p> <p>Another clarification of my assertion &quot;if you're wondering if you're good, then you're not&quot; should perhaps be said &quot;if you need reassurance from someone else that you're good, then you're not.&quot; One characteristic of these &quot;animals&quot; is that they are such obsessive perfectionists that their own internal standards so far outstrip anything that anyone else could hold them to, that no ordinary person (i.e. ordinary boss) can evaluate them. As Steve Yegge said, they don't go for interviews. They do evaluate each other -- at Google the superstars all reviewed each other's code, reportedly brutally -- but I don't think they cared about the judgments of anyone who wasn't in their circle or at their level.</p> <p>I agree with Steve Yegge's assertion that there are an enormously important (small) group of people who are just on another level, and ordinary smart hardworking people just aren't the same. 
Here's another way to explain why there should be a quantum jump -- perhaps I've been using this discussion to build up this idea: its the difference between people who are still trying to do well on a test administered by someone else, and the people who have found in themselves the ability to grade their own test, more carefully, with more obsessive perfectionism, than anyone else could possibly impose on them.</p> <p>School, for all it teaches, may have one bad lasting effect on people: it gives them the idea that good people get A's on tests, and better ones get A+'s on tests, and the very best get A++'s. Then you get the idea that you go out into the real world, and your boss is kind of super-professor, who takes over the grading of the test. Joel Spolsky is accepting that role, being boss as super-professor, grading his employees tests for them, telling them whether they are good.</p> <p>But the problem is that in the real world, the very most valuable, most effective people aren't the ones who are trying to get A+++'s on the test you give them. The very best people are the ones who can make up their own test with harder problems on it than you could ever think of, and you'd have to have studied for the same ten years they have to be able even to know how to grade their answers.</p> <p>That's a problem, incidentally, with the idea of a meritocracy. School gives you an idea of a ladder of merit that reaches to the top. But it can't reach all the way to the top, because someone has to measure the rungs. At the top you're not just being judged on how high you are on the ladder. You're also being judged on your ability to &quot;grade your own test&quot;; that is to say, your trustworthiness. People start asking whether you will enforce your own standards even if no one is imposing them on you. They have to! because at the top people get given jobs with the kind of responsibility where no one can possibly correct you if you screw up. I'm giving you an image of someone who is working himself sick, literally, trying grade everyone else's work. In the end there is only so much he can do, and he does want to go home and go to bed sometimes. That means he wants people under him who are not merely good, but can be trusted not to need to be graded. Somebody has to watch the watchers, and in the end, the watchers have to watch themselves.</p> </blockquote> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> Google SRE book google-sre-book/ Mon, 11 Apr 2016 01:00:58 -0700 google-sre-book/ <p><a href="https://www.amazon.com/gp/product/149192912X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=149192912X&amp;linkId=2a578b357abc8b995368a039dd517601">The book</a> starts with a story about a time Margaret Hamilton brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys that caused a prelaunch program to run during the simulated mission. 
Hamilton submitted a change request to add error checking code to prevent the error from happening again, but the request was rejected because the error case should never happen.</p> <p>On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve.</p> <p><em>This sounds familiar -- I’ve lost track of the number of dev post-mortems that have the same basic structure.</em></p> <p><em>This is an experiment in note-taking for me in two ways. First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by <a href="http://scattered-thoughts.net/">Jamie Brandon’s notes on books he’s read</a>. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it's not obvious, asides from me are in italics.</em></p> <h3 id="chapter-1-introduction">Chapter 1: Introduction</h3> <p><em>Everything in this chapter is covered in much more detail later.</em></p> <p>Two approaches to hiring people to manage system stability:</p> <h4 id="traditional-approach-sysadmins">Traditional approach: sysadmins</h4> <ul> <li>Assemble existing components and deploy to produce a service</li> <li>Respond to events and updates as they occur</li> <li>Grow team to absorb increased work as service grows</li> <li>Pros <ul> <li>Easy to implement because it’s standard</li> <li>Large talent pool to hire from</li> <li>Lots of available software</li> </ul></li> <li>Cons <ul> <li>Manual intervention for change management and event handling causes size of team to scale with load on system</li> <li>Ops is fundamentally at odds with dev, which can cause pathological resistance to changes, which causes similarly pathological response from devs, which reclassify “launches” as “incremental updates”, “flag flips”, etc.</li> </ul></li> </ul> <h4 id="google-s-approach-sres">Google’s approach: SREs</h4> <ul> <li>Have software engineers do operations</li> <li>Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).</li> <li>Career progress comparable to dev career track</li> <li>Results <ul> <li>SREs would be bored by doing tasks by hand</li> <li>Have the skillset necessary to automate tasks</li> <li>Do the same work as an operations team, but with automation instead of manual labor</li> </ul></li> <li>To avoid manual labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of “ops” work for SREs <ul> <li>Upper bound. Actual amount of ops work is expected to be much lower</li> </ul></li> <li>Pros <ul> <li>Cheaper to scale</li> <li>Circumvents devs/ops split</li> </ul></li> <li>Cons <ul> <li>Hard to hire for</li> <li>May be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)</li> </ul></li> </ul> <p><em>I don’t really understand how this is an example of circumventing the dev/ops split. 
I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.</em></p> <h4 id="tenets-of-sre">Tenets of SRE</h4> <ul> <li>SRE team responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning</li> </ul> <h4 id="ensuring-a-durable-focus-on-engineering">Ensuring a durable focus on engineering</h4> <ul> <li>50% ops cap means that extra ops work is redirected to product teams on overflow</li> <li>Provides feedback mechanism to product teams as well as keeps load down</li> <li>Target max 2 events per 8-12 hour on-call shift</li> <li>Postmortems for all serious incidents, even if they didn’t trigger a page</li> <li>Blameless postmortems</li> </ul> <p><em>2 events per shift is the max, but what’s the average? How many on-call events are expected to get sent from the SRE team to the dev team per week?</em></p> <p><em>How do you get from a blameful postmortem culture to a blameless postmortem culture? Now that everyone knows that you should have blameless postmortems, <a href="//danluu.com/wat/">everyone will claim to do them. Sort of like having good testing and deployment practices</a>. I’ve been lucky to be on an on call rotation that’s never gotten paged, but when I talk to folks who joined recently and are on call, they have not so great stories of finger pointing, trash talk, and blame shifting. The fact that everyone knows you’re supposed to be blameless seems to make it harder to call out blamefulness, not easier.</em></p> <h4 id="move-fast-without-breaking-slo">Move fast without breaking SLO</h4> <ul> <li>Error budget. 100% is the wrong reliability target for basically everything</li> <li>Going from 5 9s to 100% reliability isn’t noticeable to most users and requires tremendous effort</li> <li>Set a goal that acknowledges the trade-off and leaves an error budget</li> <li>Error budget can be spent on anything: launching features, etc.</li> <li>Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors</li> <li>Goal of SRE team isn’t “zero outages” -- SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity</li> </ul> <p><em>It’s not explicitly stated, but for teams that need to “move fast”, consistently coming in way under the error budget could be taken as a sign that the team is spending too much effort on reliability.</em></p> <p><em>I like this idea a lot, but when I discussed this with Jessica Kerr, she pushed back on this idea because maybe you’re just under your error budget because you got lucky and a single really bad event can wipe out your error budget for the next decade. Followup question: how can you be confident enough in your risk model that you can purposefully consume error budget to move faster without worrying that a downstream (in time) bad event will put you overbudget? 
Nat Welch (a former Google SRE) responded to this by saying that you can build confidence through simulated disasters and other testing.</em></p> <h4 id="monitoring">Monitoring</h4> <ul> <li>Monitoring should never require a human to interpret any part of the alerting domain</li> <li>Three valid kinds of monitoring output <ul> <li>Alerts: human needs to take action immediately</li> <li>Tickets: human needs to take action eventually</li> <li>Logging: no action needed</li> <li>Note that, for example, graphs are a type of log</li> </ul></li> </ul> <h4 id="emergency-response">Emergency Response</h4> <ul> <li>Reliability is a function of MTTF (mean-time-to-failure) and MTTR (mean-time-to-recovery)</li> <li>For evaluating responses, we care about MTTR</li> <li>Humans add latency</li> <li>Systems that don’t require humans to respond will have higher availability due to lower MTTR</li> <li>Having a “playbook” produces 3x lower MTTR <ul> <li>Having hero generalists who can respond to everything works, but having playbooks works better</li> </ul></li> </ul> <p><em>I personally agree, but boy do we like our on-call heroes. I wonder how we can foster a culture of documentation.</em></p> <h4 id="change-management">Change management</h4> <ul> <li>70% of outages due to changes in a live system. Mitigation: <ul> <li>Implement progressive rollouts</li> <li>Monitoring</li> <li>Rollback</li> </ul></li> <li>Remove humans from the loop, avoid standard human problems on repetitive tasks</li> </ul> <h4 id="demand-forecasting-and-capacity-planning-http-www-amazon-com-gp-product-0596518579-ref-as-li-tl-ie-utf8-camp-1789-creative-9325-creativeasin-0596518579-linkcode-as2-tag-abroaview-20-linkid-ek2pbwyshyk26giv">Demand forecasting and <a href="http://www.amazon.com/gp/product/0596518579/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0596518579&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=EK2PBWYSHYK26GIV">capacity planning</a></h4> <ul> <li>Straightforward, but a surprising number of teams/services don’t do it</li> </ul> <h4 id="provisioning">Provisioning</h4> <ul> <li>Adding capacity riskier than load shifting, since it often involves spinning up new instances/locations, making significant changes to existing systems (config files, load balancers, etc.)</li> <li>Expensive enough that it should be done only when necessary; must be done quickly <ul> <li>If you don’t know what you actually need and overprovision, that costs money</li> </ul></li> </ul> <h4 id="efficiency-and-performance">Efficiency and performance</h4> <ul> <li>Load slows down systems</li> <li>SREs provision to meet capacity target with a specific response time goal</li> <li>Efficiency == money</li> </ul> <h3 id="chapter-2-the-production-environment-at-google-from-the-viewpoint-of-an-sre">Chapter 2: The production environment at Google, from the viewpoint of an SRE</h3> <p><em>No notes on this chapter because I’m already pretty familiar with it. TODO: maybe go back and read this chapter in more detail.</em></p> <h3 id="chapter-3-embracing-risk">Chapter 3: Embracing risk</h3> <ul> <li>Ex: if a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability</li> </ul>
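<p><em>To make the 9s concrete, here's a rough back-of-the-envelope sketch (mine, not the book's; the function names and example numbers are made up) that turns an availability target into a time-based downtime budget and a request-based error budget:</em></p> <pre><code>def downtime_budget_minutes(availability, days=90):
    # Time-based view: minutes of full outage the target allows
    # over a window (default window: one quarter).
    return (1.0 - availability) * days * 24 * 60

def request_error_budget(availability, total_requests):
    # Request-based view (aggregate availability = successful / total):
    # how many failed requests the target allows.
    return int((1.0 - availability) * total_requests)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(target,
          round(downtime_budget_minutes(target), 1), 'min/quarter,',
          request_error_budget(target, 1_000_000_000), 'errors per 1B requests')
</code></pre> <p><em>The 99.99% row comes out to about 13 minutes per quarter, which matches the response-time bullet in the chapter 11 notes below.</em></p>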
<h4 id="managing-risk">Managing risk</h4> <ul> <li>Reliability isn’t linear in cost. It can easily cost 100x more to get one additional increment of reliability <ul> <li>Cost associated with redundant equipment</li> <li>Cost of building out features for reliability as opposed to “normal” features</li> <li>Goal: make systems reliable enough, but not too reliable!</li> </ul></li> </ul> <h4 id="measuring-service-risk">Measuring service risk</h4> <ul> <li>Standard practice: identify metric to represent property of system to optimize</li> <li>Possible metric = uptime / (uptime + downtime) <ul> <li>Problematic for a globally distributed service. What does uptime really mean?</li> </ul></li> <li>Aggregate availability = successful requests / total requests <ul> <li>Obv, not all requests are equal, but aggregate availability is an ok first order approximation</li> </ul></li> <li>Usually set quarterly targets</li> </ul> <h4 id="risk-tolerance-of-services">Risk tolerance of services</h4> <ul> <li>Usually not objectively obvious</li> <li>SREs work with product owners to translate business objectives into explicit objectives</li> </ul> <h4 id="identifying-risk-tolerance-of-consumer-services">Identifying risk tolerance of consumer services</h4> <p><em>TODO: maybe read this in detail on second pass</em></p> <h4 id="identifying-risk-tolerance-of-infrastructure-services">Identifying risk tolerance of infrastructure services</h4> <h5 id="target-availability">Target availability</h5> <ul> <li>Running ex: Bigtable <ul> <li>Some consumer services serve data directly from Bigtable -- need low latency and high reliability</li> <li>Some teams use Bigtable as a backing store for offline analysis -- care more about throughput than reliability</li> </ul></li> <li>Too expensive to meet all needs generically <ul> <li>Ex: Bigtable instance</li> <li>Low-latency Bigtable user wants low queue depth</li> <li>Throughput oriented Bigtable user wants moderate to high queue depth</li> <li>Success and failure are diametrically opposed in these two cases!</li> </ul></li> </ul> <h5 id="cost">Cost</h5> <ul> <li>Partition infra and offer different levels of service</li> <li>In addition to obv. benefits, allows service to externalize the cost of providing different levels of service (e.g., expect latency oriented service to be more expensive than throughput oriented service)</li> </ul> <h4 id="motivation-for-error-budgets">Motivation for error budgets</h4> <p><em>No notes on this because I already believe all of this.
Maybe go back and re-read this if involved in debate about this.</em></p> <h3 id="chapter-4-service-level-objectives">Chapter 4: Service level objectives</h3> <p><em>Note: skipping notes on terminology section.</em></p> <ul> <li>Ex: Chubby planned outages <ul> <li>Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google</li> <li>Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby</li> <li>Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down</li> </ul></li> </ul> <h4 id="what-do-you-and-your-users-care-about">What do you and your users care about?</h4> <ul> <li>Too many indicators: hard to pay attention</li> <li>Too few indicators: might ignore important behavior</li> <li>Different classes of services should have different indicators <ul> <li>User-facing: availability, latency, throughput</li> <li>Storage: latency, availability, durability</li> <li>Big data: throughput, end-to-end latency</li> </ul></li> <li>All systems care about correctness</li> </ul> <h4 id="collecting-indicators">Collecting indicators</h4> <ul> <li>Can often do naturally from server, but client-side metrics sometimes needed.</li> </ul> <h4 id="aggregation">Aggregation</h4> <ul> <li>Use distributions and not averages</li> <li>User studies show that people usually prefer slower average with better tail latency</li> <li>Standardize on common defs, e.g., average over 1 minute, average over tasks in cluster, etc. <ul> <li>Can have exceptions, but having reasonable defaults makes things easier</li> </ul></li> </ul> <h4 id="choosing-targets">Choosing targets</h4> <ul> <li>Don’t pick target based on current performance <ul> <li>Current performance may require heroic effort</li> </ul></li> <li>Keep it simple</li> <li>Avoid absolutes <ul> <li>Unreasonable to talk about “infinite” scale or “always” available</li> </ul></li> <li>Minimize number of SLOs</li> <li>Perfection can wait <ul> <li>Can always redefine SLOs over time</li> </ul></li> <li>SLOs set expectations <ul> <li>Keep a safety margin (internal SLOs can be defined more loosely than external SLOs)</li> </ul></li> <li>Don’t overachieve <ul> <li>See Chubby example, above</li> <li>Another example is making sure that the system isn’t too fast under light loads</li> </ul></li> </ul> <h3 id="chapter-5-eliminating-toil">Chapter 5: Eliminating toil</h3> <p>Carla Geisser: &quot;If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.&quot;</p> <ul> <li>Def: Toil <ul> <li>Not just “work I don’t want to do”</li> <li>Manual</li> <li>Repetitive</li> <li>Automatable</li> <li>Tactical</li> <li>No enduring value</li> <li>O(n) with service growth</li> </ul></li> <li>In surveys, find 33% toil on average <ul> <li>Numbers can be as low as 0% and as high as 80%</li> <li>Toil &gt; 50% is a sign that the manager should spread toil load more evenly</li> </ul></li> <li>Is toil always bad? <ul> <li>Predictable and repetitive tasks can be calming</li> <li>Can produce a sense of accomplishment, can be low-risk / low-stress activities</li> </ul></li> </ul> <p><em>Section on why toil is bad. Skipping notetaking for that section.</em></p> <h3 id="chapter-6-monitoring-distributed-systems">Chapter 6: Monitoring distributed systems</h3> <ul> <li>Why monitor? 
<ul> <li>Analyze long-term trends</li> <li>Compare over time or do experiments</li> <li>Alerting</li> <li>Building dashboards</li> <li>Debugging</li> </ul></li> </ul> <p><em>As Alex Clemmer is wont to say, our problem isn’t that we move too slowly, it’s that we build the wrong thing. I wonder how we could get from where we are today to having enough instrumentation to be able to make informed decisions when building new systems.</em></p> <h4 id="setting-reasonable-expectations">Setting reasonable expectations</h4> <ul> <li>Monitoring is non-trivial</li> <li>10-12 person SRE team typically has 1-2 people building and maintaining monitoring</li> <li>Number has decreased over time due to improvements in tooling/libs/centralized monitoring infra</li> <li>General trend towards simpler/faster monitoring systems, with better tools for post hoc analysis</li> <li>Avoid “magic” systems</li> <li>Limited success with complex dependency hierarchies (e.g., “if DB slow, alert for DB, otherwise alert for website”). <ul> <li>Used mostly (only?) for very stable parts of system</li> </ul></li> <li>Rules that generate alerts for humans should be simple to understand and represent a clear failure (rough sketch below)</li> </ul> <p><em>Avoiding magic includes avoiding ML?</em></p> <ul> <li>Lots of white-box monitoring</li> <li>Some black-box monitoring for critical stuff</li> <li>Four golden signals <ul> <li>Latency</li> <li>Traffic</li> <li>Errors</li> <li>Saturation</li> </ul></li> </ul> <p><em>Interesting examples from Bigtable and Gmail from chapter not transcribed. A lot of information on the importance of keeping alerts simple also not transcribed.</em></p> <h4 id="the-long-run">The long run</h4> <ul> <li>There’s often a tension between long-run and short-run availability</li> <li>Can sometimes fix unreliable systems through heroic effort, but that’s a burnout risk and also a failure risk</li> <li>Taking a controlled hit in short-term reliability is usually the better trade</li> </ul>
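<p><em>As a rough illustration of the “simple rules that represent a clear failure” bullet above, here's a tiny sketch of an alert on one golden signal (this is mine, not anything from the book or Borgmon; the class name, threshold, and duration are made up; the minimum-duration idea is borrowed from the chapter 10 notes below):</em></p> <pre><code>from collections import deque

class ErrorRateAlert:
    # One simple, human-interpretable rule: fire when the error ratio
    # stays above a threshold for min_duration consecutive evaluations.
    # Illustrative only; not Borgmon's rule language.
    def __init__(self, threshold=0.01, min_duration=2):
        self.threshold = threshold
        self.min_duration = min_duration
        self.recent = deque(maxlen=min_duration)

    def evaluate(self, errors, total):
        ratio = errors / total if total else 0.0
        self.recent.append(ratio &gt; self.threshold)
        # Fire only if every recent evaluation breached the threshold.
        return len(self.recent) == self.min_duration and all(self.recent)

alert = ErrorRateAlert()
for errors, total in [(5, 1000), (30, 1000), (42, 1000)]:
    print(alert.evaluate(errors, total))   # False, False, True
</code></pre>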
<h3 id="chapter-7-evolution-of-automation-at-google">Chapter 7: Evolution of automation at Google</h3> <ul> <li>“Automation is a force multiplier, not a panacea”</li> <li>Value of automation <ul> <li>Consistency</li> <li>Extensibility</li> <li>MTTR</li> <li>Faster non-repair actions</li> <li>Time savings</li> </ul></li> </ul> <p><em>Multiple interesting case studies and explanations skipped in notes.</em></p> <h3 id="chapter-8-release-engineering">Chapter 8: Release engineering</h3> <ul> <li>This is a specific job function at Google</li> </ul> <h4 id="release-engineer-role">Release engineer role</h4> <ul> <li>Release engineers work with SWEs and SREs to define how software is released <ul> <li>Allows dev teams to focus on dev work</li> </ul></li> <li>Define best practices <ul> <li>Compiler flags, formats for build ID tags, etc.</li> </ul></li> <li>Releases automated</li> <li>Models vary between teams <ul> <li>Could be “push on green” and deploy every build</li> <li>Could be hourly builds and deploys</li> <li>etc.</li> </ul></li> <li>Hermetic builds <ul> <li>Building same rev number should always give identical results</li> <li>Self-contained -- this includes versioning everything down to the compiler used</li> <li>Can cherry-pick fixes against an old rev to fix production software</li> </ul></li> <li>Virtually all changes require code review</li> <li>Branching <ul> <li>All code in main branch</li> <li>Releases are branched off</li> <li>Fixes can go from master to branch</li> <li>Branches never merged back</li> </ul></li> <li>Testing <ul> <li>CI</li> <li>Release process creates an audit trail that runs tests and shows that tests passed</li> </ul></li> <li>Config management <ul> <li>Deceptively simple, <a href="//danluu.com/postmortem-lessons/">can cause instability</a></li> </ul></li> <li>Many possible schemes (all involve storing config in source control and having strict config review)</li> <li>Use mainline for config -- config maintained at head and applied immediately <ul> <li>Originally used for Borg (and pre-Borg systems)</li> <li>Binary releases and config changes decoupled!</li> </ul></li> <li>Include config files and binaries in same package <ul> <li>Simple</li> <li>Tightly couples binary and config -- ok for projects with few config files or where few configs change</li> </ul></li> <li>Package config into “configuration packages” <ul> <li>Same hermetic principle as for code</li> </ul></li> <li>Release engineering shouldn’t be an afterthought! <ul> <li>Budget resources at beginning of dev cycle</li> </ul></li> </ul> <h3 id="chapter-9-simplicity">Chapter 9: Simplicity</h3> <ul> <li>Stability vs. agility <ul> <li>Can make things stable by freezing -- need to balance the two</li> <li>Reliable systems can increase agility</li> <li>Reliable rollouts make it easier to link changes to bugs</li> </ul></li> <li>Virtue of boring!</li> <li><a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">Essential vs. accidental complexity</a> <ul> <li>SREs should push back when accidental complexity is introduced</li> </ul></li> <li>Code is a liability <ul> <li>Remove dead code or other bloat</li> </ul></li> <li>Minimal APIs <ul> <li>Smaller APIs easier to test, more reliable</li> </ul></li> <li>Modularity <ul> <li>API versioning</li> <li>Same as code, where you’d avoid misc/util classes</li> </ul></li> <li>Releases <ul> <li>Small releases easier to measure</li> <li>Can’t tell what happened if we released 100 changes together</li> </ul></li> </ul> <h3 id="chapter-10-altering-from-time-series-data">Chapter 10: Alerting from time-series data</h3> <h4 id="borgmon">Borgmon</h4> <ul> <li>Similar-ish to <a href="https://prometheus.io/">Prometheus</a></li> <li>Common data format for logging</li> <li>Data used for both dashboards and alerts</li> <li>Formalized a legacy data format, “varz”, which allowed metrics to be viewed via HTTP (see the sketch after this list) <ul> <li>To view metrics manually, go to <a href="http://foo:80/varz">http://foo:80/varz</a></li> </ul></li> <li>Adding a metric only requires a single declaration in code <ul> <li>low user-cost to add new metric</li> </ul></li> <li>Borgmon fetches /varz from each target periodically <ul> <li>Also includes synthetic data like health check, if name was resolved, etc.</li> </ul></li> <li>Time series arena <ul> <li>Data stored in-memory, with checkpointing to disk</li> <li>Fixed sized allocation</li> <li>GC expires oldest entries when full</li> <li>conceptually a 2-d array with time on one axis and items on the other axis</li> <li>24 bytes for a data point -&gt; 1M unique time series for 12 hours at 1-minute intervals = 17 GB</li> </ul></li> <li>Borgmon rules <ul> <li>Algebraic expressions</li> <li>Compute time-series from other time-series</li> <li>Rules evaluated in parallel on a threadpool</li> </ul></li> <li>Counters vs.
gauges <ul> <li>Def: counters are non-decreasing</li> <li>Def: gauges can take any value</li> <li>Counters preferred to gauges because gauges can lose information depending on sampling interval</li> </ul></li> <li>Alerting <ul> <li>Borgmon rules can trigger alerts</li> <li>Have minimum duration to prevent “flapping”</li> <li>Usually set to two duration cycles so that missed collections don’t trigger an alert</li> </ul></li> <li>Scaling <ul> <li>Borgmon can take time-series data from other Borgmon (uses binary streaming protocol instead of the text-based varz protocol)</li> <li>Can have multiple tiers of filters</li> </ul></li> <li>Prober <ul> <li>Black-box monitoring that monitors what the user sees</li> <li>Can be queried with varz or directly send alerts to Alertmanager</li> </ul></li> <li>Configuration <ul> <li>Separation between definition of rules and targets being monitored</li> </ul></li> </ul>
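<p><em>Here's a toy sketch of the varz idea from the list above (mine, not Borgmon's; the metric names, port, and output format are made up): each task exposes its current metric values as plain text over HTTP, which is the part a collector would scrape periodically. The comments note the counter vs. gauge distinction:</em></p> <pre><code>import threading, time
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {'requests_total': 0,      # counter: only ever increases
           'inflight_requests': 0}   # gauge: can go up or down

class VarzHandler(BaseHTTPRequestHandler):
    # Serve one 'name value' pair per line, roughly the shape of a
    # /varz or /metrics page.
    def do_GET(self):
        body = '\n'.join(f'{k} {v}' for k, v in METRICS.items()).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    HTTPServer(('', port), VarzHandler).serve_forever()

if __name__ == '__main__':
    threading.Thread(target=serve, daemon=True).start()
    METRICS['requests_total'] += 1   # instrumenting code is one line per metric
    time.sleep(1)
</code></pre>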
<h3 id="chapter-11-being-on-call">Chapter 11: Being on-call</h3> <ul> <li>Typical response time <ul> <li>5 min for user-facing or other time-critical tasks</li> <li>30 min for less time-sensitive stuff</li> </ul></li> <li>Response times linked to SLOs <ul> <li>Ex: 99.99% for a quarter is 13 minutes of downtime; clearly can’t have response time above 13 minutes</li> <li>Services with looser SLOs can have response times in the 10s of minutes (or more?)</li> </ul></li> <li>Primary vs secondary on-call <ul> <li>Work distribution varies by team</li> <li>In some, secondary can be backup for primary</li> <li>In others, secondary handles non-urgent / non-paging events, primary handles pages</li> </ul></li> <li>Balanced on-call <ul> <li>Def: quantity: percent of time on-call</li> <li>Def: quality: number of incidents that occur while on call</li> </ul></li> </ul> <p><em>This is great. We should do this. People sometimes get really rough on-call rotations a few times in a row and considering the infrequency of on-call rotations there’s no reason to expect that this should randomly balance out over the course of a year or two.</em></p> <ul> <li>Balance in quantity <ul> <li>&gt;= 50% of SRE time goes into engineering</li> <li>Of remainder, no more than 25% spent on-call</li> </ul></li> <li>Prefer multi-site teams <ul> <li>Night shifts are bad for health, multi-site teams allow elimination of night shifts</li> </ul></li> <li>Balance in quality <ul> <li>On average, dealing with an incident (incl root-cause analysis, remediation, writing postmortem, fixing bug, etc.) takes 6 hours.</li> <li>=&gt; shouldn’t have more than 2 incidents in a 12-hour on-call shift</li> <li>To stay within upper bound, want very flat distribution of pages, with median value of 0</li> </ul></li> <li>Compensation -- extra pay for being on-call (time-off or cash)</li> </ul> <h3 id="chapter-12-effective-troubleshooting">Chapter 12: Effective troubleshooting</h3> <p><em>No notes for this chapter.</em></p> <h3 id="chapter-13-emergency-response">Chapter 13: Emergency response</h3> <ul> <li>Test-induced emergency <ul> <li><a href="http://queue.acm.org/detail.cfm?id=2371516">SREs break systems to see what happens</a></li> </ul></li> <li>Ex: want to flush out hidden dependencies on a distributed MySQL database <ul> <li>Plan: block access to 1/100 of DBs</li> <li>Response: dependent services report that they’re unable to access key systems</li> <li>SRE response: SRE aborts exercise, tries to roll back permissions change</li> <li>Rollback attempt fails</li> <li>Attempt to restore access to replicas works</li> <li>Normal operation restored in 1 hour</li> <li>What went well: dependent teams escalated issues immediately, were able to restore access</li> <li>What we learned: had an insufficient understanding of the system and its interaction with other systems, failed to follow incident response that would have informed customers of outage, hadn’t tested rollback procedures in test env</li> </ul></li> <li>Change-induced emergency <ul> <li>Changes can cause failures!</li> </ul></li> <li>Ex: config change to abuse prevention infra pushed on Friday triggered crash-loop bug <ul> <li>Almost all externally facing systems depend on this, become unavailable</li> <li>Many internal systems also have dependency and become unavailable</li> <li>Alerts start firing within seconds</li> <li>Within 5 minutes of config push, engineer who pushed change rolled back change and services started recovering</li> <li>What went well: monitoring fired immediately, incident management worked well, out-of-band communications systems kept people up to date even though many systems were down, luck (engineer who pushed change was following real-time comms channels, which isn’t part of the release procedure)</li> <li>What we learned: push to canary didn’t trigger same issue because it didn’t hit a specific config keyword combination; push was considered low-risk and went through less stringent canary process, alerting was too noisy during outage</li> </ul></li> <li>Process-induced emergency</li> </ul> <p><em>No notes on process-induced example.</em></p> <h3 id="chapter-14-managing-incidents">Chapter 14: Managing incidents</h3> <p><em>This is an area where we seem to actually be pretty good. No notes on this chapter.</em></p> <h3 id="chapter-15-postmortem-culture-learning-from-failure">Chapter 15: Postmortem culture: learning from failure</h3> <p><em><a href="//danluu.com/postmortem-lessons/">I'm in strong agreement with most of this chapter</a>. No notes.</em></p> <h3 id="chapter-16-tracking-outages">Chapter 16: Tracking outages</h3> <ul> <li>Escalator: centralized system that tracks ACKs to alerts, notifies other people if necessary, etc.</li> <li>Outalator: gives time-interleaved view of notifications for multiple queues <ul> <li>Also saves related email and allows marking some messages as “important”, can collapse non-important messages, etc.</li> </ul></li> </ul> <p><em>Our version of Escalator seems fine.
We could really use something like Outalator, though.</em></p> <h3 id="chapter-17-testing-for-reliability">Chapter 17: Testing for reliability</h3> <p><em>Preaching to the choir. No notes on this section. We could really do a lot better here, though.</em></p> <h3 id="chapter-18-software-engineering-in-sre">Chapter 18: Software engineering in SRE</h3> <ul> <li>Ex: Auxon, capacity planning automation tool</li> <li>Background: traditional capacity planning cycle <ul> <li>1) collect demand forecasts (quarters to years in advance)</li> <li>2) Plan allocations</li> <li>3) Review plan</li> <li>4) Deploy and config resources</li> </ul></li> <li>Traditional approach cons <ul> <li>Many things can affect plan: increase in efficiency, increase in adoption rate, cluster delivery date slips, etc.</li> <li>Even small changes require rechecking allocation plan</li> <li>Large changes may require total rewrite of plan</li> <li>Labor intensive and error prone</li> </ul></li> <li>Google solution: intent-based capacity planning <ul> <li>Specify requirements, not implementation</li> <li>Encode requirements and autogenerate a capacity plan</li> <li>In addition to saving labor, solvers can do better than human generated solutions =&gt; cost savings</li> </ul></li> <li>Ladder of examples of increasingly intent based planning <ul> <li>1) Want 50 cores in clusters X, Y, and Z -- why those resources in those clusters?</li> <li>2) Want 50-core footprint in any 3 clusters in region -- why that many resources and why 3?</li> <li>3) Want to meet demand with N+2 redundancy -- why N+2?</li> <li>4) Want 5 9s of reliability. Could find, for example, that N+2 isn’t sufficient</li> </ul></li> <li>Found that greatest gains are from going to (3) <ul> <li>Some sophisticated services may go for (4)</li> </ul></li> <li>Putting constraints into tools allows tradeoffs to be consistent across fleet <ul> <li>As opposed to making individual ad hoc decisions</li> </ul></li> <li>Auxon inputs <ul> <li>Requirements (e.g., “service must be N+2 per continent”, “frontend servers no more than 50ms away from backend servers”</li> <li>Dependencies</li> <li>Budget priorities</li> <li>Performance data (how a service scales)</li> <li>Demand forecast data (note that services like Colossus have derived forecasts from dependent services)</li> <li>Resource supply &amp; pricing</li> </ul></li> <li>Inputs go into solver (mixed-integer or linear programming solver)</li> </ul> <p><em>No notes on why SRE software, how to spin up a group, etc. TODO: re-read back half of this chapter and take notes if it’s ever directly relevant for me.</em></p> <h3 id="chapter-19-load-balancing-at-the-frontend">Chapter 19: Load balancing at the frontend</h3> <p><em>No notes on this section. Seems pretty similar to what we have in terms of high-level goals, and the chapter doesn’t go into low-level details. It’s notable that they do [redacted] differently from us, though. 
For more info on lower-level details, there’s <a href="https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf">the Maglev paper</a>.</em></p> <h3 id="chapter-20-load-balancing-in-the-datacenter">Chapter 20: Load balancing in the datacenter</h3> <ul> <li>Flow control</li> <li>Need to avoid unhealthy tasks</li> <li>Naive flow control for unhealthy tasks <ul> <li>Track number of requests to a backend</li> <li>Treat backend as unhealthy when threshold is reached</li> <li>Cons: generally terrible</li> </ul></li> <li>Health-based flow control <ul> <li>Backend task can be in one of three states: {healthy, refusing connections, lame duck}</li> <li>Lame duck state can still take connections, but sends backpressure request to all clients</li> <li>Lame duck state simplifies clean shutdown</li> </ul></li> <li>Def: subsetting: limiting pool of backend tasks that a client task can interact with <ul> <li>Clients in RPC system maintain pool of connections to backends</li> <li>Using pool reduces latency compared to doing setup/teardown when needed</li> <li>Inactive connections are relatively cheap, but not free, even in “inactive” mode (reduced health checks, UDP instead of TCP, etc.)</li> </ul></li> <li>Choosing the correct subset <ul> <li>Typ: 20-100, choose based on workload</li> </ul></li> <li>Subset selection: random <ul> <li>Bad utilization</li> </ul></li> <li>Subset selection: round robin <ul> <li>Order is permuted; each round has its own permutation</li> </ul></li> <li>Load balancing <ul> <li>Subset selection is for connection balancing, but we still need to balance load</li> </ul></li> <li>Load balancing: round robin <ul> <li>In practice, observe 2x difference between most loaded and least loaded</li> <li>In practice, most expensive request can be 1000x more expensive than cheapest request</li> <li>In addition, there’s random unpredictable variation in requests</li> </ul></li> <li>Load balancing: least-loaded round robin <ul> <li>Exactly what it sounds like: round-robin among least loaded backends</li> <li>Load appears to be measured in terms of connection count; may not always be the best metric</li> <li>This is per client, not globally, so it’s possible to send requests to a backend with many requests from other clients</li> <li>In practice, for large services, find that most-loaded task uses twice as much CPU as least-loaded; similar to normal round robin</li> </ul></li> <li>Load balancing: weighted round robin <ul> <li>Same as above, but weight with other factors</li> <li>In practice, much better load distribution than least-loaded round robin (see the sketch below)</li> </ul></li> </ul> <p><em>I wonder what Heroku meant when they responded to <a href="http://genius.com/James-somers-herokus-ugly-secret-annotated">Rap Genius</a> by saying “after extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections”.</em></p>
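<p><em>Here's a toy version of per-client subsetting plus weighted least-loaded picking (my sketch, not the algorithm from the chapter; the class and field names are made up). Load is approximated by active request count, which, as the notes above say, isn't always the best metric:</em></p> <pre><code>import random

class SubsetLoadBalancer:
    # Client-side picker: keep connections to a fixed subset of backends,
    # then send each request to the subset member with the lowest score
    # (active requests divided by a per-backend weight).
    def __init__(self, backends, subset_size=3, seed=0):
        rng = random.Random(seed)
        self.subset = rng.sample(backends, subset_size)
        self.active = {b: 0 for b in self.subset}
        self.weight = {b: 1.0 for b in self.subset}  # e.g. derived from utilization

    def pick(self):
        return min(self.subset, key=lambda b: self.active[b] / self.weight[b])

    def start(self, backend):
        self.active[backend] += 1

    def finish(self, backend):
        self.active[backend] -= 1

lb = SubsetLoadBalancer([f'task-{i}' for i in range(10)])
for _ in range(5):
    backend = lb.pick()
    lb.start(backend)
    print(backend, lb.active)
</code></pre>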
<h3 id="chapter-21-handling-overload">Chapter 21: Handling overload</h3> <ul> <li>Even with “good” load balancing, systems will become overloaded</li> <li>Typical strategy is to serve degraded responses, but under very high load that may not be possible</li> <li>Modeling capacity as QPS or as a function of requests (e.g., how many keys the requests read) is failure prone <ul> <li>These generally change slowly, but can change rapidly (e.g., because of a single checkin)</li> </ul></li> <li>Better solution: directly measure available resources</li> <li>CPU utilization is <em>usually</em> a good signal for provisioning <ul> <li>With GC, memory pressure turns into CPU utilization</li> <li>With other systems, can provision other resources such that CPU is likely to be limiting factor</li> <li>In cases where over-provisioning CPU is too expensive, take other resources into account</li> </ul></li> </ul> <p><em>How much does it cost to generally over-provision CPU like that?</em></p> <ul> <li>Client-side throttling <ul> <li>Backends start rejecting requests when customer hits quota</li> <li>Requests still use resources, even when rejected -- without throttling, backends can spend most of their resources on rejecting requests</li> </ul></li> <li>Criticality <ul> <li>Seems to be priority but with a different name?</li> <li>First-class notion in RPC system</li> <li>Client-side throttling keeps separate stats for each level of criticality</li> <li>By default, criticality is propagated through subsequent RPCs</li> </ul></li> <li>Handling overloaded errors <ul> <li>Shed load to other DCs if DC is overloaded</li> <li>Shed load to other backends if DC is ok but some backends are overloaded</li> </ul></li> <li>Clients retry when they get an overloaded response (see the sketch below) <ul> <li>Per-request retry budget (3)</li> <li>Per-client retry budget (10%)</li> <li>Failed retries from client cause “overloaded; don’t retry” response to be returned upstream</li> </ul></li> </ul> <p><em>Having a “don’t retry” response is “obvious”, but relatively rare in practice. A lot of real systems have a problem with failed retries causing more retries up the stack. This is especially true when crossing a hardware/software boundary (e.g., filesystem read causes many retries on DVD/SSD/spinning disk, fails, and then gets retried at the filesystem level), but seems to be generally true in pure software too.</em></p>
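<p><em>A sketch of what the client-side retry bullets above might look like (mine, not the book's; the 3-attempt cap and 10% budget are the numbers from the notes, everything else is made up), including an explicit “don't retry” outcome and the randomized exponential backoff recommended in the next chapter:</em></p> <pre><code>import random, time

class RetryingClient:
    # At most 3 attempts per request, retries capped at ~10% of total
    # requests seen by this client, randomized exponential backoff,
    # and an explicit 'do not retry' result to propagate upstream.
    def __init__(self, send, max_attempts=3, budget_ratio=0.1):
        self.send = send
        self.max_attempts = max_attempts
        self.budget_ratio = budget_ratio
        self.requests = 0
        self.retries = 0

    def call(self, req):
        self.requests += 1
        for attempt in range(self.max_attempts):
            if self.send(req):
                return 'ok'
            out_of_budget = self.retries &gt;= self.budget_ratio * self.requests
            if attempt + 1 == self.max_attempts or out_of_budget:
                return 'overloaded; do not retry'
            self.retries += 1
            # Randomized exponential backoff: 2^attempt scaled by jitter.
            time.sleep(random.uniform(0, 0.01 * (2 ** attempt)))

flaky_backend = lambda req: random.random() &gt; 0.5
client = RetryingClient(flaky_backend)
print([client.call(i) for i in range(5)])
</code></pre>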
<h3 id="chapter-22-addressing-cascading-failures">Chapter 22: Addressing cascading failures</h3> <ul> <li>Typical failure scenarios?</li> <li>Server overload</li> <li>Ex: have two servers <ul> <li>One gets overloaded, failing</li> <li>Other one now gets all traffic and also fails</li> </ul></li> <li>Resource exhaustion <ul> <li>CPU/memory/threads/file descriptors/etc.</li> </ul></li> <li>Ex: dependencies among resources <ul> <li>1) Java frontend has poorly tuned GC params</li> <li>2) Frontend runs out of CPU due to GC</li> <li>3) CPU exhaustion slows down requests</li> <li>4) Increased queue depth uses more RAM</li> <li>5) Fixed memory allocation for entire frontend means that less memory is available for caching</li> <li>6) Lower hit rate</li> <li>7) More requests into backend</li> <li>8) Backend runs out of CPU or threads</li> <li>9) Health checks fail, starting cascading failure</li> <li>Difficult to determine cause during outage</li> </ul></li> <li>Note: policies that avoid servers that serve errors can make things worse <ul> <li>fewer backends available, which get too many requests, which then become unavailable</li> </ul></li> <li>Preventing server overload <ul> <li>Load test! Must have realistic environment</li> <li>Serve degraded results</li> <li>Fail cheaply and early when overloaded</li> <li>Have higher-level systems reject requests (at reverse proxy, load balancer, and on task level)</li> <li>Perform capacity planning</li> </ul></li> <li>Queue management <ul> <li>Queues do nothing in steady state</li> <li>Queued reqs consume memory and increase latency</li> <li>If traffic is steady-ish, better to keep small queue size (say, 50% or less of thread pool size)</li> <li>Ex: Gmail uses queueless servers with failover when threads are full</li> <li>For bursty workloads, queue size should be function of #threads, time per req, size/freq of bursts</li> <li>See also, <a href="http://queue.acm.org/detail.cfm?id=2839461">adaptive LIFO and CoDel</a></li> </ul></li> <li>Graceful degradation <ul> <li>Note that it’s important to test graceful degradation path, maybe by running a small set of servers near overload regularly, since this path is rarely exercised under normal circumstances</li> <li>Best to keep simple and easy to understand</li> </ul></li> <li>Retries (sketch below) <ul> <li>Always use randomized exponential backoff</li> <li>See previous chapter on only retrying at a single level</li> <li>Consider having a server-wide retry budget</li> </ul></li> <li>Deadlines <ul> <li>Don’t do work where deadline has been missed (common theme for cascading failure)</li> <li>At each stage, check that deadline hasn’t been hit</li> <li>Deadlines should be propagated (e.g., even through RPCs)</li> </ul></li> <li>Bimodal latency <ul> <li>Ex: problem with long deadline</li> <li>Say frontend has 10 servers, 100 threads each (1k threads of total capacity)</li> <li>Normal operation: 1k QPS, reqs take 100ms =&gt; 100 worker threads occupied (1k QPS * .1s)</li> <li>Say 5% of operations don’t complete and there’s a 100s deadline</li> <li>That consumes 5k threads (50 QPS * 100s)</li> <li>Frontend oversubscribed by 5x. Success rate = 1k / (5k + 95) = 19.6% =&gt; 80.4% error rate</li> </ul></li> </ul> <p><em>Using deadlines instead of timeouts is great. We should really be more systematic about this.</em></p> <p><em>Not allowing systems to fill up with pointless zombie requests by setting reasonable deadlines is “obvious”, but a lot of real systems seem to have arbitrary timeouts at nice round human numbers (30s, 60s, 100s, etc.) instead of deadlines that are assigned with load/cascading failures in mind.</em></p>
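<p><em>A small sketch tying the retry and deadline bullets above together: retries use capped, randomized exponential backoff, and every attempt first checks the request's absolute deadline so that no work is done, and no retry is scheduled, once the deadline has passed. The function and parameter names are made up for illustration; the deadline is an absolute time that callers would propagate along with the request:</em></p> <pre><code>import random
import time

def call_with_backoff(attempt_fn, deadline, base=0.05, cap=2.0, max_attempts=5):
    # attempt_fn() makes one try and raises ConnectionError on failure.
    # deadline is an absolute time.time() value propagated from the caller.
    for attempt in range(max_attempts):
        if time.time() &gt;= deadline:
            raise TimeoutError('deadline exceeded; not attempting')
        try:
            return attempt_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential.
            sleep_for = random.uniform(0, min(cap, base * 2 ** attempt))
            if time.time() + sleep_for &gt;= deadline:
                raise TimeoutError('deadline exceeded; not retrying')
            time.sleep(sleep_for)
</code></pre>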
<ul> <li>Try to avoid intra-layer communication <ul> <li>Simpler, avoids possible cascading failure paths</li> </ul></li> <li>Testing for cascading failures <ul> <li>Load test components!</li> <li>Load testing both reveals the breaking point and ferrets out components that will totally fall over under load</li> <li>Make sure to test each component separately</li> <li>Test non-critical backends (e.g., make sure that spelling suggestions for search don’t impede the critical path)</li> </ul></li> <li>Immediate steps to address cascading failures <ul> <li>Increase resources</li> <li>Temporarily stop health check failures/deaths</li> <li>Restart servers (only if that would help -- e.g., in GC death spiral or deadlock)</li> <li>Drop traffic -- drastic, last resort</li> <li>Enter degraded mode -- requires having built this into service previously</li> <li>Eliminate batch load</li> <li>Eliminate bad traffic</li> </ul></li> </ul> <h3 id="chapter-23-distributed-consensus-for-reliability">Chapter 23: Distributed consensus for reliability</h3> <ul> <li>How do we agree on questions like… <ul> <li>Which process is the leader of a group of processes?</li> <li>What is the set of processes in a group?</li> <li>Has a message been successfully committed to a distributed queue?</li> <li>Does a process hold a particular lease?</li> <li>What’s the value in a datastore for a particular key?</li> </ul></li> <li>Ex1: split-brain <ul> <li>Service has replicated file servers in different racks</li> <li>Must avoid writing simultaneously to both file servers in a set to avoid data corruption</li> <li>Each pair of file servers has one leader &amp; one follower</li> <li>Servers monitor each other via heartbeats</li> <li>If one server can’t contact the other, it sends a STONITH (shoot the other node in the head)</li> <li>But what happens if the network is slow or packets get dropped?</li> <li>What happens if both servers issue STONITH?</li> </ul></li> </ul> <p><em>This reminds me of one of my favorite distributed database postmortems. The database is configured as a ring, where each node talks to and replicates data into a “neighborhood” of 5 servers. If some machines in the neighborhood go down, other servers join the neighborhood and data gets replicated appropriately.</em></p> <p><em>Sounds good, but in the case where a server goes bad and decides that no data exists and all of its neighbors are bad, it can return results faster than any of its neighbors, as well as tell its neighbors that they’re all bad. Because the bad server has no data it’s very fast and can report that its neighbors are bad faster than its neighbors can report that it’s bad. Whoops!</em></p> <ul> <li>Ex2: failover requires human intervention <ul> <li>A highly sharded DB has a primary for each shard, which replicates to a secondary in another DC</li> <li>External health checks decide if the primary should fail over to its secondary</li> <li>If the primary can’t see the secondary, it makes itself unavailable to avoid the problems from “Ex1”</li> <li>This increases operational load</li> <li>Problems are correlated and this is relatively likely to run into problems when people are busy with other issues</li> <li>If there’s a network issue, there’s no reason to think that a human will have a better view into the state of the world than machines in the system</li> </ul></li> <li>Ex3: faulty group-membership algorithms <ul> <li><em>What it sounds like.
No notes on this part</em></li> </ul></li> <li>Impossibility results <ul> <li><a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP</a>: partitions can’t be avoided in real networks, so in practice you’re choosing between C and A</li> <li><a href="http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/">FLP</a>: async distributed consensus can’t guarantee progress with unreliable network</li> </ul></li> </ul> <h4 id="paxos-http-research-microsoft-com-en-us-um-people-lamport-pubs-paxos-simple-pdf"><a href="http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf">Paxos</a></h4> <ul> <li>Sequence of proposals, which may or may not be accepted by the majority of processes <ul> <li>Not accepted =&gt; fails</li> <li>Sequence number per proposal, must be unique across system</li> </ul></li> <li>Proposal (toy sketch at the end of this chapter’s notes) <ul> <li>Proposer sends seq number to acceptors</li> <li>Acceptor agrees if it hasn’t seen a higher seq number</li> <li>Proposers can try again with higher seq number</li> <li>If proposer recvs agreement from majority, it commits by sending commit message with value</li> <li>Acceptors must journal to persistent storage when they accept</li> </ul></li> </ul> <h4 id="patterns">Patterns</h4> <ul> <li>Distributed consensus algorithms are a low-level primitive</li> <li>Reliable replicated state machines <ul> <li>Fundamental building block for data config/storage, locking, leader election, etc.</li> <li>See these papers: <a href="https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf">Schneider</a>, <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.8238&amp;rep=rep1&amp;type=pdf">Aguilera</a>, <a href="http://www.cnds.jhu.edu/pub/papers/psb_ladis_08.pdf">Amir &amp; Kirsch</a></li> </ul></li> <li>Reliable replicated data and config stores <ul> <li>Non distributed-consensus-based systems often use timestamps: problematic because clock synchrony can't be guaranteed</li> <li>See <a href="http://research.google.com/archive/spanner.html">Spanner paper</a> for an example of using distributed consensus</li> </ul></li> <li>Leader election <ul> <li>Equivalent to distributed consensus</li> <li>Where work of the leader can be performed by one process or sharded, the leader election pattern allows writing a distributed system as if it were a simple program</li> <li>Used by, for example, GFS and Colossus</li> </ul></li> <li>Distributed coordination and locking services <ul> <li>Barrier used, for example, in MapReduce to make sure that Map is finished before Reduce proceeds</li> </ul></li> <li>Distributed queues and messaging <ul> <li>Queues: can tolerate failures from worker nodes, but system needs to ensure that claimed tasks are processed</li> <li>Can use leases instead of removal from queue</li> <li>Using RSM means that system can continue processing even when queue goes down</li> </ul></li> <li>Performance <ul> <li>Conventional wisdom that consensus algorithms can't be used for high-throughput low-latency systems is false</li> <li>Distributed consensus at the core of many Google systems</li> <li>Scale makes this worse for Google than most other companies, but it still works</li> </ul></li> <li>Multi-Paxos <ul> <li>Strong leader process: unless a leader has not yet been elected or a failure occurs, only one round trip required to reach consensus</li> <li>Note that another process in the group can propose at any time</li> <li>Can ping pong back and forth and pseudo-livelock</li> <li>Not unique to Multi-Paxos</li> <li>Standard solutions are to elect a proposer process or use rotating proposer</li>
</ul></li> <li>Scaling read-heavy workloads <ul> <li>Ex: Photon allows reads from any replica</li> <li>Read from stale replica requires extra work, but doesn't produce incorrect results</li> <li>To guarantee reads are up to date, do one of the following:</li> <li>1) Perform a read-only consensus operation</li> <li>2) Read data from replica that's guaranteed to be most-up-to-date (stable leader can provide this guarantee)</li> <li>3) Use quorum leases</li> </ul></li> <li><a href="http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105_abs.shtml">Quorum leases</a> <ul> <li>Replicas can be granted lease over some (or all) data in the system</li> </ul></li> <li>Fast Paxos <ul> <li>Designed to be faster over WAN</li> <li>Each client can send <code>Propose</code> to each member of a group of acceptors directly, instead of through a leader</li> <li><a href="http://www.sysnet.ucsd.edu/sysnet/miscpapers/hotdep07.pdf">Not necessarily faster than classic Paxos</a> -- if RTT to acceptors is long, we've traded one message across slow link plus N in parallel across fast link for N across slow link</li> </ul></li> <li>Stable leaders <ul> <li>&quot;Almost all distributed consensus systems that have been designed with performance in mind use either the single stable leader pattern or a system of rotating leadership&quot;</li> </ul></li> </ul> <p><em>TODO: finish this chapter?</em></p>
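<p><em>To make the prepare/accept flow from the Paxos summary above concrete, here's a toy single-decree acceptor (my sketch for intuition, not anything from the book). It leaves out the proposer side, and a real acceptor must journal its promised/accepted state to persistent storage before replying, as the notes say:</em></p> <pre><code>class PaxosAcceptor:
    # Toy single-decree Paxos acceptor: no persistence, no networking.
    def __init__(self):
        self.promised_n = -1      # highest sequence number promised so far
        self.accepted_n = -1      # sequence number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        # Agree only if we haven't already promised a higher sequence number,
        # and report back anything we've already accepted.
        if n &gt; self.promised_n:
            self.promised_n = n   # a real acceptor journals this before replying
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        # Accept the commit unless a higher-numbered prepare arrived in the meantime.
        if n &gt;= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value  # journal before acknowledging
            return True
        return False
</code></pre> <p><em>A proposer that gets prepare agreement from a majority has to commit the value from the highest-numbered proposal any acceptor reported back, and only gets to use its own value if none was reported.</em></p>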
<h3 id="chapter-24-distributed-cron">Chapter 24: Distributed cron</h3> <p><em>TODO: go back and read in more detail, take notes.</em></p> <h3 id="chapter-25-data-processing-pipelines">Chapter 25: Data processing pipelines</h3> <ul> <li>Examples of this are MapReduce or Flume</li> <li>Convenient and easy to reason about the happy case, but fragile <ul> <li>Initial install is usually ok because worker sizing, chunking, parameters are carefully tuned</li> <li>Over time, load changes, causes problems</li> </ul></li> </ul> <h3 id="chapter-26-data-integrity">Chapter 26: Data integrity</h3> <ul> <li>Definition not necessarily obvious <ul> <li>If an interface bug causes Gmail to fail to display messages, that’s the same as the data being gone from the user’s standpoint</li> <li>99.99% uptime means 1 hour of downtime per year. Probably ok for most apps</li> <li>99.99% good bytes in a 2GB file means 200K corrupt. Probably not ok for most apps</li> </ul></li> <li>Backup is non-trivial <ul> <li>May have mixture of transactional and non-transactional backup and restore</li> <li>Different versions of business logic might be live at once</li> <li>If services are independently versioned, maybe have many combinations of versions</li> <li>Replicas aren’t sufficient -- replicas may sync corruption</li> </ul></li> <li>Study of 19 data recovery efforts at Google <ul> <li>Most common user-visible data loss caused by deletion or loss of referential integrity due to software bugs</li> <li>Hardest cases were low-grade corruption discovered weeks to months later</li> </ul></li> </ul> <h4 id="defense-in-depth">Defense in depth</h4> <ul> <li>First layer: soft deletion <ul> <li>Users should be able to delete their data</li> <li>But that means that users will be able to accidentally delete their data</li> <li>Also, account hijacking, etc.</li> <li>Accidental deletion can also happen due to bugs</li> <li>Soft deletion delays actual deletion for some period of time</li> </ul></li> <li>Second layer: backups <ul> <li>Need to figure out how much data it’s ok to lose during recovery, how long recovery can take, and how far back backups need to go</li> <li>Want backups to go back forever, since corruption can go unnoticed for months (or longer)</li> <li>But changes to code and schema can make recovery of older backups expensive</li> <li>Google usually has a 30 to 90 day window, depending on the service</li> </ul></li> <li>Third layer: early detection <ul> <li>Out-of-band integrity checks</li> <li>Hard to do this right!</li> <li>Correct changes can cause checkers to fail</li> <li>But loosening checks can cause failures to get missed</li> </ul></li> </ul> <p><em>No notes on the two interesting case studies covered.</em></p> <h3 id="chapter-27-reliable-product-launches-at-scale">Chapter 27: Reliable product launches at scale</h3> <p><em>No notes on this chapter in particular. A lot of this material is covered by or at least implied by material in other chapters. Probably worth at least looking at example checklist items and action items before thinking about launch strategy, though. Also see appendix E, launch coordination checklist.</em></p> <h3 id="chapters-28-32-various-chapters-on-management">Chapters 28-32: Various chapters on management</h3> <p><em>No notes on these.</em></p> <h3 id="notes-on-the-notes">Notes on the notes</h3> <p><em>I like this book a lot. If you care about building reliable systems, reading through this book and seeing what the teams around you don’t do seems like a good exercise. That being said, the book isn't perfect. The two big downsides for me stem from the same issue: this is one of those books that's a collection of chapters by different people. Some of the editors are better than others, which means that some of the chapters are clearer than others, and because the chapters seem designed to be readable as standalone chapters, there's a fair amount of redundancy in the book if you just read it straight through. Depending on how you plan to use the book, that can be a positive, but it's a negative to me. But even including the downsides, I'd say that this is the most valuable technical book I've read in the past year and I've covered probably 20% of the content in this set of notes.
If you really like these notes, you'll probably want to read the <a href="https://www.amazon.com/gp/product/149192912X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=149192912X&amp;linkId=2a578b357abc8b995368a039dd517601">full book</a>.</em></p> <p><em>If you found this set of notes way too dry, maybe try <a href="http://nostalgebraist.tumblr.com/post/142489665564/brazenautomaton-nostalgebraist-the-book-bad">this much more entertaining set of notes on a totally different book</a>. If you found this to only be slightly too dry, maybe try <a href="http://danluu.com/postmortem-lessons/">this set of notes on classes of errors commonly seen in postmortems</a>. In any case, <a href="https://twitter.com/danluu">I’d appreciate feedback on these notes</a>. Writing up notes is an experiment for me. If people find these useful, I'll try to write up notes on books I read more often. If not, I might try a different approach to writing up notes or some other kind of post entirely.</em></p> We only hire the trendiest programmer-moneyball/ Mon, 21 Mar 2016 00:23:44 -0700 programmer-moneyball/ <p>An acquaintance of mine, let’s call him Mike, is looking for work after getting laid off from a contract role at Microsoft, which has happened to a lot of people I know. Like me, Mike has 11 years in industry. Unlike me, he doesn't know a lot of folks at trendy companies, so I passed his resume around to some engineers I know at companies that are desperately hiring. My engineering friends thought Mike's resume was fine, but most recruiters rejected him in the resume screening phase.</p> <p>When I asked why he was getting rejected, the typical response I got was:</p> <ol> <li>Tech experience is in irrelevant tech</li> <li>&quot;Experience is too random, with payments, mobile, data analytics, and UX.&quot;</li> <li>Contractors are generally not the strongest technically</li> </ol> <p>This response is something from a recruiter that was relayed to me through an engineer; the engineer was incredulous at the response from the recruiter. Just so we have a name, let's call this company TrendCo. It's one of the thousands of companies that claims to have world class engineers, hire only the best, etc. This is one company in particular, but it's representative of a large class of companies and the responses Mike has gotten.</p> <p>Anyway, (1) is code for “Mike's a .NET dev, and we don't like people with Windows experience”.</p> <p>I'm familiar with TrendCo's tech stack, which multiple employees have told me is “a tire fire”. Their core systems top out under 1k QPS, which has caused them to go down under load. Mike has worked on systems that can handle multiple orders of magnitude more load, but his experience is, apparently, irrelevant.</p> <p>(2) is hard to make sense of. I've interviewed at TrendCo and one of the selling points is that it's a startup where you get to do a lot of different things. TrendCo almost exclusively hires generalists but Mike is, apparently, too general for them.</p> <p>(3), combined with (1), gets at what TrendCo's real complaint with Mike is. He's not their type. TrendCo's median employee is a recent graduate from one of maybe five “top” schools with 0-2 years of experience. 
They have a few experienced hires, but not many, and most of their experienced hires have something trendy on their resume, not a boring old company like Microsoft.</p> <p>Whether or not you think there's anything wrong with having a type and rejecting people who aren't your type, as <a href="https://news.ycombinator.com/item?id=11290662">Thomas Ptacek has observed</a>, if your type is the same type everyone else is competing for, “you are competing for talent with the wealthiest (or most overfunded) tech companies in the market”.</p> <p>If <a href="https://docs.google.com/spreadsheets/u/1/d/1UnLz40Our1Ids-O0sz26uPNCF6cQjwosrZQY4VLdflU/htmlview?pli=1&amp;sle=true#">you look at new grad hiring data</a>, it looks like FB is offering people with zero experience &gt; $100k/yr salary, $100k signing bonus, and $150k in RSUs, for an amortized total comp &gt; $160k/yr, including $240k in the first year. Google's package has &gt; $100k salary, a variable signing bonus in the $10k range, and $187k in RSUs. That comes in a bit lower than FB, but it's much higher than most companies that claim to only hire the best are willing to pay for a new grad. Keep in mind that <a href="startup-tradeoffs/#fn:C">compensation can go much higher for contested candidates</a>, and that <a href="https://news.ycombinator.com/item?id=11314449">compensation for experienced candidates is probably higher than you expect if you're not a hiring manager who's seen what competitive offers look like today</a>.</p> <p>By going after people with the most sought after qualifications, TrendCo has narrowed their options down to either paying out the nose for employees, or offering non-competitive compensation packages. TrendCo has chosen the latter option, which partially explains why they have, proportionally, so few senior devs -- the compensation delta increases as you get more senior, and you have to make a really compelling pitch to someone to get them to choose TrendCo when you're offering $150k/yr less than the competition. And as people get more experience, they're less likely to believe the part of the pitch that explains how much the stock options are worth.</p> <p>Just to be clear, I don't have anything against people with trendy backgrounds. I know a lot of these people who have impeccable interviewing skills and got 5-10 strong offers last time they looked for work. I've worked with someone like that: he was just out of school, his total comp package was north of $200k/yr, and he was worth every penny. But think about that for a minute. He had strong offers from six different companies, of which he was going to accept at most one. Including lunch and phone screens, the companies put in an average of eight hours apiece interviewing him. And because they wanted to hire him so much, the companies that were really serious spent an average of another five hours apiece of engineer time trying to convince him to take their offer. Because these companies had, on average, a ⅙ chance of hiring this person, they had to spend at least an expected (8+5) * 6 = 78 hours of engineer time<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">1</a></sup>. People with great backgrounds are, on average, pretty great, but they're really hard to hire.
It's much easier to hire people who are underrated, especially if you're not paying market rates.</p> <p>I've seen this hyperfocus on hiring people with trendy backgrounds from both sides of the table, and it's ridiculous from both sides.</p> <p>On the referring side of hiring, I tried to get a startup I was at to hire the most interesting and creative programmer I've ever met, who was tragically underemployed for years because of his low GPA in college. We declined to hire him and I was told that his low GPA meant that he couldn't be very smart. Years later, Google took a chance on him and he's been killing it since then. He actually convinced me to join Google, and <a href="//danluu.com/tech-discrimination/">at Google, I tried to hire one of the most productive programmers I know, who was promptly rejected by a recruiter for not being technical enough</a>.</p> <p>On the candidate side of hiring, I've experienced both being in demand and being almost unhireable. Because I did my undergrad at Wisconsin, which is one of the 25 schools that claims to be a top 10 cs/engineering school, I had recruiters beating down my door when I graduated. But that's silly -- that I attended Wisconsin wasn't anything about me; I just happened to grow up in the state of Wisconsin. If I grew up in Utah, I probably would have ended up going to school at Utah. When I've compared notes with folks who attended schools like Utah and Boise State, their education is basically the same as mine. Wisconsin's rank as an engineering school comes from having professors who do great research which is, at best, weakly correlated to <a href="//danluu.com/teach-debugging/">effectiveness at actually teaching undergrads</a>. Despite getting the same engineering education you could get at hundreds of other schools, I had a very easy time getting interviews and finding a great job.</p> <p>I spent 7.5 years in that great job, at <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>. Centaur has a pretty strong reputation among hardware companies in Austin who've been around for a while, and I had an easy time shopping for local jobs at hardware companies. But I don't know of any software folks who've heard of Centaur, and as a result I couldn't get an interview at most software companies. There were even a couple of cases where I had really strong internal referrals and the recruiters still didn't want to talk to me, which I found funny and my friends found frustrating.</p> <p>When I could get interviews, they often went poorly. A typical rejection reason was something like “we process millions of transactions per day here and we really need someone with more relevant experience who can handle these things without ramping up”. And then Google took a chance on me and I was the second person on a project to get serious about deep learning performance, which was a 20%-time project until just before I joined. <a href="https://www.google.com/patents/US20160342889">We built the fastest deep learning system in the world</a>. 
From what I hear, they're now on the Nth generation of that project, but even the first generation thing we built had better per-rack performance and performance per dollar than any other production system out there for years (excluding follow-ons to that project, of course).</p> <p>While I was at Google I had recruiters pinging me about job opportunities all the time. And now that I'm at boring old Microsoft, I don't get nearly as many recruiters reaching out to me. I've been considering looking for work<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup> and I wonder how trendy I'll be if I do. Experience in irrelevant tech? Check! Random experience? Check! Contractor? Well, no. But two out of three ain't bad.</p> <p>My point here isn't anything about me. It's that here's this person<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup> who has wildly different levels of attractiveness to employers at various times, mostly due to superficial factors that don't have much to do with actual productivity. This is a really common story among people who end up at Google. If you hired them before they worked at Google, you might have gotten a great deal! But no one (except Google) was willing to take that chance. There's something to be said for paying more to get a known quantity, but a company like TrendCo that isn't willing to do that cripples its hiring pipeline by only going after people with trendy resumes, and if you wouldn't hire someone before they worked at Google and would after, the main thing you know is that the person is above average at whiteboard algorithms quizzes (or got lucky one day).</p> <p>I don't mean to pick on startups like TrendCo in particular. Boring old companies have their version of what a trendy background is, too. A friend of mine who's desperate to hire can't do anything with some of the resumes I pass his way because his group isn't allowed to hire anyone without a degree. Another person I know is in a similar situation because his group has a bright-line rule that causes them to reject people who aren't already employed.</p> <p>Not only are these decisions non-optimal for companies, they create a path dependence in employment outcomes that causes individual good (or bad) events to follow people around for decades. You can see similar effects in the literature on career earnings in a variety of fields<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>.</p> <p><a href="https://news.ycombinator.com/item?id=7260087">Thomas Ptacek has this great line about how</a> “we interview people whose only prior work experience is &quot;Line of Business .NET Developer&quot;, and they end up showing us how to write exploits for elliptic curve partial nonce bias attacks that involve Fourier transforms and BKZ lattice reduction steps that take 6 hours to run.” If you work at a company that doesn't reject people out of hand for not being trendy, you'll hear lots of stories like this. Some of the best people I've worked with went to schools you've never heard of and worked at companies you've never heard of until they ended up at Google. 
Some are still at companies you've never heard of.</p> <p>If you read <a href="https://zachholman.com/talk/firing-people">Zach Holman</a>, you may recall that when he said that he was fired, someone responded with “If an employer has decided to fire you, then you've not only failed at your job, you've failed as a human being.” A lot of people treat employment status and credentials as measures of the inherent worth of individuals. But a large component of these markers of success, not to mention success itself, is luck.</p> <h3 id="solutions">Solutions?</h3> <p>I can understand why this happens. At an individual level, we're prone to the <a href="https://en.wikipedia.org/wiki/Fundamental_attribution_error">fundamental attribution error</a>. At an organizational level, fast growing organizations burn a large fraction of their time on interviews, and the obvious way to cut down on time spent interviewing is to only interview people with &quot;good&quot; qualifications. Unfortunately, that's counterproductive when you're chasing after the same tiny pool of people as everyone else.</p> <p>Here are the beginnings of some ideas. I'm open to better suggestions!</p> <h4 id="moneyball">Moneyball</h4> <p>Billy Beane and Paul DePodesta took the Oakland A's, a baseball franchise with nowhere near the budget of top teams, and created what was arguably the best team in baseball by finding and “hiring” players who were statistically underrated for their price. The thing I find really amazing about this is that they publicly talked about doing this, and then Michael Lewis wrote a book, titled <a href="https://www.amazon.com/gp/product/0393324818/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0393324818&amp;linkId=65d86d3a72b4c1ba73e8d3d52796eae1">Moneyball</a>, about them doing this. Despite the publicity, it took years for enough competitors to catch on enough that the A's strategy stopped giving them a very large edge.</p> <p>You can see the exact same thing in software hiring. Thomas Ptacek has been talking about how they hired unusually effective people at Matasano for at least half a decade, maybe more. Google bigwigs regularly talk about the hiring data they have and what hasn't worked. I believe that, years ago, they talked about how focusing on top schools wasn't effective and didn't turn up employees with better performance, but that doesn't stop TrendCo from focusing hiring efforts on top schools.</p> <h4 id="training-mentorship">Training / mentorship</h4> <p>You see a lot of talk about moneyball, but for some reason people are less excited about… trainingball? Practiceball? Whatever you want to call taking people who aren't “the best” and teaching them how to be “the best”.</p> <p>This is another one where it's easy to see the impact through the lens of sports, because there is so much good performance data. Since it's basketball season, if we look at college basketball, for example, we can identify a handful of programs that regularly take unremarkable inputs and produce good outputs. And that's against a field of competitors where every team is expected to coach and train their players.</p> <p>When it comes to tech companies, most of the competition isn't even trying.
At the median large company, you get a couple days of “orientation”, which is mostly legal mumbo jumbo and paperwork, and the occasional “training”, which is usually a set of videos and a set of multiple-choice questions that are offered up for compliance reasons, not to teach anyone anything. And you'll be assigned a mentor who, more likely than not, won't provide any actual mentorship. Startups tend to be even worse! It's not hard to do better than that.</p> <p>Considering how much money companies spend on <a href="https://news.ycombinator.com/item?id=11314449">hiring and retaining</a> &quot;the best&quot;, you'd expect them to spend at least a (non-zero) fraction on training. It's also quite strange that companies don't focus more on training and mentorship when trying to recruit. Specific things I've learned in specific roles have been tremendously valuable to me, but it's almost always either been a happy accident, or something I went out of my way to do. Most companies don't focus on this stuff. Sure, recruiters will tell you that &quot;you'll learn so much more here than at Google, which will make you more valuable&quot;, implying that it's worth the $150k/yr pay cut, but if you ask them what, specifically, they do to make a better learning environment than Google, they never have a good answer.</p> <h4 id="process-tools-culture">Process / tools / culture</h4> <p>I've worked at two companies that both have effectively infinite resources to spend on tooling. One of them, let's call them ToolCo, is really serious about tooling and invests heavily in tools. People describe tooling there with phrases like “magical”, “the best I've ever seen”, and “I can't believe this is even possible”. And I can see why. For example, if you want to build a project that's millions of lines of code, their build system will make that take somewhere between 5s and 20s (assuming you don't enable <a href="https://en.wikipedia.org/wiki/Interprocedural_optimization">LTO</a> or anything else that can't be parallelized)<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">5</a></sup>. In the course of a regular day at work you'll use multiple tools that seem magical because they're so far ahead of what's available in the outside world.</p> <p>The other company, let's call them ProdCo, <a href="http://yosefk.com/blog/people-can-read-their-managers-mind.html">pays lip service to tooling, but doesn't really value it</a>. People describing ProdCo tools use phrases like “world class bad software” and “I am 2x less productive than I've ever been anywhere else”, and “I can't believe this is even possible”. ProdCo has a paper on a new build system; their claimed numbers for speedup from parallelization/caching, onboarding time, and reliability, are at least two orders of magnitude worse than the equivalent at ToolCo. And, in my experience, the actual numbers are worse than the claims in the paper. In the course of a day of work at ProdCo, you'll use multiple tools that are multiple orders of magnitude worse than the equivalent at ToolCo in multiple dimensions. These kinds of things add up and can easily make a larger difference than “hiring only the best”.</p> <p>Processes and culture also matter. I once worked on a team that didn't use version control or have a bug tracker.
For every no-brainer item on the <a href="http://www.joelonsoftware.com/articles/fog0000000043.html">Joel test</a>, there are teams out there that make the wrong choice.</p> <p>Although I've only worked on one team that completely failed the Joel test (they scored a 1 out of 12), every team I've worked on has had glaring deficiencies that are technically trivial (but sometimes culturally difficult) to fix. When I was at Google, we had really bad communication problems between the two halves of our team that were in different locations. My fix was brain-dead simple: I started typing up meeting notes for all of our local meetings and discussions and taking questions from the remote team about things that surprised them in our notes. That's something anyone could have done, and it was a huge productivity improvement for the entire team. I've literally never found an environment where you can't massively improve productivity with something that trivial. Sometimes people don't agree (e.g., it took months to get the non-version-control-using-team to use version control), but that's a topic for another post.</p> <p>Programmers are woefully underutilized at most companies. What's the point of <a href="//danluu.com/wat/">hiring &quot;the best&quot; and then crippling them</a>? You can get better results by hiring undistinguished folks and setting them up for success, and <a href="https://twitter.com/patio11/status/706884144538648576">it's a lot cheaper</a>.</p> <h3 id="conclusion">Conclusion</h3> <p>When I started programming, I heard a lot about how programmers are down to earth, not like those elitist folks who have uniforms involving suits and ties. You can even wear t-shirts to work! But if you think programmers aren't elitist, try wearing a suit and tie to an interview sometime. You'll have to go above and beyond to prove that you're not a bad cultural fit. We like to think that we're different from all those industries that judge people based on appearance, but we do the same thing, only instead of saying that people are a bad fit because they don't wear ties, we say they're a bad fit because they do, and instead of saying people aren't smart enough because they don't have the right pedigree… wait, that's exactly the same.</p> <p><strong>See also: <a href="//danluu.com/hiring-lemons/">developer hiring and the market for lemons</a></strong></p> <p><small>Thanks to Kelley Eskridge, Laura Lindzey, John Hergenroeder, Kamal Marhubi, Julia Evans, Steven McCarthy, Lindsey Kuper, Leah Hanson, Darius Bacon, Pierre-Yves Baccou, Kyle Littler, Jorge Montero, Sierra Rotimi-Williams, and Mark Dominus for discussion/comments/corrections.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:T"><p>This estimate is conservative. The math only works out to 78 hours if you assume that you never incorrectly reject a trendy candidate and that you don't have to interview candidates that you “correctly” fail to find good candidates. If you add in the extra time for those, the number becomes a lot larger. And if you're TrendCo, and you won't give senior ICs $200k/yr, let alone new grads, you probably need to multiply that number by at least a factor of 10 to account for the reduced probability that someone who's in high demand is going to take a huge pay cut to work for you.</p> <p>By the way, if you do some similar math you can see that the “no false positives” thing people talk about is bogus. The only way to reduce the risk of a false positive to zero is to not hire anyone. 
If you hire anyone, you're trading off the cost of firing a bad hire vs. the cost of spending engineering hours interviewing.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:P">I consider this to generally be a good practice, at least for folks like me who are relatively early in their careers. It's good to know what your options are, even if you don't exercise them. When I was at <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>, I did a round of interviews about once a year and those interviews made it very clear that I was lucky to be at Centaur. I got a lot more responsibility and a wider variety of work than I could have gotten elsewhere, I didn't have to deal with as much nonsense, and I was pretty well paid. I still did the occasional interview, though, and you should too! If you're worried about wasting the time of the hiring company, when I was interviewing speculatively, I always made it very clear that I was happy in my job and unlikely to change jobs, and most companies are fine with that and still wanted to go through with interviewing. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:M"><p>It's really not about me in particular. At the same time I couldn't get any company to talk to me, a friend of mine who's a much better programmer than me spent six months looking for work full time. He eventually got a job at Cloudflare, was half of the team that wrote their DNS, and is now one of the world's experts on DDoS mitigation for companies that don't have infinite resources. That guy wasn't even a networking person before he joined Cloudflare. He's a brilliant generalist who's created everything from a widely used JavaScript library to one of the coolest toy systems projects I've ever seen. He probably could have picked up whatever problem domain you're struggling with and knocked it out of the park. Oh, and between the blog posts he writes and the talks he gives, he's one of Cloudflare's most effective recruiters.</p> <p>Or Aphyr, one of the world's most respected distributed systems verification engineers, who <a href="https://twitter.com/aphyr/status/762042095330877440">failed to get responses to any of his job applications when he graduated from college less than a decade ago</a>.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:C"><p>I'm not going to do a literature review because there are just so many studies that link career earnings to external shocks, but I'll cite a result that I found to be interesting, <a href="http://citec.repec.org/d/eee/labeco/v_17_y_2010_i_2_p_303-316.html">Lisa Kahn's 2010 Labour Economics paper</a>.</p> <p>There have been a lot of studies that show, for some particular negative shock (like a recession), graduating into the negative shock reduces lifetime earnings. But most of those studies show that, over time, the effect gets smaller. When Kahn looked at national unemployment as a proxy for the state of the economy, she found the same thing. But when Kahn looked at <em>state level unemployment</em>, she found that the effect actually compounded over time.</p> <p>The overall evidence on what happens in the long run is equivocal. 
If you dig around, you'll find studies where earnings normalize after “only” 15 years, causing a large but effectively one-off loss in earnings, and studies where the effect gets worse over time. The results are mostly technically not contradictory because they look at different causes of economic distress when people get their first job, and it's possible that the differences in results are because the different circumstances don't generalize. But the “good” result is that it takes 15 years for earnings to normalize after a single bad setback. Even a very optimistic reading of the literature reveals that external events can and do have very large effects on people's careers. And if you want an estimate of the bound on the &quot;bad&quot; case, check out, for example, the <a href="http://www.nber.org/papers/w14278">Guiso, Sapienza, and Zingales paper that claims to link the productivity of a city today to whether or not that city had a bishop in the year 1000</a>.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B">During orientation, the back end of the build system was down so I tried building one of the starter tutorials on my local machine. I gave up after an hour when the build was 2% complete. I know someone who tried to build a real, large scale, production codebase on their local machine over a long weekend, and it was nowhere near done when they got back. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> Harry Potter and the Methods of Rationality review by su3su2u1 su3su2u1/hpmor/ Tue, 01 Mar 2016 00:00:00 +0000 su3su2u1/hpmor/ <p><em>These are archived from the now defunct <a href="http://su3su2u1.tumblr.com/">su3su2u1 tumblr</a>. Since there was some controversy over su3su2u1's identity, I'll note that I am not su3su2u1 and that hosting this material is neither an endorsement nor a sign of agreement.</em></p> <h3 id="harry-potter-and-the-methods-of-rationality-full-review">Harry Potter and the Methods of Rationality full review</h3> <p>I opened up a bottle of delicious older-than-me scotch when Terry Pratchett died, and I’ve been enjoying it for much of this afternoon, so this will probably be a mess and cleaned up later.</p> <p>Out of 5 stars, I’d give HPMOR a 1.5. Now, to the review (this is almost certainly going to be long)</p> <h4 id="the-good">The good</h4> <p>HPMOR contains some legitimately clever reworkings of the canon books to fit with Yudkowsky’s modified world:</p> <p>A few examples- In HPMOR, the “interdict of Merlin” prevents wizards from writing down powerful spells, so Slytherin put the Basilisk in the chamber of secrets to pass on his magical lore. The prophecy “the dark lord will mark him as his own” was met when Voldemort gave Hariezer the same grade he himself had received.</p> <p>Yudkowsky is also well read, and the story is peppered with references to legitimately interesting science. If you google and research every reference, you’ll learn a lot. The problem is that most of the in-story references are incorrect, so if you don’t google around you are likely to pick up dozens of incorrect ideas.</p> <p>The writing style during action scenes is pretty good. It keeps the pace moving and brisk and can be genuinely fun to read.</p> <h4 id="the-bad">The bad</h4> <h4 id="stilted-repetitive-writing">Stilted, repetitive writing</h4> <p>A lot of this story involves conversations that read like ham-fisted attempts at manipulation, filled with overly stilted language.
Phrases like “Noble and Most Ancient House,” “General of Sunshine,” “General of Chaos,” etc. are peppered in over and over again. It’s just turgid. It smooths out when events are happening, but things are rarely happening.</p> <h4 id="bad-ideas">Bad ideas</h4> <p>HPMOR is full of ideas I find incredibly suspect- the only character trait worth anything in the story (both implicitly and explicitly) is intelligence, and the primary use of intelligence within the story is manipulation. This leads to cloying levels of a sort of nerd elitism. Ron and Hagrid are basically dismissed out of hand in this story (Ron explicitly as being useless, Hagrid implicitly so) because they aren’t intelligent enough, and Hariezer explicitly draws implicit NPC vs real-people distinctions.</p> <p>The world itself is constructed to back up these assertions- nothing in the wizarding world makes much sense, and characters often behave in silly ways (“like NPCs”) to be a foil for Hariezer.</p> <p>The most ridiculous example of this is that the wizarding world justice is based on two cornerstones- politicians decide guilt or innocence for all wizard crimes, and the system of blood debts. All of the former death eaters who were pardoned (for claiming to be imperius cursed) apparently owe a blood debt to Hariezer, and so as far as wizarding justice is concerned he is above the law. He uses this to his advantage at a trial for Hermione.</p> <h4 id="bad-pedagogy">Bad pedagogy</h4> <p>Hariezer routinely flubs the scientific concepts the reader is supposed to be learning. Almost all of the explicit in-story science references are incorrect, as well as being overly jargon-filled.</p> <p>Some of this might be on purpose- Hariezer is supposed to be only 11. However, this is terrible pedagogy. The reader’s guide to rationality is completely unreliable. Even weirder, the main antagonist, Voldemort, is also used as author mouthpiece several times. So the pedagogy is wrong at worst, and completely unreliable at best.</p> <p>And implicitly, the method Hariezer relies on for the majority of his problem solving is Aristotelian science. He looks at things, thinks real hard, and knows the answer. This is horrifyingly bad implicit pedagogy.</p> <h4 id="bad-plotting">Bad plotting</h4> <p>Over the course of the story, Hariezer moves from pro-active to no-active. At the start of the story he has a legitimate positive agenda- he wants to use science to uncover the secrets of magic. As the story develops, however, he completely loses sight of that goal, and he instead becomes just a passenger in the plot- he competes in Quirrell’s games and goes through school like any other student. When Voldemort starts including Hariezer in his plot, Hariezer floats along in a completely reactive way, etc.</p> <p>Not until Hermione dies, near the end of the story, does Hariezer pick up a positive goal again (ending death) and he does absolutely nothing to achieve it. He floats along reacting to everything, and Voldemort defeats death and revives Hermione with no real input from Hariezer at all.</p> <p>For a character who is supposed to be full of agency, he spends very little time exercising it in a proactive way.</p> <h4 id="nothing-has-consequences-boring">Nothing has consequences (boring!)</h4> <p>And this brings me to another problem with the plotting- nothing in this story has any consequences. Nothing that goes wrong has any lasting implications for the story at all, which makes all the events at hand ultimately boring.
Several examples- early in the story Hariezer uses his time turner to solve even the simplest problems. Snape is asking you questions about potions you don’t know? Time travel. Bullies are stealing a meaningless trinket? Time travel, etc. As a result of these rule violations, his time turner is locked down by Professor McGonagall. Despite this, Hariezer continues to use his time turner to solve all of his problems- the plot introduces another student willing to send a time turner message for a small amount of money via “slytherin mail”; it’s even totally anonymous.</p> <p>Another egregious example of this is Quirrell’s battle game- the prize for the battle game is handed out by Quirrell in chapter 35 or so, and there are several more battle games after the prize! The reader knows that it doesn’t at all matter who wins these games- the prize is already awarded! What’s the point? The reader knows the prize has been given out, why are they invested in the proceedings at all?</p> <p>When Hariezer becomes indebted to Luscious Malfoy, it never constrains him in any way. He goes into debt, Dumbledore tells him it’s bad, he does literally nothing to deal with the problem. Two weeks later, Hermione dies and the debt gets cancelled.</p> <p>When Hermione DIES Hariezer does nothing, and a few weeks later Voldemort brings her back. Nothing that happens ever matters.</p> <p>The closest thing to long term repercussions is Hariezer helping Bellatrix Black escape- but we literally never see Bellatrix after that.</p> <p>Hariezer never acts positively to fix his problems, he just bounces along whining about how humans need to defeat death until his problems get solved for him.</p> <h4 id="mystery-dramatic-irony-and-genre-savvy">Mystery, dramatic irony and genre savvy</h4> <p>If you’ve read the canon books, you know at all times what is happening in the story. Voldemort has possessed Quirrell, Hariezer is a horcrux, Quirrell wants the philosopher’s stone, etc. There are bits and pieces that are modified, but the shape of the story is exactly canon. So all the mystery is just dramatic irony.</p> <p>This is fine, as far as it goes, but there is a huge amount of tension because Hariezer is written as “genre savvy” and occasionally says things like “the hero of story such-and-such would do this” or “I understand mysterious prophecies from books.” The story is poking at cliches that the story wholeheartedly embraces. Supposedly Hariezer has read enough books just like this that dramatic irony like this shouldn’t happen; as the story points out many times, he should be just as informed as the reader.
AND YET…</p> <p>The author is practically screaming “wouldn’t it be lazy that Harry’s darkside is because he is a horcrux?” And yet, Harry’s darkside is because he is a horcrux.</p> <p>Even worse, the narration of the book takes lots of swipes at the canon plots while “borrowing” the plot of the books.</p> <h4 id="huge-tension-between-the-themes-lessons-and-the-setting">Huge tension between the themes/lessons and the setting</h4> <p>The major themes of this book are in major conflict with the setting throughout the story.</p> <p>One major theme is the need for secretive science to hide dangerous secrets- it’s echoed in the way Hariezer builds his “bayesian conspiracy,” reinforced by Hariezer and Quirrell’s attitudes toward nuclear weapons (and their explicit idea that people smart enough to build atomic weapons wouldn’t use them), and it’s reinforced at the end of the novel when Hariezer’s desire to dissolve some of the secrecy around magic is thwarted by a vow he took to not-end-the-world.</p> <p>Unfortunately, that same secrecy is portrayed as having stagnated the progress of the wizarding world, and preventing magic from spreading. That same secrecy might well be why the wizarding world hasn’t already ended death and made thousands of philosopher’s stones.</p> <p>Another major theme is fighting death/no-afterlife. But this is a fantasy story with magic. There are ghosts, a gate to the afterlife, a stone to talk to your dead loved ones, etc. The story tries to lampshade it a bit, but that fundamental tension doesn’t go away. Some readers even assumed that Hariezer was simply wrong about an afterlife in the story- because they felt the tension and used my point above (unreliable pedagogy) to put the blame on Hariezer. In the story, the character who actually ended death WAS ALSO THE ANTAGONIST. Hariezer’s attempts are portrayed AS SO DANGEROUS THEY COULD END THE WORLD.</p> <p>And finally- the major theme of this story is the supremacy of Bayesian reasoning. Unfortunately, as nostalgebraist pointed out explicitly, a world with magic is a world where your non-magic based Bayesian prior is worthless. Reasoning time and time again from that prior leads to snap conclusions unlikely to be right- and yet in the story this works time and time again. Once again, the world is fighting the theme of the story in obvious ways.</p> <h4 id="let-s-talk-about-hermoine">Let’s talk about Hermione</h4> <p>The most explicitly feminist arc in this story is the arc where Hermione starts SPHEW, a group dedicated to making more wizarding heroines. The group starts out successful, gets in over their heads, and Hariezer has to be called in to save the day (with the help of Quirrell).</p> <p>At the end of the arc, Hariezer and Dumbledore have a long conversation about whether or not they should have let Hermione and friends play their little bully fighting game- which feels a bit like retroactively removing the characters’ agency. Sure, the women got to play at their fantasy, but only at the whim of the real heroes.</p> <p>By the end of the story, Hermione is an indestructible part-unicorn/part-troll immortal. And what is she going to do with this power? Become Hariezer’s lab assistant, more or less. Be sent on quests by him.
It just feels like Hermione isn’t really allowed to grow into her own agency in a meaningful way.</p> <p>This isn’t to say that it’s intentional (pretty much the only character with real, proactive agency in this story is Quirrell) - but it does feel like women get the short end of the stick here.</p> <h4 id="sanderson-s-law-of-magic">Sanderson’s law of magic</h4> <p>So I’ve never read Sanderson, but someone pointed me to his first law of magic</p> <p>Sanderson’s First Law of Magics: An author’s ability to solve conflict with magic is DIRECTLY PROPORTIONAL to how well the reader understands said magic. The idea here is that if your magic is laid out with clear rules, the author should feel free to solve problems with it- if your magic is mysterious and vague like Gandalf’s, you shouldn’t solve all the conflict with magic, but if you lay out careful rules you can have the characters magic up the occasional solution. I’m not sure I buy into the rule fully, but it does make a good point- if the reader doesn’t understand your magic the solution might feel like it comes out of nowhere.</p> <p>Yudkowsky never clearly lays out most of the rules of magic, and yet still solves all his problems via magic (and magic mixed with science). We don’t know how brooms work, but apparently if you strap one to a rocket you can actually steer the rocket, you won’t fall off the thing, and you can go way faster than other broomsticks.</p> <p>This became especially problematic when he posted his final exam- lots of solutions were floated around, each of which relied on some previously ill-defined aspect of the magic. Yudkowsky’s own solution relied on previously ill-defined transfiguration.</p> <p>And when he isn’t solving problems like that, he is relying on the time turner over and over again. Swatting flies with flame throwers over and over again.</p> <p>Coupled with the world being written as “insane”, it just feels like lazy conflict resolution.</p> <h4 id="conclusion">Conclusion</h4> <p>A largely forgettable, overly long nerd power fantasy, with a bit of science (most of it wrong) and a lot of bad ideas. 1.5 stars.</p> <p><em>Individual chapter reviews below</em>.</p> <h3 id="hpmor-1">HPMOR 1</h3> <p>While at lunch, I dug into the first chapter of HPMOR. A few notes:</p> <p>This isn’t nearly as bad as I remember, the writing isn’t amazing but it’s serviceable. Either some editing has taken place in the last few years, or I’m less discerning than I once was.</p> <p>There is this strange bit, where Harry tries to defuse an argument his parents are having with:</p> <blockquote> <p>“Mum,” Harry said. “If you want to win this argument with Dad, look in chapter two of the first book of the Feynman Lectures on Physics. There’s a quote there about how philosophers say a great deal about what science absolutely requires, and it is all wrong, because the only rule in science is that the final arbiter is observation - that you just have to look at the world and report what you see.”</p> </blockquote> <p>This seems especially out of place, because no one is arguing about what science is.</p> <p>Otherwise, this is basically an ok little chapter. Harry and his father are skeptical magic could exist, so send a reply letter to Hogwarts asking for a professor to come and show them some magics.</p> <h3 id="hpmor-2-in-which-i-remember-why-i-hated-this">HPMOR 2: in which I remember why I hated this</h3> <p>This chapter had me rolling my eyes so hard that I now have quite the headache.
In this chapter, Professor McGonagall shows up and does some magic, first levitating Harry's father, and then turning into a cat. Upon seeing the first, Harry drops some Bayes, saying how anticlimactic it was 'to update on an event of infinitesimal probability'; upon seeing the second, Hariezer Yudotter greets us with this jargon dump:</p> <p>"You turned into a cat! A SMALL cat! You violated Conservation of Energy! That's not just an arbitrary rule, it's implied by the form of the quantum Hamiltonian! Rejecting it destroys unitarity and then you get FTL signalling!"</p> <p>First, this is obviously atrocious writing. Most readers will get nothing out of this horrific sentence. He even abbreviated faster-than-light as FTL, to keep the density of understandable words to a minimum.</p> <p>Second, this is horrible physics for the following reasons:</p> <ul> <li>the levitation already violated conservation of energy, which you found anticlimactic (fuck you, Hariezer)</li> <li>the deep area of physics concerned with conservation of energy is not quantum mechanics, it's thermodynamics. Hariezer should have had a jargon dump about perpetual motion machines. To see how levitation violates conservation of energy, imagine taking a generator like the Hoover dam and casting a spell to levitate all the water from the bottom of the dam back up to the top to close the loop. As long as you have a wizard to move the water, you can generate power forever. Exercise for the reader- devise a perpetual motion machine powered by shape changers (hint: imagine an elevator system of two carts hanging over a pulley. On one side, an elephant, on the other a man. Elephant goes down, man goes up. At the bottom, the elephant turns into a man and at the top the man turns into an elephant. What happens to the pulley over time?)</li> <li>the deeper area related to conservation of energy is not unitarity, as is implied in the quote. There is a really deep theorem in physics, due to Emmy Noether, that tells us that conservation of energy really means that physics is time-translationally invariant. This means there aren't special places in time, the laws tomorrow are basically the same as yesterday and today. (tangential aside- this is why we shouldn't worry about a lack of energy conservation at the big bang, if the beginning of time was a special point, no one would expect energy to be conserved there). Unitarity in quantum mechanics is basically a fancy way of saying probability is conserved. You CAN have unitarity without conservation of energy. Technical aside- it's easy to show that if the unitary operator is time-translation invariant, there is an operator that commutes with the unitary operator, usually called the hamiltonian (sketched just below the list). Without that assumption, we lose the hamiltonian but maintain unitarity.</li> <li>None of this has much to do at all with faster than light signalling, which would be the least of our concern if we had just discovered a source of infinite energy.</li> </ul>
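<p>To spell out that technical aside, here is a minimal sketch of the textbook relationship (this is just standard quantum mechanics and Stone's theorem, nothing specific to the fic):</p> <pre><code>\text{Time-translation invariance: } U(t_1)\,U(t_2) = U(t_1 + t_2)
\;\Longrightarrow\; U(t) = e^{-iHt} \text{ for a self-adjoint } H \text{ (the Hamiltonian)},
\text{ and } [H,\,U(t)] = 0, \text{ so } \langle H\rangle \text{ (the energy) is conserved.}

\text{Unitarity by itself only says } U^{\dagger}(t)\,U(t) = \mathbb{1}
\;\Longleftrightarrow\; \lVert U(t)\,\psi\rVert = \lVert\psi\rVert
\text{ (probability is conserved).}

\text{With a time-dependent } H(t), \text{ the evolution is still unitary,}
\text{ but there is no longer a conserved energy.}</code></pre>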
<p>I used to teach undergraduates, and I would often have some enterprising college freshman (who coincidentally was not doing well in basic mechanics) approach me to talk about why string theory was wrong. It always felt like talking to a physics madlibs book. This chapter let me relive those awkward moments.</p> <p>Sorry to belabor this point so much, but I think it sums up an issue that crops up from time to time in Yudkowsky's writing: when dabbling in a subject he doesn't have much grounding in, he ends up giving actual subject matter experts a headache.</p> <p>Summary of the chapter- McGonagall visits and does some magic, Harry is convinced magic is real, and they are off to go shop for Harry's books.</p> <h3 id="never-read-comments">Never Read Comments</h3> <p>I read the comments on an HPMOR chapter, which I recommend strongly against. I wish I could talk to several of the commentators, and gently talk them out of a poor financial decision.</p> <p>Poor, misguided random internet person- your donation to MIRI/LessWrong will not help save the world. Even if you grant all their (rather silly) assumptions, MIRI is a horribly unproductive research institute- in more than a decade, it has published fewer peer-reviewed papers than the average physics graduate student does while in grad school. The majority of money you donate to MIRI will go into the generation of blog posts and fan fiction. If you are fine with that, then go ahead and spend your money, but don't buy into the idea that this money will save the world.</p> <h3 id="hpmor-3-uneventful-inoffensive">HPMOR 3: uneventful, inoffensive</h3> <p>This chapter is worse than the previous chapters. As Hariezer (I realize this portmanteau isn't nearly as clever as I seem to think it is, but I will continue to use it) enters Diagon Alley, he remarks</p> <blockquote> <p>It was like walking through the magical items section of an Advanced Dungeons and Dragons rulebook (he didn't play the game, but he did enjoy reading the rulebooks).</p> </blockquote> <p>For reasons not entirely clear to me, the line filled me with rage.</p> <p>As they walk, McGonagall tells Hariezer about Voldemort, noting that other countries failed to come to Britain's aid. This prompts Hariezer to immediately misuse the idea of the Bystander Effect (an exercise left to the reader- do social psychological phenomena that apply to individuals also apply to collective entities, like countries? Are the social-psychological phenomena around failure to act in people likely to also explain failure to act as organizations?).</p> <p>That's basically it for this chapter. Uneventful chapter- slightly misused scientific stuff, a short walk through Diagon Alley, standard Voldemort stuff. The chapter ends with some very heavy-handed foreshadowing:</p> <blockquote> <p>(And somewhere in the back of his mind was a small, small note of confusion, a sense of something wrong about that story; and it should have been a part of Harry's art to notice that tiny note, but he was distracted. For it is a sad rule that whenever you are most in need of your art as a rationalist, that is when you are most likely to forget it.)</p> </blockquote> <p>If Harry had only attended more CFAR workshops…</p> <h3 id="hpmor-4-in-which-for-the-first-time-i-wanted-the-author-to-take-things-further">HPMOR 4: in which, for the first time, I wanted the author to take things further</h3> <p>So first, I actually like this chapter more than the previous few, because I think it's beginning to try to deliver on what I want in the story. And now, my bitching will commence:</p> <p>A recurring theme of the LessWrong sequences that I find somewhat frustrating is that (apart from the Bayesian Rationalist) the world is insane.
This same theme pops up in this MOR chapter, where the world is created insane by Yudkowsky, so that Hariezer can tell you why.</p> <p>Upon noticing the wizarding world uses coins of silver and gold, Hariezer asks about exchange rates, and asks the bank goblin how much it would cost to get a big chunk of silver turned into coins; the goblin says he'll check with his superiors, Hariezer asks him to estimate, and the estimate is that the fee is about 5% of the silver.</p> <p>This prompts Hariezer to realize that he could do the following:</p> <ol> <li>Take gold coins and buy silver with them in the muggle world</li> <li>bring the silver to Gringotts and have it turned into coins</li> <li>convert the silver coins to gold coins, ending up with more gold than you started with, and start the loop over until the muggle prices make it no longer profitable</li> </ol> <p>(of course, the in-story explanation is overly jargon-filled as usual)</p>
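<p>To make the arbitrage concrete, here is a quick sketch of the loop. The exchange rates and the muggle silver price below are made-up placeholders, not numbers from the story; the point is just that any round-trip multiplier above 1 compounds until prices move (or the goblins notice):</p> <pre><code>
# Hypothetical numbers, purely for illustration.
GALLEONS_PER_SICKLE = 1 / 17   # wizard exchange rate
SILVER_PER_GALLEON = 20        # sickle-weights of muggle silver one galleon's gold buys (assumed)
MINT_FEE = 0.05                # the 5% fee quoted by the goblin

def one_round_trip(galleons):
    """Gold to muggle silver to freshly minted sickles and back to gold."""
    silver = galleons * SILVER_PER_GALLEON
    sickles = silver * (1 - MINT_FEE)   # assume one sickle minted per sickle-weight of silver
    return sickles * GALLEONS_PER_SICKLE

galleons = 100.0
for trip in range(5):
    galleons = one_round_trip(galleons)
    print(f"after trip {trip + 1}: {galleons:.1f} galleons")
# With these placeholder rates each trip multiplies your gold by about 1.12,
# so the loop compounds until the muggle silver price catches up.
</code></pre>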
<p>This is somewhat interesting, and it's the first look at what I want in a story like this- the little details of the wizarding world that would never be covered in a children's story. Stross wrote a whole book exploring money/economics in a far future society (Neptune's Brood, it's only ok); there is a lot of fertile ground for Yudkowsky here.</p> <p>In a world where wizards can magic wood into gold, how do you keep counterfeiting at bay? Maybe the coins are made of special gold only goblins know how to find (maybe the goblin hordes hoard (wordplay!) this special gold like De Beers hoards diamonds).</p> <p>Maybe the goblins carefully magic money into and out of existence in order to maintain a currency peg. Maybe it's the perfect inflation- instead of relying on banks to disperse the coins, every now and then the coins in people's pockets just multiply at random.</p> <p>Instead, we get a silly, insane system (don't blame Rowling either- Yudkowsky is more than willing to go off book, AND the details of this simply aren't discussed, for good reason, in the genre Rowling wrote the books in), and rationalist Hariezer gets an easy 'win'. It's not a BAD section, but it feels lazy.</p> <p>And a brief note on the writing style- it's still oddly stilted, and I wonder how good it would be at explaining ideas to someone unfamiliar. For instance, Hariezer gets lost in thought, McGonagall says something, and Hariezer replies:</p> <blockquote> <p>"Hm?" Harry said, his mind elsewhere. "Hold on, I'm doing a Fermi calculation." "A what?" said Professor McGonagall, sounding somewhat alarmed. "It's a mathematical thing. Named after Enrico Fermi. A way of getting rough numbers quickly in your head…"</p> </blockquote> <p>Maybe it would feel less awkward for Hariezer to say "Hold on, I'm trying to estimate how much gold is in the vault." And then instead of saying "it's a math thing," we could follow Hariezer's thoughts as he carefully constructs his estimate (as it is, the estimate is crammed into a hard-to-read paragraph).</p> <p>It's a nitpick, sure, but the story thus far is loaded with such nits.</p> <p>Chapter summary- Harry goes to Gringotts, takes out money.</p> <h3 id="hpmor-5-in-which-the-author-assures-us-repeatedly-this-chapter-is-funny">HPMOR 5: in which the author assures us repeatedly this chapter is funny</h3> <p>This chapter is, again, mostly inoffensive, although there is a weird tonal shift. The bulk of this chapter is played broadly for laughs. There is actually a decent description of the fundamental attribution error, although it's introduced with this twerpy bit of dialogue:</p> <p>Harry looked up at the witch-lady's strict expression beneath her pointed hat, and sighed. "I suppose there's no chance that if I said fundamental attribution error you'd have any idea what that meant."</p> <p>This sort of thing seems like awkward pedagogy. If the reader doesn't know it, Hariezer is now exasperated with the reader as well as with whoever Yudkowsky is currently using as a foil.</p> <p>Now, the bulk of this chapter involves Hariezer being left alone to buy robes, where he meets and talks to Draco Malfoy. Hariezer, annoyed at having people say to him "OMG, YOU ARE HARRY POTTER!" upon meeting him, exclaims, upon learning Malfoy's name, "OMG, YOU ARE DRACO MALFOY!". Malfoy accepts this as a perfectly normal reaction to his imagined fame, and a mildly amusing conversation occurs. It's a fairly clever idea.</p> <p>Unfortunately, it's marred by the literary equivalent of a sitcom laugh track. Worried that the reader isn't sure if they should be laughing, Yudkowsky interjects phrases like these throughout:</p> <blockquote> <p>Draco's attendant emitted a sound like she was strangling but kept on with her work</p> <p>One of the assistants, the one who'd seemed to recognise Harry, made a muffled choking sound.</p> <p>One of Malkin's assistants had to turn away and face the wall.</p> <p>Madam Malkin looked back silently for four seconds, and then cracked up. She fell against the wall, wheezing out laughter, and that set off both of her assistants, one of whom fell to her hands and knees on the floor, giggling hysterically.</p> </blockquote> <p>The reader is constantly told that the workers in the shop find it so funny they can barely contain their laughter. It feels like the author constantly yelling GET IT YOU GUYS? THIS IS FUNNY!</p> <p>As far as the writing goes, the tonal shift to broad comedy feels a bit strange and happens with minimal warning (there is a brief conversation earlier in the chapter that's also played for a laugh), and everything is as stilted as it's always been. For example, when McGonagall walks into the robe shop in time to hear Malfoy utter some absurdities, Harry tells her</p> <blockquote> <p>"He was in a situational context where those actions made internal sense -"</p> </blockquote> <p>Luckily, Hariezer gets cut off before he starts explaining what a joke is.</p> <p>Chapter summary- Hariezer buys robes, talks to Malfoy.</p> <h3 id="hpmor-6-yud-lets-it-all-hang-out">HPMOR 6: Yud lets it all hang out</h3> <p>The introduction suggested that the story really gets moving after chapter 5. If this is an example of what "really moving" looks like, I fear I'll soon stop reading. Apart from my rant about chapter 2, things had been largely light and inoffensive up until this chapter. Here, I found myself largely recoiling. We shift from the broad comedy of the last chapter to a chapter filled with weirdly dark little rants.</p> <p>As should be obvious by now, I find the line between Eliezer and Harry to be pretty blurry (hence my annoying use of Hariezer). In this chapter, that line disappears completely as we get passages like this:</p> <blockquote> <p>Harry had always been frightened of ending up as one of those child prodigies that never amounted to anything and spent the rest of their lives boasting about how far ahead they'd been at age ten. But then most adult geniuses never amounted to anything either.
There were probably a thousand people as intelligent as Einstein for every actual Einstein in history. Because those other geniuses hadn't gotten their hands on the one thing you absolutely needed to achieve greatness. They'd never found an important problem.</p> </blockquote> <p>There are dozens of such passages that could be ripped directly from some of Hariezer's friendly AI writing and pasted right into MOR. It's a bit disconcerting, in part because it's forcing me to face just how much of Eliezer's other writing I've wasted time with.</p> <p>The chapter begins strongly enough: Hariezer starts doing some experiments with his magic pouch. If he asks for 115 gold coins, it comes, but not if he asks for 90+25 gold coins. He tries using other words for gold in other languages, etc. Unfortunately, it leads him to say this:</p> <blockquote> <p>"I just falsified every single hypothesis I had! How can it know that 'bag of 115 Galleons' is okay but not 'bag of 90 plus 25 Galleons'? It can count but it can't add? It can understand nouns, but not some noun phrases that mean the same thing?…The rules seem sorta consistent but they don't mean anything! I'm not even going to ask how a pouch ends up with voice recognition and natural language understanding when the best Artificial Intelligence programmers can't get the fastest supercomputers to do it after thirty-five years of hard work,"</p> </blockquote> <p>So here is the thing- it would be very easy to write a parser that behaves exactly like what Hariezer describes with his bag. You would just have a look-up table with lots of single words for gold in various languages. Nothing fancy at all. It's behaving oddly ENTIRELY BECAUSE IT'S NOT DOING NATURAL LANGUAGE. I hope we revisit the pouch in a later chapter to sort this out. I reiterate, it's stuff like this that (to me at least) was the whole premise of this story- flesh out the rules of this wacky universe.</p>
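<p>A toy sketch of that kind of parser, just to show how little machinery it takes (the word list is obviously made up, and the pouch's real rules are never spelled out; the point is that a lookup table "understands" single words and fails on phrases for free):</p> <pre><code>
# A dumb lookup-table "parser": an optional leading number plus a single
# known word for gold. No grammar, no arithmetic, no language understanding.
GOLD_WORDS = {"gold", "galleons", "oro", "aurum", "kin"}   # hypothetical list

def pouch_request(phrase):
    words = phrase.lower().split()
    if words and words[0].isdigit():          # "115 galleons" -> count of 115
        count, rest = int(words[0]), words[1:]
    else:                                     # bare "gold" -> one coin
        count, rest = 1, words
    if len(rest) == 1 and rest[0] in GOLD_WORDS:
        return count                          # dispense this many coins
    return None                               # anything fancier just fails

print(pouch_request("115 galleons"))          # 115
print(pouch_request("90 plus 25 galleons"))   # None: "plus" isn't in the table
print(pouch_request("aurum"))                 # 1
</code></pre>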
<p>Immediately after this, the story takes a truly bizarre turn. Hariezer spots a magic first aid kit, and wants to buy it. In order to be a foil for super-rationalist Harry, McGonagall then immediately becomes immensely stupid, and tries to dissuade him from purchasing it. Note, she doesn't persuade him by saying "Oh, there are magical first aid kits all over the school," or "there are wizards watching over the boy who lived who can heal you with spells if something happens" or anything sensible like that, she just starts saying he'd never need it.</p> <p>This leads Harry to a long description of the planning fallacy, and he says to counter it he always tries to assume the worst possible outcomes. (Note to Harry and the reader: the planning fallacy is a specific thing that occurs when people or organizations plan to accomplish a task. What Harry is trying to overcome is more correctly optimism bias.)</p> <p>This leads McGonagall to start lightly suggesting (apropos of nothing) that maybe Harry is an abused child. Hariezer responds with this tale:</p> <blockquote> <p>&quot;There'd been some muggings in our neighborhood, and my mother asked me to return a pan she'd borrowed to a neighbor two streets away, and I said I didn't want to because I might get mugged, and she said, 'Harry, don't say things like that!' Like thinking about it would make it happen, so if I didn't talk about it, I would be safe. I tried to explain why I wasn't reassured, and she made me carry over the pan anyway. I was too young to know how statistically unlikely it was for a mugger to target me, but I was old enough to know that not-thinking about something doesn't stop it from happening, so I was really scared." … I know it doesn't sound like much," Harry defended. "But it was just one of those critical life moments, you see? … That's when I realised that everyone who was supposed to protect me was actually crazy, and that they wouldn't listen to me no matter how much I begged them.</p> </blockquote> <p>So we are back to the world is insane, as filtered through this odd little story.</p> <p>Then McGonagall asks if Harry wants to buy an owl, and Harry says no, he'd be too worried he'd forget to feed it or something. Which prompts McGonagall AGAIN to suggest Harry had been abused, which leads Harry into an odd rant about how false accusations of child abuse ruin families (which is true, but seriously, is this the genre for this rant? What the fuck is happening with this chapter?). This ends up with McGonagall implying Harry must have been abused because he is so weird, and maybe someone cast a spell to wipe his memory of it (the spell comes up after Harry suggests repressed memories are BS pseudoscience, which again, is true, BUT WHY IS THIS HAPPENING IN THIS STORY?).</p> <p>Harry uses his 'rationalist art' (literally "Harry's rationalist skills begin to boot up again") to suggest an alternative explanation:</p> <blockquote> <p>&quot;I'm too smart, Professor. I've got nothing to say to normal children. Adults don't respect me enough to really talk to me. And frankly, even if they did, they wouldn't sound as smart as Richard Feynman, so I might as well read something Richard Feynman wrote instead. I'm isolated, Professor McGonagall. I've been isolated my whole life. Maybe that has some of the same effects as being locked in a cellar. And I'm too intelligent to look up to my parents the way that children are designed to do. My parents love me, but they don't feel obliged to respond to reason, and sometimes I feel like they're the children - children who won't listen and have absolute authority over my whole existence. I try not to be too bitter about it, but I also try to be honest with myself, so, yes, I'm bitter.</p> </blockquote> <p>After that weird back and forth, the chapter moves on: Harry goes and buys a wand, and then from conversation begins to suspect that Voldemort might still be alive. When McGonagall doesn't want to tell him more, "a terrible dark clarity descended over his mind, mapping out possible tactics and assessing their consequences with iron realism."</p> <p>This leads Hariezer to blackmail McGonagall- he won't tell people Voldemort is still alive if she tells him about the prophecy. It's another weird bit in a chapter absolutely brimming with weird bits.</p> <p>Finally they go to buy a trunk, but they are low on gold (note to the reader: here would have been an excellent example of the planning fallacy). But luckily Hariezer had taken extra from the vault. Rather than simply saying "oh, I brought some extra", he says:</p> <blockquote> <p>So - suppose I had a way to get more Galleons from my vault without us going back to Gringotts, but it involved me violating the role of an obedient child. Would I be able to trust you with that, even though you'd have to step outside your own role as Professor McGonagall to take advantage of it?</p> </blockquote> <p>So he logic-chops her into submission, or whatever, and they buy the trunk.</p> <p>This chapter for me was incredibly uncomfortable.
McGonagall behaves very strangely so she can act as a foil for all of Hariezer’s rants, and when the line between Hariezer and Eliezer fell away completely, it felt a bit oddly personal.</p> <p>Oh, right, there was also a conversation about the rule against underage magic</p> <p>&quot;Ah,” Harry said. “That sounds like a very sensible rule. I’m glad to see the wizarding world takes that sort of thing seriously.”</p> <p>I can’t help but draw parallels to the precautions Yud wants with AI.</p> <p>Summary: Harry finished buying school supplies(I hope).</p> <h3 id="hpmor-7-uncomfortable-elitism-and-rape-threats">HPMOR 7: Uncomfortable elitism, and rape threats</h3> <p>A brief warning: Like always I’m typing this thing on my phone, so strange spell-check driven typos almost certainly abound. However, I’m also pretty deep in my cups (one of the great privileges of leaving academia is that I can afford to drink Lagavulin more than half my age like its water. The downside is no longer get to teach and so must pour my wisdom out in the form of a critique of a terrible fan fiction, that all of one person is probably reading)</p> <p>This chapter took the weird tonal shift from the last chapter and just ran with it.</p> <p>We are finally heading toward Hogwarts,so the chapter opens with the classic platform 9 3/4 bit from the book. And then it takes an uncomfortable elitist tone: Harry asks Ron Weasley to call him “Mr. Spoo” so that he can remain incognito, and Ron, a bit confused says “Sure Harry.” That one slip up allows Hariezer to immediately peg Ron as an idiot. In the short conversation that follows he mentally thinks of Ron as stupid several times and then he tries to explain to Ron why Quidditch is a stupid game.</p> <p>It is my understanding from a (rather loose reading of the) books, that like cricket, quidditch games last weeks, EVEN MONTHS. In a game lasting literally weeks, one team could conceivably be up by 15 goals. In one of the books, I believe an important match went against the team that caught the snitch in one of the books. This is not to entirely defend quidditch, but it doesn’t HAVE to be an easy target. I think part of the ridicule that quidditch gets is that non-British/non-Indian audiences are perhaps not capable of appreciating that there are sports (cricket) that are played out over weeks that are very high scoring.</p> <p>Either way, the WAY that Hariezer attacks quidditch is at expense of Ron, and it feels like a nerd sneering at a jock for liking sports. But thats just the lead up to the cloying nerd-elitism. Draco comes over, Hariezer is quick to rekindle that budding friendship, and we get the following conversation about Ron:</p> <blockquote> <p>If you didn’t like him,” Draco said curiously, “why didn’t you just walk away?” &quot;Um… his mother helped me figure out how to get to this platform from the King’s Cross Station, so it was kind of hard to tell him to get lost. And it’s not that I hate this Ron guy,” Harry said, “I just, just…” Harry searched for words. &quot;Don’t see any reason for him to exist?&quot; offered Draco. &quot;Pretty much.</p> </blockquote> <p>Just cloying, uncomfortable levels of nerd-elitism.</p> <p>Now that Hariezer and Draco are paired back up, they can have lot of uncomfortable conversations. First, Draco shares something only slightly personal, which leads to this</p> <blockquote> <p>&quot;Why are you telling me that? It seems sort of… private…” Draco gave Harry a serious look. 
“One of my tutors once said that people form close friendships by knowing private things about each other, and the reason most people don’t make close friends is because they’re too embarrassed to share anything really important about themselves.” Draco turned his palms out invitingly. “Your turn?”</p> </blockquote> <p>Hariezer consider this a masterful use of the social psychology idea of reciprocity (which just says if you do something for someone, they’re likely to do it for you). Anyway, this exchange is just a lead up to this, which feels like shock value for no reason:</p> <blockquote> <p>&quot;Hey, Draco, you know what I bet is even better for becoming friends than exchanging secrets? Committing murder.&quot; &quot;I have a tutor who says that,&quot; Draco allowed. He reached inside his robes and scratched himself with an easy, natural motion. &quot;Who’ve you got in mind?&quot; Harry slammed The Quibbler down hard on the picnic table. “The guy who came up with this headline.” Draco groaned. “Not a guy. A girl. A ten-year-old girl, can you believe it? She went nuts after her mother died and her father, who owns this newspaper, is convinced that she’s a seer, so when he doesn’t know he asks Luna Lovegood and believes anything she says.” … Draco snarled. “She has some sort of perverse obsession about the Malfoys, too, and her father is politically opposed to us so he prints every word. As soon as I’m old enough I’m going to rape her.”</p> </blockquote> <p>So, Hariezer is joking about the murder (its made clear later), but WHAT THE FUCK IS HAPPENING? These escalating friendship-tests feel contrived, reciprocity is effective when you don’t make demands immediately, which is why when you get a free sample at the grocery store the person at the counter doesn’t say “did you like that? Buy this juice or we won’t be friends anymore.” This whole conversation feels ham fisted, Hariezer is consistently telling us about all the manipulative tricks they are both using. Its less a conversation and more people who just sat through a shitty marketing seminar trying to try out what they learned. WITH RAPE.</p> <p>After that, Draco has a whole spiel about how the legal system of the wizard world is in the pocket of the wealthy, like the Malfoys, which prompts Hariezer to tell us that only countries descended from the enlightenment have law-and-order (and I take it from comments that originally there was some racism somewhere in here that has since been edited out). Note: the wizarding world HAS LITERAL MAGIC TRUTH POTIONS, but we are to believe our enlightenment legal system works better? This seems like an odd, unnecessary narrative choice.</p> <p>Next, Hariezer tries to recruit Draco to the side of science with this:</p> <blockquote> <p>Science doesn’t work by waving wands and chanting spells, it works by knowing how the universe works on such a deep level that you know exactly what to do in order to make the universe do what you want. If magic is like casting Imperio on someone to make them do what you want, then science is like knowing them so well that you can convince them it was their own idea all along. It’s a lot more difficult than waving a wand, but it works when wands fail, just like if the Imperiusfailed you could still try persuading a person.</p> </blockquote> <p>I’m not sure why you’d use persuasion/marketing as a shiny metaphor for science, other than its the theme of this chapter. 
”If you know science you can manipulate people as if you were literally in control of them” seems like a broad, and mostly untrue claim. AND IT FOLLOWS IMMEDIATELY AFTER HARRY EXPLAINED THE MOON LANDING TO DRACO. Science can take you to the fucking moon, maybe thats enough.</p> <p>This chapter also introduces comed-tea, a somewhat clever pun drink. If you open a can, at some point you’ll do a spit-take before finishing it. I’m not sure the point of this new magical introduction, hopefully Hariezer gets around to exploring it (seriously, hopefully Hariezer begins to explore ANYTHING to do with the rules of magic. I’m 7 full chapters in and this fanfic has paid lip service to science without using it to explore magic at all).</p> <p>Chapter summary: Hariezer makes it to platform 9 3/4, ditches Ron as somehow sub-human. Has a conversation with Draco that is mostly framed as conversation-as-explicit manipulation between Hariezer and Draco, and its very ham-fisted, but luckily Hariezer assures us its actual masterful manipulation, saying things like this, repeatedly:</p> <blockquote> <p>And Harry couldn’t help but notice how clumsy, awkward, graceless his attempt at resisting manipulation / saving face / showing off had appeared compared to Draco.</p> </blockquote> <p>Homework for the interested reader: next time you are meeting someone new, share something embarrassingly personal and then ask them immediately to reciprocate, explicitly saying ‘it’ll make us good friends.’ See how that works out for you.</p> <p>WHAT DO PEOPLE SEE IN THIS? It wouldn’t be so bad, but we are clearly supposed to identify with Hariezer, who looks at Draco as someone he clearly wants on his side, and who instantly dismisses someone (with no “Bayesian updates” whatsoever as basically less than human). I’m honestly surprised that anyone read past this chapter. But I’m willing to trudge on, for posterity. Two more glasses of scotch, and then I start chapter 8. I’m likely to need alcohol to soldier on from here on out.</p> <p>Side note: I’ve consciously not mentioned all the “take over the world” Hariezer references, but there are probably 3 or 4 per chapter. They seem at first like bad jokes, but they keep getting revisited so much that I think Hariezer’s explicit goal is perhaps not curiosity driven (figure out the rules of magic), but instead power driven (find out the rules of magic in order to take over the world). He assures Draco he really is Ravenclaw, but if he were written with consistency maybe he wouldn’t need to be? Hariezer doesn’t ask questions (like I would imagine a Ravenclaw would), he gives answers. Thus far, he has consistently decided the wizarding world has nothing to teach him. Arithmancy books he finds only go up to trigonometry, etc. He certainly has shown only limited curiosity this far. Its unclear to me why a curiosity driven, scientist character would feel a strong desire to impress, manipulate Draco Malfoy, as written here. 
This is looking less like a love-song to science, and more a love-song to some weird variant of How to Win Friends and Influence People.</p> <h3 id="a-few-observations-regarding-hariezer-yudotter">A few Observations Regarding Hariezer Yudotter</h3> <p>After drunkenly reading chapters 8,9 and 10 last night (I’ll get to the posts soon, hopefully), I was flipping channels and somehow settled on an episode of that old TV show with Steve Urkel (bear with me, this will get relevant in a second).</p> <p>In the episode, the cool kid Eddie gets hustled at billiards, and Urkel comes in and saves the day because his knowledge of trigonometry and geometry makes him a master at the table.</p> <p>I think perhaps this is a common dream of the science fetishist- if only I knew ALL OF THE SCIENCE I would be unstoppable at everything. Hariezer Yudotter is a sort of wish fulfillment character of that dream. Hariezer isn’t motivated by curiosity at all really, he wants to grow his super-powers by learning more science. Its why we can go 10 fucking chapters without Yudotter really exploring much in the way of the science of magic (so far I count one lazy paragraph exploring what his pouch can do, in 10 chapters). Its why he constantly describes his project as “taking over the world.” And its frustrating, because this obviously isn’t a flaw to be overcome its part of Yudotter’s “awesomeness.”</p> <p>I have a phd in a science, and it has granted me these real world super-powers:</p> <ol> <li>I fix my own plumbing, do my own home repairs,etc.</li> <li>I made a robot of legos and a raspberry pi that plays connect 4 incredibly well (my robot sidekick, I guess)</li> <li>Via techniques I learned in the sorts of books that in fictional world Hariezer uses to be a master manipulator, I can optimize ads on webpages such that up to 3% of people will click on them (that is, seriously, the power of influence in reality. Not Hannibal Lector but instead advertisers trying to squeeze an extra tenth of a percent on conversions), for which companies sometime pay me</li> <li>If you give me a lot of data, I can make a computer find patterns in it, for which companies sometimes pay me.</li> </ol> <p>Thats basically it. Back when I worked in science, I spent nearly a decade of my life calculating various background processes related to finding a Higgs boson, and I helped design some software theorists now use to calculate new processes quickly. These are the sorts of projects scientists work on, and most days its hard work and total drudgery, and there is no obvious ‘instrumental utility’- BUT I REALLY WANTED TO KNOW IF THERE WAS A HIGGS FIELD.</p> <p>And thats why I think the Yudotter character doesn’t feel like a scientist- he wants to be stronger, more powerful, take over the world, but he doesn’t seem to care what the answers are. Its all well and good to be driven, but most importantly, you have to be curious.</p> <h3 id="hpmor-8-back-to-the-inoffensive-chapters-of-yesteryear">HPMOR 8: Back to the inoffensive chapters of yesteryear</h3> <p>And a dramatic tonal shift and we are back to a largely inoffensive chapter.</p> <p>There is another lesson in this chapter, this time the lesson is confirmation bias (though Yudkowsky/Hariezer refer to it as ‘positive bias’), but once again, the pedagogical choices are strange. 
As Hariezer winds into his lesson to Hermione, she thinks the following:</p> <blockquote> <p>She was starting to resent the boy’s oh-so-superior tone…but that was secondary to finding out what she’d done wrong.</p> </blockquote> <p>So Yudkowsky knows his Hariezer has a condescending tone, but he runs with it. So as a reader, if I already know the material I get to be on the side of truth, and righteousness and I can condescend to the simps with Hariezer, OR, I don’t know the material, and then Hermione is my stand in, and I have to swallow being condescended to in order to learn.</p> <p>Generally, its not a good idea when you want to teach someone something to immediately put them on the defensive- I’ve never stood in front of a class, or tutored someone by saying</p> <blockquote> <p>&quot;The sad thing is you probably did everything the book told you to do… unless you read the very, very best sort of books, they won’t quite teach you how to do science properly…</p> </blockquote> <p>And Yudkowsky knows enough that his tone is off-putting to point to it. So I wonder- is this story ACTUALLY teaching people things? Or is it just a way for people who already know some of the material to feel superior to Hariezer’s many foils? Do people go and read the sequences so that they can move from Hariezer-foil, to Hariezer’s point of view? (these are not rhetorical questions, if anyone has ideas on this).</p> <p>As for the rest of the chapter- its good to see Hermione merits as human, unlike Ron. There is a strange bit in the chapter where Neville asks a Gryffindor prefect to find his frog, and the prefect says no (why? what narrative purpose does this serve?).</p> <p>Chapter summary: Hariezer meets Neville and Hermione on the train to Hogwarts. Still no actual exploration of magic rules. None of the fun candy of the original story.</p> <h3 id="hpmor-9-and-10">HPMOR 9 and 10</h3> <p>EDIT: I made a drunken mistake in this one, <a href="http://thinkingornot.tumblr.com/post/104047979906/chapters-9-and-10">see this response</a>. I do think my original point still goes through because the hat responds to the attempted blackmail with:</p> <blockquote> <p>I know you won’t follow through on a threat to expose my nature, condemning this event to eternal repetition. It goes against the moral part of you too strongly, whatever the short-term needs of the part of you that wants to win the argument.</p> </blockquote> <p>So the hat doesn’t say “I don’t care about this,” the hat says “you won’t do it.” My point is, however, substantially weakened.</p> <p>END EDIT</p> <p>Alright, the lagavulin is flowing, and I’m once more equipped to pontificate.</p> <p>These chapters are really one chapter split in two. I’m going to use them to argue against Yudkowsky’s friendly AI concept a bit. There is this idea, called ‘orthgonality’ that says that an AIs goals can be completely independent of its intelligence. So you can say ‘increase happiness’ and this uber-optimizer can tile the entire universe with tiny molecular happy faces, because its brilliant at optimizing but incapable of evaluating its goals. Just setting the stage for the next chapter.</p> <p>In this chapter, Harry gets sorted. When the sorting hat hits his head, Harry wonders if its self-aware, which because of some not-really-explained magical hat property, instantly makes the hat self-aware. The hat finds being self aware uncomfortable, and Hariezer worries that he’ll destroy an intelligent being when the hat is removed. 
The hat assures us that the hat cares only for sorting children. As Hariezer notes</p> <blockquote> <p>It [the hat] was still imbued with only its own strange goals…</p> </blockquote> <p>Even still, Hariezer manages to blackmail the hat- he threatens to tell all the other kids to wonder if the hat is self-aware. The hat concedes to the demand.</p> <p>So how does becoming self-aware over and over effect the hat’s goals of sorting people? It doesn’t. The blackmail should fail. Yudkowsky imagines that the minute it became self-aware, the hat couldn’t help but pick up some new goals. Even Yudkowsky imagines that becoming self-aware will have some effects on your goals.</p> <p>This chapter also has some more weirdly personal seeming moments when the line between Yudkowsky’s other writing and HPMOR breaks down completely.</p> <p>Summary: Harry gets sorted into ravenclaw.</p> <p>I am immensely frustrated that I’m 10 chapters into this thing, and we still don’t have any experiments regarding the rules of magic.</p> <h3 id="hpmor-11">HPMOR 11</h3> <p>Chapter 11 is “omake.” This is a personal pet-peeve of mine, because I’m a crotchety old man at heart. The anime culture takes Japanese words, for which we have perfectly good english words, and snags them (kawaii/kawaisa is a big one). Omake is another one. I have nothing against Japanese (I’ve been known to speak it), just don’t like unnecessary loaner words in general. I know this is my failing, BUT I WANT SO BAD TO HOLD IT AGAINST THIS FANFIC.</p> <p>Either way, I’m skipping the extra content, because I can only take so much.</p> <h3 id="hpmor-12">HPMOR 12</h3> <p>Nothing much in this chapter. Dumbledore gives his post-dinner speech.</p> <p>Harry cracks open a can of comed-tea and does the requisite spit-take when Dumbledore starts his dinner speech with random nonsense. He considers casting a spell to make his sense of humor very specific, and then he can use comed-tea to take over the world.</p> <p>Chapter summary: dinner is served and eaten</p> <h3 id="hpmor-13-bill-and-teds-excellent-adventure">HPMOR 13: Bill and Teds Excellent Adventure</h3> <p>There is a scene in Bill and Ted’s Excellent Adventure, toward the end, where they realize their time machine gives them super-human powers. They need to escape a jail, so they agree to get the keys later and travel back in time and hide them, and suddenly there the keys are. After yelling to remember a trash can, they have a trash can to incapacitate a guard with,etc. They can do anything they want.</p> <p>Anyway, this chapter is that idea, but much longer. The exception is that we don’t know there has been a time machine (actually, I don’t KNOW for sure thats what it is, but the Bill and Ted fan in me says thats what happened this chapter, I won’t find out until next chapter. If I were a Bayesian rationalist, I would say that the odds ratio is pi*10^^^^3 in my favor. ).</p> <p>Hariezer wakes up and finds a note saying he is part of a game. Everywhere he looks, as if by magic he finds more notes, deducting various game “points,” and some portraits have messages for him. The notes lead him to a pack of bullies beating up some hufflepuffs, and pies myseriously appears for Hariezer to attack with. The final note tells him to go to Mcgonagall’s office, and the chapter ends.</p> <p>I assume next chapter, Hariezer will recieve his time machine and future Hariezer will use it to set up the “game” as a prank on past Hariezer. 
It's a clever enough chapter.</p> <p>This chapter was actually decent, but what the world really needs is Harry Potter/Bill and Ted's Excellent Adventure crossover fiction.</p> <h3 id="hpmor-14-lets-talk-about-computability">HPMOR 14: Let's talk about computability</h3> <p>This chapter has created something in my brain like Mr. Burns's Three Stooges Syndrome. So many things I want to talk about, I don't know where to start!</p> <p>First, I was slightly wrong about last chapter. It wasn't a time machine Hariezer used to accomplish the prank in the last chapter, it was a time machine AND an invisibility cloak. Bill and Ted did not lead me astray.</p> <p>On to the chapter- Hariezer gets a time machine (Hariezer lives 26-hour days, so he is given a time turner to correct his sleep schedule), which prompts this:</p> <blockquote> <p>Say, Professor McGonagall, did you know that time-reversed ordinary matter looks just like antimatter? Why yes it does! Did you know that one kilogram of antimatter encountering one kilogram of matter will annihilate in an explosion equivalent to 43 million tons of TNT? Do you realise that I myself weigh 41 kilograms and that the resulting blast would leave A GIANT SMOKING CRATER WHERE THERE USED TO BE SCOTLAND?</p> </blockquote> <p>Credit where credit is due- this is correct physics. In fact, it's completely possible (though a bit old-fashioned and unwieldy) to treat quantum field theory such that all anti-matter is simply normal matter moving backward in time. Here is an example, look at this diagram:</p> <p><img src="images/su3su2u1/gamma.png"></p> <p>If we imagine time moving from the bottom of the diagram toward the top, we see two electrons traveling forward in time, and exchanging a photon and changing directions.</p> <p>But now imagine time moves left to right in the diagram instead- what we see is one electron and one positron coming together and destroying each other, and then a new pair forming from the photon. BUT, we COULD say that what we are seeing is really an electron moving forward in time, and an electron moving backward in time. The point where they "disappear" is really the point where the forward moving electron changed directions and started moving backward in time.</p> <p>This is probably very confusing; if anyone wants a longer post about this, I could probably try for it sober. I need to belabor this though- the takeaway point I need you to know- the best theory we have of physics so far can be interpreted as having particles that change direction in time, AND HARIEZER KNOWS THIS AND CORRECTLY NOTES IT.</p>
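<p>The 43-million-ton number in that quote also checks out, if you want to verify it yourself. A quick back-of-the-envelope check (the only inputs are the 1 kg + 1 kg from the quote and standard constants):</p> <pre><code>
# E = m c^2: 1 kg of antimatter annihilating with 1 kg of matter
# converts 2 kg of rest mass entirely into energy.
c = 3.0e8                           # speed of light, m/s
energy_joules = 2.0 * c**2          # about 1.8e17 J
joules_per_megaton_tnt = 4.184e15
print(energy_joules / joules_per_megaton_tnt)   # ~43 megatons, i.e. "43 million tons of TNT"
</code></pre>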
<p>Why is the time-reversal point important? Because a paragraph later he says this:</p> <blockquote> <p>You know right up until this moment I had this awful suppressed thought somewhere in the back of my mind that the only remaining answer was that my whole universe was a computer simulation like in the book Simulacron 3 but now even that is ruled out because this little toy ISN'T TURING COMPUTABLE! A Turing machine could simulate going back into a defined moment of the past and computing a different future from there, an oracle machine could rely on the halting behavior of lower-order machines, but what you're saying is that reality somehow self-consistently computes in one sweep using information that hasn't… happened… yet..</p> </blockquote> <p>This is COMPLETE NONSENSE (this is also terrible pedagogy again: either you know what Turing computable means or you drown in jargon). For this discussion, Turing computable means 'capable of being calculated using a computer.' The best theory of physics we have (a theory Hariezer already knows about) allows the sort of thing that Hariezer is complaining about. Both quantum mechanics and quantum field theory are Turing computable. That's not to say Hariezer's time machine won't require you to change physics a bit- you definitely will have to, but it's almost certainly computable.</p> <p>Now computable does not mean QUICKLY computable (or even feasibly computable). The new universe might not be computable in polynomial time (quantum field theory may not be; at least one problem within it, the fermion sign problem, is not).</p> <p>I don't think the time machine makes P = NP either. Having a time machine will allow you to speed up computations (you could wait until a computation was done, and then send the answer back in time). However, Hariezer's time machine is limited: it can only be used to move back 6 hours total, and can only be used 3 times in a day, so I don't think it could generally solve an NP-complete problem in polynomial time (after your 6-hour head start is up, things proceed at the original scaling). If you don't know anything about computational complexity, I guess if I get enough asks I can explain it in another, non-Potter post.</p> <p>But my point here is- the author here is supposedly an AI theorist. How is he flubbing computability stuff? This should be bread and butter stuff.</p> <p>I have so much more to say about this chapter. Another post will happen soon.</p> <p>Edit: I wasn't getting the P = NP thing, but I get the argument now (thanks Nostalgebraist). The idea is that you say "I'm going to compute some NP problem and come back with the solution" and then ZIP, out pops another you from the time machine, and hands you a slip of paper with the answer on it. Now you have 6 hours to verify the calculation, and then zip back to give it to your former self.</p> <p>But any problem in NP is checkable in P, so any problem small enough to be checkable in 6 hours (which is a lot of problems, including much of NP) is now computable in no time at all. It's not a general P = NP, but it's much wider in applicability than I was imagining.</p>
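<p>To make the "checkable in P" step concrete with an example of my own choosing (subset sum, a standard NP-complete problem, not anything from the story): finding a satisfying subset may take exponential time, but checking the subset that future-you hands back through the time turner is just a sum and a membership check:</p> <pre><code>
# Subset sum: given a list of numbers and a target, is there a subset
# summing to the target? Finding one is NP-complete; verifying a claimed
# solution (the "slip of paper") takes polynomial time.
def verify_certificate(numbers, target, certificate):
    counts = {}
    for x in numbers:
        counts[x] = counts.get(x, 0) + 1
    for x in certificate:                 # every claimed element must come from the list
        if counts.get(x, 0) == 0:
            return False
        counts[x] -= 1
    return sum(certificate) == target     # and the subset must actually hit the target

numbers = [3, 34, 4, 12, 5, 2]
print(verify_certificate(numbers, 9, [4, 5]))   # True
print(verify_certificate(numbers, 9, [3, 7]))   # False: 7 isn't in the list
</code></pre>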
<h3 id="hpmor-14-continued-comed-tea-newcomb-s-problem-science">HPMOR 14 continued: Comed Tea, Newcomb's problem, science</h3> <p>One of the odd obsessions of LessWrong is an old decision theory problem called Newcomb's Paradox. It goes like this- a super intelligence that consistently predicts correctly challenges you to a game. There are two boxes, A and B. And you are allowed to take one or both boxes.</p> <p>Inside box A is $10, and inside box B the super intelligence has already put $10,000 IF AND ONLY IF it predicted you will only take box B. What box should you take?</p> <p>The reason this is a paradox is that one group of people (call them causal people) might decide that because the super intelligence ALREADY made its call, you might as well take both boxes. You can't alter the past prediction.</p> <p>Other people (call these LessWrongians) might say, 'well, the super intelligence is always right, so clearly if I take box B I'll get more money'. Yudkowsky himself had tried to formalize a decision theory that picks box B, which involves allowing causes to propagate backward in time.</p> <p>A third group of people (call them 'su3su2u1 ists') might say "this problem is ill-posed. The idea of the super-intelligence might well be incoherent, depending on your model of how decisions are made." Here is why- imagine human decisions can be quite noisy. For instance, what if I flip an unbiased coin to decide which box to take? Now the super-intelligence can only have had a 50/50 chance to successfully predict which box I'd take, which contradicts the premise.</p> <p>There is another simple way to show the problem is probably ill-posed. Imagine we take another super-intelligence of the same caliber as the first (call the first 1 and the second 2). 1 offers the same game to 2, and now 2 takes both boxes if it predicts that 1 put the money in box B. It takes only box B if 1 did not put the money in box B. Obviously, either intelligence 1 is wrong, or intelligence 2 is wrong, which contradicts the premise, so the idea must be inconsistent (note, you can turn any person into super-intelligence number 2 by making the boxes transparent).</p>
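<p>The coin-flip objection is easy to make concrete with a toy simulation (all of this is made up for illustration; "accuracy" below just means how often a simulated predictor's guess matches the player's actual choice):</p> <pre><code>
import random

def accuracy(n_games, player, predictor):
    """Fraction of games where the predictor guessed the player's choice."""
    correct = 0
    for _ in range(n_games):
        prediction = predictor()      # the predictor must commit first
        choice = player()
        correct += (prediction == choice)
    return correct / n_games

always_one_box = lambda: "one-box"
coin_flipper = lambda: random.choice(["one-box", "two-box"])

# Against a deterministic one-boxer, a predictor that just says "one-box" is perfect.
print(accuracy(100_000, always_one_box, lambda: "one-box"))   # ~1.0
# Against a coin flipper, any predictor that can't see the flip averages 50%.
print(accuracy(100_000, coin_flipper, lambda: "one-box"))     # ~0.5
</code></pre> <p>Which is the point above: the "always correct" premise quietly assumes the player's choice is predictable at all.</p>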
<p>Anyway, Yudkowsky has a pet decision theory he has tried to formalize that allows causes to propagate backward in time. He likes this approach because you can get the LessWrongian answer to Newcomb every time. The problem is, his formalism has all sorts of problems with inconsistency because of the issues I raised about the inconsistency of a perfect predictor.</p> <p>Why do I bring this up? Because Hariezer decides in this chapter that comed-tea MUST work by causing you to drink it right before something spit-take worthy happens. The tea predicts the humor, and then magics you into drinking it. Of course, he does no experiments to test this hypothesis at all (ironic that just a few chapters ago he lectured Hermione about only doing 1 experiment to test her idea).</p> <p>So unsurprisingly perhaps, the single most used magic item in the story thus far is a manifestation of Yudkowsky's favorite decision theory problem.</p> <p>And my final note from this chapter- Hariezer drops this on us, regarding the brain's ability to understand time travel:</p> <blockquote> <p>Now, for the first time, he was up against the prospect of a mystery that was threatening to be permanent…. it was entirely possible that his human mind never could understand, because his brain was made of old-fashioned linear-time neurons, and this had turned out to be an impoverished subset of reality.</p> </blockquote> <p>This seems to misunderstand the nature of mathematics and its relation to science. I can't visualize a 4-dimensional curved space, certainly not the way I visualize 2d and 3d objects. But that doesn't stop me from describing it and working with it as a mathematical object.</p> <p>Time is ALREADY very strange and impossible to visualize. But mathematics allows us to go beyond what our brain can visualize to create notations and languages that let us deal with anything we can formalize and that has consistent rules. It's amazingly powerful.</p> <p>I never thought I'd see Hariezer Yudotter, who just a few chapters back was claiming science could let us perfectly manipulate and control people (better than an Imperio curse, or whatever the spell is that lets you control people), argue that science/mathematics couldn't deal with non-linear time.</p> <p>I hope that this is a moment where in later chapters we see growth from Yudotter, and he revisits this last assumption. And I hope he does some experiments to test his comed-tea hypothesis. Right now it seems like experiments are things Hariezer asks people around him to do (so they can see things his way), but for him pure logic is good enough.</p> <p>Chapter summary: I drink three glasses of scotch. Hariezer gets a time machine.</p> <h3 id="hpmor-15-in-which-once-again-i-want-more-science">HPMOR 15: In which, once again, I want more science</h3> <p>I had a long post but the internet ate it earlier this week, so this is try 2. I apologize in advance, this blog post is mostly me speculating about some magi-science.</p> <p>This chapter begins the long-awaited lessons in magic. The topic of today's lesson consists primarily of one thing: don't transfigure common objects into food or drink.</p> <blockquote> <p>Mr. Potter, suppose a student Transfigured a block of wood into a cup of water, and you drank it. What do you imagine might happen to you when the Transfiguration wore off?" There was a pause. "Excuse me, I should not have asked that of you, Mr. Potter, I forgot that you are blessed with an unusually pessimistic imagination -" &quot;I'm fine,&quot; Harry said, swallowing hard. &quot;So the first answer is that I don't know," the Professor nodded approvingly, "but I imagine there might be… wood in my stomach, and in my bloodstream, and if any of that water had gotten absorbed into my body's tissues - would it be wood pulp or solid wood or…" Harry's grasp of magic failed him. He couldn't understand how wood mapped into water in the first place, so he couldn't understand what would happen after the water molecules were scrambled by ordinary thermal motions and the magic wore off and the mapping reversed.</p> </blockquote> <p>We get a similar warning regarding transfiguring things into any gases or liquids:</p> <blockquote> <p>You will absolutely never under any circumstances Transfigure anything into a liquid or a gas. No water, no air. Nothing like water, nothing like air. Even if it is not meant to drink. Liquid evaporates, little bits and pieces of it get into the air.</p> </blockquote> <p>Unfortunately, once again, I want the author to take it farther. Explore some actual science! What WOULD happen if that wood-water turned back into wood in your system?</p> <p>So let's take a long walk off a short speculative pier together, and try to guess what might happen. First, we'll assume magic absorbs any major energy differences and smooths over any issues at the time of transition. Otherwise, when you magic in large wood molecules in place of much smaller water molecules, there will suddenly be lots of energy from the molecules repelling each other (this is called a steric mismatch), which will likely cause all sorts of problems (like a person exploding).</p> <p>To even begin to answer, we have to pick a rule for the transition. Let's assume each water molecule turns into one "wood molecule" (wood is ill-defined on a molecular scale, it's made up of lots of shit. However, that shit is mostly long carbohydrate chains called polysaccharides.)</p> <p>So you'd drink the water, which gets absorbed pretty quickly by your body (any that's lingering in your gut unabsorbed will just turn into more fiber in your diet). After a while, it would spread through your body, be taken up by your cells, and then these very diffuse water molecules would turn into polysaccharides. Luckily for you, your body probably knows how to deal with this; polysaccharides are hanging out all over your cells anyway. Maybe somewhat surprisingly, you'd probably be fine.
I think for lots of organic material, swapping out one organic molecule with another is likely to not harm you much. Of course, if the thing you swap in is poison, thats another story.</p> <p>Now, I’ve cheated somewhat- I could pick another rule where you’d definitely die. Imagine swapping in a whole splinter of wood for each water molecule. You’d be shredded. The details of magic matter here, so maybe a future chapter will give us the info needed to revisit this.</p> <p>What if instead of wood, we started with something inorganic like gold? If the water molecules turn into elemental gold (and you don’t explode from steric mismatches mentioned above), you’d be fine as long as the gold didn’t ionize. Elemental gold is remarkably stable, and it takes quite a bit of gold to get any heavy metal poisoning from it.</p> <p>On the other hand, if it ionizes you’ll probably die. Gold salts (which split into ionic gold + other stuff in your system) have a semi-lethal doses (the dose that kills half of the people who take it) of just a few mg per kg, so a 70 kg person couldn’t survive more than 5g or so of the salt, which is even less ionic gold. So in this case, as soon as the spell wore off you’d start to be poisoned. After a few hours, you’d probably start showing signs of liver failure (jaundice, etc).</p> <p>Water chemistry/physics is hard, so I have no idea if the gold atoms will actually ionize. Larger gold crystals definitely would not, and they are investigating using gold nanoparticles for medicine, which are also mostly non-toxic. However, individual atoms might still ionize.</p> <p>What if we don’t drink the water? What if we just get near a liquid evaporating? Nothing much at all, as it turns out. Evaporation is a slow process, as is diffusion.</p> <p>Diffusion constants are usually a few centimeters^2 per second, and diffusion is a slow process that moves forward with the square root of time (to move twice as far it takes 4 times as much time).</p> <p>So even if the transformation into water lasts a full hour, a single water molecule that evaporates from the glass will travel less than 100 centimeters! So unless you are standing with your face very close to the glass, you are unlikely to encounter even a single evaporated molecule. Even with your face right near the glass, that one molecule will mostly likely just be breathed in and breathed right back out. You have a lot of anatomic dead-space in your lungs in which no exchange takes place, and the active area is optimized for picking up oxygen.</p> <p>So how about transfiguring things to a gas? What happens there? Once again, this will depend on how we choose the rules of magic. When you make the gas, does it come in at room temperature and pressure? If so, this sets the density. Then you can either bring in an equal volume of gas to the original object with very few molecules, or you bring in an equal number of molecules, with a very large density.</p> <p>At an equal number of molecules, you’ll get hundreds of liters of diffuse gas. Your lungs are only hold about 5 liters, so you are going to get a much smaller dose then you’d get from the water (a few percent at best), where all the molecules get taken up by your body. 
<p>If it’s equal volume to the original object, then there will be very few gas molecules in a small volume, and the diffusion argument applies- unless you get very near where you created the gas, you aren’t likely to breathe any in.</p> <p>Thus concludes a bit of speculative magi-science guesswork. Sorry if I bored you.</p> <p>Anyway- this chapter, I admit, intrigued me enough to spend some time thinking about what WOULD happen if something un-transfigured inside you. Not a bad chapter, really, but it again feels a tad lazy. We get some hazy worries about liquids evaporating (SCIENCE!) but no order-of-magnitude estimate about whether or not it matters (it does not, unless maybe you boiled the liquid you made). There are lots of scientific ideas the author could play with, but they just get set aside.</p> <p>As for the rest of the chapter, Hariezer gets shown up by Hermione, who is outperforming him and has already read her school books. A competition for grades is launched.</p> <h3 id="hpmor-16-shades-of-ender-s-game">HPMOR 16: Shades of Ender’s game</h3> <p>I apologize for the longish break from HPMOR, sometimes my real job calls.</p> <p>This chapter mostly comprises Hariezer’s first defense against the dark arts class. We meet the ultra-competent Quirrell (although I suppose, like in the original, it’s really the ultra-competent Voldemort) for the first time.</p> <p>The lessons open with a bit of surprising anti-academic sentiment- Quirrell gives a long speech about how you needn’t learn to defend yourself against anything specific in the wizarding world because you could either magically run away or just use the instant killing spell, so the entire “Ministry-mandated” course with its “useless” textbooks is unnecessary. Of course, this comes from the mouth of the ostensible bad guy, so it’s unclear how much we are supposed to be creeped out by this sentiment (though Hariezer applauds).</p> <p>After this, we get to the lesson. After teaching a light attack spell, Quirrell asks Hermione (who mastered it fastest) to attack another student. She refuses, so Quirrell moves on to Malfoy, who is quick to acquiesce by shooting Hermione.</p> <p>Then Quirrell puts Hariezer on the spot and things get sort of strange. When asked for unusual combat uses of everyday items, Hariezer comes up with a laundry list of outlandish ways to kill people, which leads Quirrell to observe that for Hariezer Yudotter nothing is defensive- he settles only for the destruction of his enemy. This feels very Ender’s Game (Hariezer WINS, and that makes him dangerous), and it’s sort of a silly moment.</p> <p>Chapter summary: weirdly anti-academic defense against the dark arts lesson. We once more get magic, but no rules of magic.</p> <h3 id="hpmor-17-introducing-dumbledore-and-some-retreads-on-old-ideas">HPMOR 17: Introducing Dumbledore, and some retreads on old ideas</h3> <p>This chapter opens with a little experiment with Hariezer trying to use the time turner to verify an NP-complete problem, as we discussed in a previous chapter section. Since it’s old ground, we won’t retread it.</p> <p>From here, we move on to the first broomstick lesson, which proceeds much like the book, only with shades of elitism.
Hariezer drops this nugget on us:</p> <blockquote> <p>There couldn’t possibly be anything he could master on the first try which would baffle Hermione, and if there was and it turned out to be broomstick riding instead of anything intellectual, Harry would just die.</p> </blockquote> <p>Which feels a bit like the complete dismissal of Ron earlier. So the anti-jock Hariezer, who wouldn’t be caught dead being good at broomsticking, doesn’t get involved in racing around to try to get Neville’s Remembrall; instead, the entire class ends up in a standoff, wands drawn. So Hariezer challenges the Slytherin who has it to a strange duel. Using his time turner in proper Bill and Ted fashion, he hides a decoy Remembrall and wins. It’s all old stuff at this point, and I’m starting to worry there is nothing new under the sun- more time turner, more Hariezer winning (in case we don’t get it, there is a conversation with McGonagall where Hariezer once more realizes he doesn’t even consider NOT winning).</p> <p>AND THEN we meet Dumbledore, who is written as a lazy man’s version of insane. He’ll say something insightful, drop a Lord of the Rings quote and then immediately do something batshit. One moment he is trying to explain that Harry can trust him, the next he is setting a chicken on fire (yes, this happens). In one baffling moment, he presents Hariezer with a big rock, and this exchange happens:</p> <blockquote> <p>So… why do I have to carry this rock exactly?” &quot;I can’t think of a reason, actually,&quot; said Dumbledore. &quot;…you can’t.&quot; Dumbledore nodded. “But just because I can’t think of a reason doesn’t mean there is no reason.” &quot;Okay,&quot; said Harry, &quot;I’m not even sure if I should be saying this, but that is simply not the correct way to deal with our admitted ignorance of how the universe works.&quot;</p> </blockquote> <p>Now, if someone gave you a large heavy rock and said “keep this on you, just in case,” how would you begin to tell them they’re wrong? Here is Hariezer’s approach:</p> <blockquote> <p>How can I put this formally… um… suppose you had a million boxes, and only one of the boxes contained a diamond. And you had a box full of diamond-detectors, and each diamond-detector always went off in the presence of a diamond, and went off half the time on boxes that didn’t have a diamond. If you ran twenty detectors over all the boxes, you’d have, on average, one false candidate and one true candidate left. And then it would just take one or two more detectors before you were left with the one true candidate. The point being that when there are lots of possible answers, most of the evidence you need goes into just locating the true hypothesis out of millions of possibilities - bringing it to your attention in the first place. The amount of evidence you need to judge between two or three plausible candidates is much smaller by comparison. So if you just jump ahead without evidence and promote one particular possibility to the focus of your attention, you’re skipping over most of the work.</p> </blockquote> <p>Thank God Hariezer was able to use his advanced reasoning skills to make an analogy with diamonds in boxes to explain WHY THE IDEA OF CARRYING A ROCK AROUND FOR NO REASON IS STUPID. This was the chapter’s rationality idea- seriously, it’s like Yudkowsky didn’t even try on this one.</p>
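<p>As an aside, the arithmetic in that diamond-detector quote does at least check out; here’s a quick check, using only the numbers from the quote (the code is just my own sketch):</p> <pre><code># a million boxes, one diamond; each detector fires on every diamond
# and on half of the diamond-less boxes
n_boxes = 1_000_000
false_positive_rate = 0.5
detectors = 20

expected_false = (n_boxes - 1) * false_positive_rate ** detectors
print(expected_false)   # ~0.95: on average about one false candidate survives,
                        # plus the one true diamond box, matching the quote
</code></pre>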
<p>Chapter summary: Hariezer sneers at broomstick riding, some (now standard) time turner hijinks, and Hariezer meets a more insane than wise Dumbledore.</p> <h3 id="hpmor-18-what">HPMOR 18: What?</h3> <p>I wanted to discuss the weird anti-university/school-system undercurrents of the last few chapters, but I started into chapter 18 and it broke my brain.</p> <p>This chapter is absolutely ludicrous. We meet Snape for the first time, and he behaves as you’d expect from the source material. He makes a sarcastic remark and asks Hariezer a bunch of questions Hariezer does not know the answer to.</p> <p>This leads to Hariezer flipping out:</p> <blockquote> <p>The class was utterly frozen. &quot;Detention for one month, Potter,&quot; Severus said, smiling even more broadly. &quot;I decline to recognize your authority as a teacher and I will not serve any detention you give.&quot; People stopped breathing. Severus’s smile vanished. “Then you will be -” his voice stopped short. &quot;Expelled, were you about to say?&quot; Harry, on the other hand, was now smiling thinly. &quot;But then you seemed to doubt your ability to carry out the threat, or fear the consequences if you did. I, on the other hand, neither doubt nor fear the prospect of finding a school with less abusive professors. Or perhaps I should hire private tutors, as is my accustomed practice, and be taught at my full learning speed. I have enough money in my vault. Something about bounties on a Dark Lord I defeated. But there are teachers at Hogwarts who I rather like, so I think it will be easier if I find some way to get rid of you instead.”</p> </blockquote> <p>Think about this- THE ONLY THINGS SNAPE HAS DONE are make a snide comment and ask Hariezer a series of questions he doesn’t know the answer to.</p> <p>The situation continues to escalate, until Hariezer locks himself in a closet and uses his invisibility cloak and time turner to escape the classroom.</p> <p>This leads to a meeting with the headmaster where Hariezer THREATENS TO START A NEWSPAPER CAMPAIGN AGAINST SNAPE (find a newspaper interested in the ‘some students think a professor is too hard on them; for instance, he asked Hariezer Yudotter 3 hard questions in a row’ story).</p> <p>AND EVERYONE TAKES THIS THREAT SERIOUSLY, AS IF IT COULD DO REAL HARM. HARIEZER REPEATEDLY SAYS HE IS PROTECTING STUDENTS FROM ABUSE. THEY TAKE THIS THREAT SERIOUSLY ENOUGH THAT HARIEZER NEGOTIATES A TRUCE WITH SNAPE AND DUMBLEDORE. Snape agrees to be less demanding of discipline, Hariezer agrees to apologize.</p> <p>Nowhere in this chapter does Hariezer consider that he deprived other students of the damn potions lesson. In his ruminations about why Snape keeps his job, he never considers that maybe Snape knows a lot about potions/is actually a good potions teacher.</p> <p>This whole chapter is basically a stupid power struggle that requires literally everyone in the chapter to behave in outrageously silly ways.
Hariezer throws a temper tantrum befitting a 2-year-old, and everyone else gives him his way.</p> <p>On the plus side, McGonagall locks down Hariezer’s time turner, so hopefully that device will stop making an appearance for a while; it’s been the “clever” solution to every problem for several chapters now.</p> <p>One more chapter this bad and I might have to abort the project.</p> <h3 id="hpmor-19-i-can-t-even-what">HPMOR 19: I CAN’T EVEN… WHAT?…</h3> <p>I… this… what…</p> <p>So there is a lot I COULD say here, about inconsistent characterization, ridiculously contrived events, etc. But fuck it- here is the key event of this chapter: Quirrell, it turns out, is quite the martial artist (because of course he is, who gives a fuck about genre consistency or unnecessary details, PILE IN MORE “AWESOME”). The lesson he claims to have learned from martial arts (at a mysterious dojo, because of course) that Hariezer needs to learn (as evidenced by his encounter with Snape) is how to lose.</p> <p>How does Quirrell teach Hariezer “how to lose”? He calls Hariezer to the front, insists Hariezer not defend himself, and then has a bunch of Slytherins beat the shit out of him.</p> <p>That’s right- a character who one fucking chapter ago couldn’t handle being asked three hard questions in a row (IT’S ABUSE, I’LL CALL THE PAPERS) submits to being literally beaten by a gang at a teacher’s suggestion.</p> <p>An 11-year-old kid, at a teacher’s suggestion, submits to getting beaten by a bunch of 16-year-olds. All of this is portrayed in a positive light.</p> <h3 id="ideas-around-chapter-19">Ideas around chapter 19</h3> <p>In light of the recent anon, I’m going to attempt to give the people (person?) what they want. Also, I went from not caring if people were reading this, to being a tiny bit anxious I’ll lose the audience I unexpectedly picked up. SELLING OUT.</p> <p>If we ignore the literal child abuse of the chapter, the core of the idea is still somewhat malignant. It’s true throughout that Hariezer DOES have a problem with “knowing how to lose,” but the way you learn to lose is by losing, not by being ordered to take a beating.</p> <p>Quirrell could have challenged Hariezer to a game of chess, he could have asked questions Hariezer didn’t know the answer to (as Snape did, which prompted the insane chapter 18), etc. But the problem is the author is so invested in Hariezer being the embodiment of awesome that even when he needs to lose for story purposes, to learn a lesson, Yudkowsky doesn’t want to let Hariezer actually lose at something. Instead he gets ordered to lose, and he isn’t ordered to lose at something in his wheelhouse, but in the “jock-stuff” repeatedly sneered at in the story (physical confrontation).</p> <h3 id="hpmor-20-why-is-this-chapter-called-bayes-theorem">HPMOR 20: why is this chapter called Bayes Theorem?</h3> <p>A return to what passes for “normal.” No child beating in this chapter, just a long, boring conversation.</p> <p>This chapter opens with Hariezer ruminating about how much taking that beating sure has changed his life. He knows how to lose now, he isn’t going to become a dark lord now! Quirrell quickly takes him down a peg:</p> <blockquote> <p>&quot;Mr. Potter,&quot; he said solemnly, with only a slight grin, &quot;a word of advice. There is such a thing as a performance which is too perfect. Real people who have just been beaten and humiliated for fifteen minutes do not stand up and graciously forgive their enemies.
It is the sort of thing you do when you’re trying to convince everyone you’re not Dark, not -“</p> </blockquote> <p>Hariezer protests, and we get:</p> <blockquote> <p>There is nothing you can do to convince me because I would know that was exactly what you were trying to do. And if we are to be even more precise, then while I suppose it is barely possible that perfectly good people exist even though I have never met one, it is nonetheless improbable that someone would be beaten for fifteen minutes and then stand up and feel a great surge of kindly forgiveness for his attackers. On the other hand it is less improbable that a young child would imagine this as the role to play in order to convince his teacher and classmates that he is not the next Dark Lord. The import of an act lies not in what that act resembles on the surface, Mr. Potter, but in the states of mind which make that act more or less probable</p> </blockquote> <p>How does Hariezer take this? Does he point out “if no evidence can sway your priors, your priors are too strong?” or some other bit of logic-chop Bayes-judo? Nope, he drops some nonsensical jargon:</p> <blockquote> <p>Harry blinked. He’d just had the dichotomy between the representativeness heuristic and the Bayesian definition of evidence explained to him by a wizard.</p> </blockquote> <p>Where is Quirrell using Bayesian evidence? He isn’t, he is neglecting all evidence because all evidence fits his hypothesis. Where does the representativeness heuristic come into play? It doesn’t.</p> <p>The representativeness heuristic is making estimates based on how typical of a class something is, e.g., show someone a picture of a stereotypical ‘nerd’ and ask “is this person more likely an English or a physics grad student?” The representativeness heuristic says “you should answer physics.” It’s a good rule of thumb that psychologists think is probably hardwired into us. It also leads to some well-known fallacies I won’t get into here.</p> <p>Quirrell is of course doing none of that- Quirrell has a hypothesis that fits anything Hariezer could do, so no amount of evidence will dissuade him.</p> <p>After this, Quirrell and Hariezer have a long talk about science (because of course Quirrell too has a fascination with space travel). This leads to some real Less Wrong stuff.</p> <p>Quirrell tells us that of course muggle scientists are dangerous because:</p> <blockquote> <p>There are gates you do not open, there are seals you do not breach! The fools who can’t resist meddling are killed by the lesser perils early on, and the survivors all know that there are secrets you do not share with anyone who lacks the intelligence and the discipline to discover them for themselves!</p> </blockquote> <p>And of course, Hariezer agrees:</p> <blockquote> <p>This was a rather different way of looking at things than Harry had grown up with. It had never occurred to him that nuclear physicists should have formed a conspiracy of silence to keep the secret of nuclear weapons from anyone not smart enough to be a nuclear physicist</p> </blockquote> <p>Which is a sort of weirdly elitist position- after all, lots of nuclear physicists are plenty dangerous. It’s not intelligence that makes you less likely to drop a bomb. But this fits the general Yudkowsky/AI fear- an open research community is less important than hiding dangerous secrets.
This isn’t necessarily the wrong position, but it’s a challenging one that merits actual discussion.</p> <p>Anyone who has done research can tell you how important the open flow of ideas is for progress. I’m of the opinion that the increasing privatization of science is actually slowing us down in a lot of ways by building silos around information. How much do we retard progress in order to keep dangerous ideas out of people’s hands? Who gets to decide what is dangerous? Who decides who gets let into “the conspiracy?” Intelligence alone is no guarantee someone won’t drop a bomb, despite how obvious it seems to Quirrell and Yudotter.</p> <p>After this digression about nuclear weapons, we learn from Quirrell that he snuck into NASA and enchanted the Pioneer gold plaque in a way that will “make it last a lot longer than it otherwise would.” It’s unclear to me what wear and tear Quirrell is protecting the plaque from. Hariezer suggests that Quirrell might have snuck a magic portrait or a ghost into the plaque, because nothing makes more sense than dooming an (at least semi-)sentient being to a near eternity of solitary confinement.</p> <p>Anyway, partway through this chapter, Dumbledore bursts in, angry that Quirrell had Hariezer beaten. Hariezer defends him, etc. The resolution is that it’s agreed Hariezer will start learning to protect himself from mind readers.</p> <p>Chapter summary- long, mostly boring conversation, peppered with some existential risk/we need to escape the planet rhetoric. It’s also called Bayes theorem despite that theorem making no appearance whatsoever.</p> <p>And a note on the really weird pedagogy- we now have Quirrell, who in the books is possessed by Voldemort, acting as a mouthpiece for the author. This seems like a bad choice, because at some point I assume there will be a reveal, and it will turn out the reader shouldn’t have trusted Quirrell.</p> <h3 id="hpmor-21-secretive-science">HPMOR 21: secretive science</h3> <p>So this chapter begins quite strangely- Hermione is worried that she is “bad” because she is enjoying being smarter than Hariezer. She then decides that she isn’t “bad”, it’s a budding romance. That’s the logic she uses. But because she won the book-reading contest against Hariezer (he doesn’t flip out, it must be because he learned “how to lose”), she gets to go on a date with him. The date is skipped over.</p> <p>Next we find Hariezer meeting Malfoy in a dark basement, discussing how they will go about doing science. Malfoy is written as uncharacteristically stupid, in order to be a foil once more for Hariezer, peppering the conversation with such gems as:</p> <blockquote> <p>Then I’ll figure out how to make the experimental test say the right answer!</p> <p>&quot;You can always make the answer come out your way,” said Draco. That had been practically the first thing his tutors had taught him. “It’s just a matter of finding the right arguments.”</p> </blockquote> <p>We get a lot of platitudes from Hariezer about how science humbles you before nature. But then we get the same ideas Quirrell suggested previously: because “science is dangerous”, they are going to run their research program as a conspiracy.</p> <blockquote> <p>&quot;As you say, we will establish our own Science, a magical Science, and that Science will have smarter traditions from the very start.” The voice grew hard.
“The knowledge I share with you will be taught alongside the disciplines of accepting truth, the level of this knowledge will be keyed to your progress in those disciplines, and you will share that knowledge with no one else who has not learned those disciplines. Do you accept this?”</p> </blockquote> <p>And the name of this secretive scienspiracy?</p> <blockquote> <p>And standing amid the dusty desks in an unused classroom in the dungeons of Hogwarts, the green-lit silhouette of Harry Potter spread his arms dramatically and said, “This day shall mark the dawn of… the Bayesian Conspiracy.”</p> </blockquote> <p>Of course. As I mentioned in the previous chapter, anyone who has done science knows that it’s a collaborative process that requires an open exchange of ideas.</p> <p>And see what I mean about the melding of ideas between Quirrell and Hariezer? It’s weird to use them both as author mouthpieces. The Bayesian Conspiracy is obviously an idea Yudkowsky is fond of, and here Hariezer gets the idea largely from Quirrell just one chapter back.</p> <h3 id="hpmor-22-science">HPMOR 22: science!</h3> <p>This chapter opens strongly enough. Hariezer decides that the entire wizarding world has probably been wrong about magic, and doesn’t know the first thing about it.</p> <p>Hermione disagrees, and while she doesn’t outright say “maybe you should read a magical theory book about how spells are created” (such a thing must exist), she is at least somewhat down that path.</p> <p>To test his ideas, Hariezer creates a single-blind test- he gets spells from a book, changes the words or the wrist motion or whatnot, and gets Hermione to cast them. Surprisingly, Hariezer is proven wrong by this little test. For once, the world isn’t written as insane as a foil for our intrepid hero.</p> <blockquote> <p>It seemed the universe actually did want you to say ‘Wingardium Leviosa’ and it wanted you to say it in a certain exact way and it didn’t care what you thought the pronunciation should be any more than it cared how you felt about gravity.</p> </blockquote> <p>There are a few anti-academic snipes, because it wouldn’t be HPMOR without a little snide swipe at academia:</p> <blockquote> <p>But if my books were worth a carp they would have given me the following important piece of advice…Don’t worry about designing an elaborate course of experiments that would make a grant proposal look impressive to a funding agency.</p> </blockquote> <p>Weird little potshots about academia (comments like “so many bad teachers, its like 8% as bad as Oxford,” “Harry was doing better in classes now, at least the classes he considered interesting”) have been peppered throughout the chapters since Hariezer arrived at Hogwarts. Oh academia, always trying to make you learn things that might be useful, even if they are a trifle boring. So full of bad teachers, etc. Just constant little comments attacking school and academia.</p> <p>Anyway, this chapter would be one of the strongest chapters, except there is a second half. In the second half, Hariezer partners with Draco to get to the bottom of wizarding blood purity.</p> <blockquote> <p>Harry Potter had asked how Draco would go about disproving the blood purist hypothesis that wizards couldn’t do the neat stuff now that they’d done eight centuries ago because they had interbred with Muggleborns and Squibs.</p> </blockquote> <p>Here is the thing about science: step 0 needs to be making sure you’re trying to explain a real phenomenon.
Hariezer knows this; he tells the story of N-rays earlier in the chapter, but completely fails to understand the point.</p> <p>Hariezer and Draco have decided, based on one anecdote (the founders of Hogwarts were the best wizards ever, supposedly), that wizards are weaker today than in the past. The first thing they should do is find out if wizards are actually getting weaker. After all, the two most dangerous dark wizards ever were both recent, Grindelwald and Voldemort. Dumbledore is no slouch. Even four students were able to make the Marauder’s Map just one generation before Harry. (Incidentally, this is exactly where neoreactionaries often go wrong- they assume things are getting worse without actually checking, and then create elaborate explanations for non-existent facts).</p> <p>Anyway, for the purposes of the story, I’m sure it’ll turn out that wizards are getting weaker, because Yudkowsky wrote it. But this would have been a great chance to teach an actually useful lesson, and it would make the N-ray story told earlier a useful example, and not a random factoid.</p> <p>Anyway, to explain the effect, they come up with a few obvious hypotheses:</p> <ol> <li>Magic itself is fading.</li> <li>Wizards are interbreeding with Muggles and Squibs.</li> <li>Knowledge to cast powerful spells is being lost.</li> <li>Wizards are eating the wrong foods as children, or something else besides blood is making them grow up weaker.</li> <li>Muggle technology is interfering with magic. (Since 800 years ago?)</li> <li>Stronger wizards are having fewer children. (Draco = only child? Check if 3 powerful wizards, Quirrell / Dumbledore / Dark Lord, had any children.)</li> </ol> <p>They miss some other obvious ones (there is a finite amount of magic power, so increasing populations = more wizards = less power per wizard, for instance. Try to come up with your own, it’s easy and fun).</p> <p>They come up with some ways to collect some evidence- find out what the first-year curriculum was throughout Hogwarts history, and do some wizard genealogy by talking to portraits.</p> <p>Still, finally some science, even if half of it was infuriating.</p> <h3 id="hpmor-23-wizarding-genetics-made-way-too-simple">HPMOR 23: wizarding genetics made (way too) simple</h3> <p>Alright, I need to preface this: I have the average particle physicist’s knowledge of biology (a few college courses, long ago mostly forgotten). That said, the Lagavulin is flowing, so I’m going to pontificate as if I’m obviously right; please reblog me with corrections if I am wrong.</p> <p>In this chapter, Hariezer and Draco are going to explore what I think of as the blood hypothesis- that wizardry is carried in the blood, and that intermarriage with non-magical types is diluting wizardry.</p> <p>Hariezer gives Draco a brief, serviceable enough description of DNA (more like pebbles than water). He lays out two models. In the first, there are lots of wizarding genes, and the more wizard genes you have, the more powerful the wizard you are. In this case, Hariezer reasons, as powerful wizards marry less powerful wizards, or non-magical types, the frequency of the magical variant of wizard genes in the general population becomes diluted. In this model, two squibs might rarely manage to have a wizard child, but they are likely to be weaker than wizard-born wizards. Call this model 1.</p> <p>The other model Hariezer lays out is that magic lies on a single recessive gene.
He reasons that squibs have one dominant, non-magical version and one recessive, magical version of the gene. So of kids born to two squibs, 1/4 will be wizards. In this version, you either have magic or you don’t, so if wizards married the non-magical, wizards themselves could become more rare, but the power of wizards wouldn’t be diluted. Call this model 2.</p> <p>The proper test between models 1 and 2, suggests Hariezer, is to look at the children born to two squibs. If about one-fourth of them are wizards, it’s evidence for model 2; otherwise, it’s evidence for model 1.</p> <p><strong>There is a huge problem with this. Do you see it? Here is a hint: what other predictions does model 2 make? While you are thinking about it, read on.</strong></p> <p>Before I answer the question, I want to point out that Hariezer ignores tons of other plausible models. Here is one I just made up. Imagine, for instance, a single gene that switches magic on and off, and a whole series of other genes that make you a better wizard. Maybe some double-jointed-wrist gene allows you to move your wand in unusually deft ways. Maybe some mouth-shape gene allows you to pronounce magical sounds no one else can. In this case, magical talent can be watered down as in model 1, and wizard inheritance could still look like Mendel would suggest, as in model 2.</p> <p><strong>Alright, below I’m going to answer my query above. Soon there will be no time to figure it out for yourself.</strong></p> <p>Squibs are, by definition, the non-wizard children of wizard parents. Hariezer’s model 2 predicts that squibs cannot exist. It is already empirically disproven.</p> <p>Hariezer, of course, does not notice this massive problem with his favored model, and Draco’s collected genealogy suggests about 6 out of 28 squib-born children were wizards, so he declares model 2 wins the test.</p> <p>Draco flips out, because now that he “knows” that magic isn’t being watered down by breeding he can’t join the Death Eaters and his whole life is ruined, etc. Hariezer is happy that Draco has “awakened as a scientist” (I hadn’t complained about the stilted language in a while, just reminding you that it’s still there), but Draco lashes out and casts a torture spell and locks Hariezer in the dungeon. After some failed escape attempts, he once again resorts to the time turner, because even now that it’s locked down, it’s the solution to every problem.</p> <p>One other thing of note- to investigate the hypothesis that really strong spells can’t be cast anymore, Hariezer tries to look up a strong spell and runs into “the interdict of Merlin”: strong spells can’t be written down, only passed from wizard to wizard.</p> <p>It’s looking marginally possible that it will turn out that this natural secrecy is exactly what’s killing off powerful magic- it’s not open, so ideas aren’t flourishing or being passed on. Hariezer will notice that and realize his “Bayesian Conspiracy” won’t be as effective as an open science culture, and I’ll have to take back all of my criticisms around secretive science (it will be a lesson Hariezer learns, and not an idea Hariezer endorses). It seems more likely, however, given the author’s existential risk concerns, that this interdict of Merlin will be endorsed.</p>
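<p>Before moving on, here’s a quick sketch of the genetics argument in this chapter. The 6-out-of-28 figure is from the story; the simple one-gene model and the code are just my own illustration of the two points above (the squib-cross prediction, and the prediction the story ignores):</p> <pre><code>from math import comb
from itertools import product

# model 2 as described: one gene, "M" = non-magic allele, "m" = magic allele
# wizards are mm, squibs are Mm, muggles are MM
def children(parent1, parent2):
    return ["".join(pair) for pair in product(parent1, parent2)]

squib_kids = children("Mm", "Mm")
wizard_kids = children("mm", "mm")

print(sum(set(kid) == {"m"} for kid in squib_kids) / len(squib_kids))
# 0.25: squib x squib gives 1/4 wizard children, the prediction Hariezer tests

print(all(set(kid) == {"m"} for kid in wizard_kids))
# True: wizard x wizard gives only wizard children, i.e. the model leaves
# no room for squibs to exist in the first place

# Draco's data: 6 wizards out of 28 squib-born children. Consistent with 1/4?
n, k, p = 28, 6, 0.25
pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
print(n * p, sum(pmf[: k + 1]))
# expectation is 7; the chance of seeing 6 or fewer is roughly 0.4, so the data
# really is consistent with the one-gene model -- the problem is the squib
# prediction above, not the arithmetic
</code></pre>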
<h3 id="some-more-notes-regarding-hpmor">Some more notes regarding HPMOR</h3> <p>There is a line in the movie Clueless (if you aren’t familiar, Clueless was an older generation’s Mean Girls) where a woman is described as a “Monet”- in that, like the painting, she looks good from afar but up close is a mess.</p> <p>So I’m now nearly 25 chapters into this thing, and I’m starting to think that HPMOR is this sort of a Monet- if you let yourself get carried along, it seems okay enough. It references a lot of things that a niche group of people, myself included, like (physics! computational complexity! genetics! psychology!). But as you stare at it more, you start noticing that it doesn’t actually hang together; it’s a complete mess.</p> <p>The hard science references are subtly wrong, and often aren’t actually explained in-story (just a jargon dump to say ‘look, here is a thing you like’).</p> <p>The social science stuff fares a bit better (it’s less wrong ::rimshot::), but even when its explanation is correct, its power is wildly exaggerated- conversations between Quirrell/Malfoy/Potter seem to follow scripts of the form:</p> <p>&quot;Here is an awesome manipulation I’m using against you&quot;</p> <p>&quot;My, that is an effective manipulation. You are a dangerous man&quot;</p> <p>&quot;I know, but I also know that you are only flattering me as an attempt to manipulate me.&quot;</p> <p>&quot;My, what an effective use of Bayesian evidence that is!&quot;</p> <p>Other characters get even worse treatment, either behaving nonsensically to prove how good Harry is at manipulation (as in the chapter where Harry tells off Snape and then tries to blackmail the school because Snape asked him questions he didn’t know the answers to), OR acting nonsensically so Harry can explain why it’s nonsensical (“Carry this rock around for no reason.” “That’s actually the fallacy of privileging the hypothesis.”) The social science/manipulation/marketing psychology stuff is just a flavoring for conversations.</p> <p>No important event in the story has hinged on any of this rationality- instead, basically every conflict thus far is resolved via the time turner.</p> <p>And if you strip all this out, all the wrongish science-jargon and the conversations that serve no purpose but to prove Malfoy/Quirrell/Harry are “awesome” by having them repeatedly think/tell each other how awesome they are, the story has no real structure. It’s just a series of poorly paced, disconnected events (if you strip out the “awesome” conversations, there are many chapters where nothing happens). There is no there there.</p> <h3 id="hpmor-24-evopsych-rorschach-test">HPMOR 24: evopsych Rorschach test</h3> <p>Evolutionary psychology is a field that famously has a pretty poor bullshit filter. Satoshi Kanazawa once published a series of articles claiming that beautiful people will have more female children (because beauty is more important for girls) and engineers/mathematicians will have more male children (because only men need the logic-brains). The only thing his papers proved was that he is bad at statistics (in fact, Kanazawa made an entire career out of being bad at statistics, such is the state of evo-psych).</p> <p>One of the core criticisms is that for any fact observed in the world, you can tell several different evolutionary stories, and there is no real way to tell which, if any, is actually true.
Because of this, when someone gives you an evopsych explanation for something, it’s often telling you more about what they believe than about science or the world (there are exceptions, but they are rare).</p> <p>So this chapter is a long, pretty much useless conversation between Draco and Hariezer about how they are manipulating each other and Dumbledore or whatever, but smack in the middle we get this rumination:</p> <blockquote> <p>In the beginning, before people had quite understood how evolution worked, they’d gone around thinking crazy ideas like human intelligence evolved so that we could invent better tools.</p> <p>The reason why this was crazy was that only one person in the tribe had to invent a tool, and then everyone else would use it…the person who invented something didn’t have much of a fitness advantage, didn’t have all that many more children than everyone else. [<strong>SU comment- could the inventor of an invention perhaps get to occupy a position of power within a tribe? Could that lead to them having more wealth and children?</strong>]</p> <p>It was a natural guess… A natural guess, but wrong.</p> <p>Before people had quite understood how evolution worked, they’d gone around thinking crazy ideas like the climate changed, and tribes had to migrate, and people had to become smarter in order to solve all the novel problems.</p> <p>But human beings had four times the brain size of a chimpanzee. 20% of a human’s metabolic energy went into feeding the brain. Humans were ridiculously smarter than any other species. That sort of thing didn’t happen because the environment stepped up the difficulty of its problems a little…. [<strong>SU challenge to the reader- save this climate change evolutionary argument with an ad-hoc justification</strong>]</p> <p>Ending up with that gigantic outsized brain must have taken some sort of runaway evolutionary process…And today’s scientists had a pretty good guess at what that runaway evolutionary process had been….</p> <p>[It was] Millions of years of hominids trying to outwit each other - an evolutionary arms race without limit - [that] had led to… increased mental capacity.</p> </blockquote> <p>What does his preferred explanation for the origin of intelligence (people evolved to outwit each other) say about the author?</p> <h3 id="hpmor-24-25-26-mangled-narratives">HPMOR 24/25/26: mangled narratives</h3> <p>This post is going to be entirely about the way the story is being told in this section of chapters. There is a big meatball of a terrible idea, but I’m getting sick of that low-hanging fruit, so I’ll only mention it briefly in passing.</p> <p>I’m a sucker for stories about con artists. In these stories, there is a tradition of breaking with the typical chronological order of storytelling- instead they show the end result of the grand plan first, followed by all the planning that went into it (or some variant of that). In that way, the audience gets to experience the climax first from the perspective of the mark, and then from the perspective of the clever grifters. Yudkowsky himself successfully employs this pattern in the first chapter with the time turner.</p> <p>In these chapters, however, this pattern is badly mangled.
The chapters are setting up an elaborate prank on Rita Skeeter (Draco warned Hariezer that Rita was asking questions during one of many long conversations), but jumbling the narrative accomplishes literally nothing.</p> <p>Here are the events, in the order laid out in the narrative:</p> <ol> <li><p>Hariezer tells Draco he didn’t tell on him about the torture, and borrows some money from him.</p></li> <li><p>(this is the terrible idea meatball) Using literally the exact same logic that Intelligent Design proponents use (and doing exactly 0 experiments), Hariezer decides while thinking over breakfast:</p></li> </ol> <blockquote> <p>Some intelligent engineer, then, had created the Source of Magic, and told it to pay attention to a particular DNA marker.</p> <p>The obvious next thought was that this had something to do with “Atlantis”.</p> </blockquote> <ol start="3"> <li><p>Hariezer meets with Dumbledore, refuses to tell on Draco, and says getting tortured is all part of his manipulation game.</p></li> <li><p>Fred and George Weasley meet with a mysterious man named Flume and tell him the-boy-who-lived needs the mysterious man’s help. There is a Rita Skeeter story mentioned that says Quirrell is secretly a Death Eater and is training Hariezer to be the next dark lord, a story Flume says was planted by the elder Malfoy.</p></li> <li><p>Quirrell tells Rita Skeeter he has no Dark Mark; Rita ignores him.</p></li> <li><p>Hariezer hires Fred and George (presumably with Malfoy’s money) to perpetrate a prank on Rita Skeeter- to convince her of something totally false.</p></li> <li><p>Hariezer has lunch with Quirrell and reads a newspaper story with the headline:</p></li> </ol> <blockquote> <p>HARRY POTTER SECRETLY BETROTHED TO GINEVRA WEASLEY</p> </blockquote> <p>This is the story the Weasleys apparently planted (the prank), and apparently there was a lot of supporting evidence or something, because Quirrell is incredulous it could be done. And then Quirrell, after speculating that Rita Skeeter could be capable of turning into a small animal, crushes a beetle.</p> <p>So what’s the problem with this narrative order? First, there is absolutely no payoff to jumbling the chronology. The prank is left until the end, and it’s exactly what we expected- a false story was planted in the newspaper. It doesn’t even seem like that big a deal- just a standard gossip column story (of course, Harry and Quirrell react like it’s a huge, impossible-to-have-done prank, to be sure the reader knows it’s hard).</p> <p>Second, most of the scenes are redundant; they contain no new information whatsoever and they are therefore boring- the event covered in 3 (talking with Dumbledore) is covered in full in 1 (telling Malfoy he didn’t tell on him to Dumbledore). The events of 6 (Hariezer hiring the Weasleys to prank for him) are completely covered in 4 (when the Weasleys meet with Flume, they tell him it’s for Hariezer). This chapter is twice as long as it should be, for no reason.</p> <p>Third, the actual prank is never shown from either the mark’s or the grifters’ perspective. It happens entirely off-stage, so to speak. We don’t see Rita Skeeter encountering all this amazing evidence about Hariezer’s betrothal and writing up her career-making article.
We don’t see Fred and George’s elaborate plan (although if I were a wizard and wanted to plant a false newspaper story, I’d just plant a false memory in a reporter).</p> <p>Which would have been more interesting: the actual con that happens off-stage, or the long conversations about nothing that happen in these chapters? These chapters are just an utter failure. The narrative decisions are nonsensical, and everything continues to be tell, tell, tell, never show.</p> <p>Also of note- Quirrell gives Hariezer Roger Bacon’s diary of magic, because of course that’s a thing that exists.</p> <h3 id="hpmor-27-answers-from-a-psychologist">HPMOR 27: answers from a psychologist</h3> <p>In order to answer last night’s science question, I spent today slaving on the streets, polling professionals for answers (i.e., I sent one email to an old college roommate who did a doctorate in experimental brain stuff). This will basically be a guest post.</p> <p>Here is the response:</p> <blockquote> <p>The first thing you need to know: this is called “the simulation theory of empathy.” Now that you have a magic google phrase, you can look up everything you’d want, or read on, my (not so) young padawan.</p> <p>You are correct that no one knows how empathy works, it’s too damn complicated, but what we can look at is motor control, and in motor control the smoking gun for simulation is mirror neurons. Rizzolatti and collaborators discovered that certain neurons in the macaque monkey’s inferior frontal gyrus that are related to the motor vocabulary activate not only when the monkey does a gesture, but also when it sees someone else doing that same gesture. So maybe, says Rizzolatti, the same neurons responsible for action are also responsible for understanding action (action-understanding). This would be big support for simulation explanations of understanding others. Unfortunately, it’s not the only explanation for action-understanding- it could be a simple priming effect, for instance. There are also other areas of the macaque brain (in particular the superior temporal sulcus) that aren’t involved in action, but do appear to have some role in action-understanding.</p> <p>It is not an overstatement to say that this discovery of mirror neurons caused the entire field to lose their collective shit. For some reason, motor explanations for brain phenomena are incredibly appealing to large portions of the field, and always have been. James Woods (the old dead behaviorist, not the awesome actor) had a theory that thought itself was related to the motor-neurons that control speech. It’s just something that the entire field is primed to lose their shit over. Some philosophers of the mind made all sorts of sweeping pronouncements (“mirror-neurons are responsible for the great leap forward in human evolution”, pretty sure that’s a direct quote).</p> <p>The problem is that the gold standard for monkey tests is to see what a lesion in that portion of the brain does. Near as anyone can tell, lesions in F5 (the portion of the inferior frontal gyrus where the mirror neurons are) do not impair action-understanding.</p> <p>The next, bigger problem for theories of human behavior is that there is no solid evidence of mirror neurons in humans. A bunch of fMRI studies showed a bit of activity in one region, and then meta-studies suggested not that region, maybe some other region, etc.
fMRI studies are tricky; Google “dead salmon fMRI”.</p> <p>But even if mirror neurons are involved in humans, there is really strong evidence they can’t be involved in action-understanding. The mirror proponents suggest speech is a strong trigger for the proposed human mirror neurons. For instance, in the speech system, we’ve known since Paul Broca (really old French guy) that lesions can destroy your ability to speak without destroying your ability to understand speech. This is a huge problem for models that link action-understanding to action: killing those neurons should destroy both.</p> <p>Also, the suggested human mirror neurons do not fire in response to pantomimed actions. And in autism spectrum disorders, action-understanding is often impaired with no impairment to action.</p> <p>So in summary, the simulation theory of empathy got a big resurgence after mirror neurons, but there is decently strong empirical evidence against a mirror-only theory of action-understanding in humans. That doesn’t mean mirror neurons have no role to play (though if they aren’t found in humans, it does mean they have no role to play), it just means that the brain is complicated. I think the statement you quoted to me would have been something you could read from a philosopher of mind in the late 80s or early 90s, but not something anyone involved in experiments would say. By the mid 2000s, a lot of that enthusiasm had petered out a bit. Then I left the field.</p> </blockquote> <p>So on this particular bit of science, it looks like Yudkowsky isn’t wrong, he is just presenting conjecture and hypothesis as settled science. Still, I learned something here; I’d never encountered this idea before. I’ll have an actual post about chapter 27 tomorrow.</p> <h3 id="hpmor-27-mostly-retreads">HPMOR 27: mostly retreads</h3> <p>This is another chapter where most of the action is stuff that has happened before; we are getting more and more retreads.</p> <p>The new bit is that Hariezer is learning to defend himself from mental attacks. The goal, apparently, is to perfectly simulate someone other than yourself; that way, the mind reader learns the wrong things. This leads into the full-throated endorsement of the simulation theory of empathy that was discussed by a professional in my earlier post. Credit where credit is due- this was an idea I’d never encountered before, and I do think HPMOR is good for some of that- if you don’t trust the presentation and Google ideas as they come up, you could learn quite a bit.</p> <p>We also find out Snape is a perfect mind-reader. This is an odd choice- in the original books Snape’s ability to block mind-reading was something of a metaphor for his character- you can’t know if you can trust him because he is so hard to read, his inscrutableness even fooled the greatest dark wizard ever, etc. It was, fundamentally, his cryptic dodginess that helped the cause, but it also fomented the distrust that some of the characters in the story felt toward him.</p> <p>Now for the retreads-</p> <p>More pointless bitching about Quidditch. Nothing was said here that wasn’t said in the earlier bitching about Quidditch.</p> <p>Snape enlists Hariezer’s help to fight anti-Slytherin bullies (for no real reason, near as I can tell); the bullies are fought once more with con-artist-style cleverness (much like in the earlier chapter with the time turner and invisibility cloak.
In this chapter, it’s just with an invisibility cloak).</p> <p>Snape rewards Hariezer’s rescue with a conversation about Hariezer’s parents, during which Hariezer decides his mother was shallow, which upsets Snape. It’s an odd moment, but the odd moments in HPMOR dialogue have piled up so high it’s almost not worth mentioning.</p> <p>And the chapter culminates when the bullied Slytherin tells Hariezer about Azkaban, pleading with Hariezer to save his parents. Of course, Hariezer can’t (something tells me he will in the near future), and we get this:</p> <blockquote> <p>&quot;Yeah,&quot; said the Boy-Who-Lived, &quot;that pretty much nails it. Every time someone cries out in prayer and I can’t answer, I feel guilty about not being God.&quot;</p> <p>…</p> <p>The solution, obviously, was to hurry up and become God.</p> </blockquote> <p>So another retread- Hariezer is once more making clear his motives aren’t curiosity, they are power. This was true after chapter 10, and it’s still true now.</p> <p>This is the only real action for several chapters now; unfortunately, all the action feels like it’s already happened before in other chapters.</p> <h3 id="hpmor-28-hacking-science-map-territory">HPMOR 28: hacking science/map-territory</h3> <p>Finally we get back to some attempts to do magi-science, but it’s again deeply frustrating. It’s more transfiguration- the only magic we have thus far explored, and it leads to a discussion of map vs. territory distinctions that is horribly mangled.</p> <p>At the opening of the chapter, instead of using science to explore magic, the new approach is to treat magic as a way to hack science itself. To that end, Hariezer tries (and fails) to transfigure something into “cure for Alzheimer’s,” and then tries (successfully) to transfigure a rope of carbon nanotubes. I guess the thought here is he can then give these things to scientists to study? It’s unclear, really.</p> <p>Frustrated with how useless this seems, Hermione makes this odd complaint:</p> <blockquote> <p>&quot;Anyway,&quot; Hermione said. Her voice shook. &quot;I don’t want to keep doing this. I don’t believe children can do things that grownups can’t, that’s only in stories.&quot;</p> </blockquote> <p>Poor Hermione- that’s the feeblest of objections, especially in a story where every character acts like they are in their late teens or twenties. It’s almost as if the author was looking for some knee-jerk complaint you could throw out that everyone would write off as silly on its face.</p> <p>So Hariezer decides he needs to do something adults can’t to appease Hermione. To do this, he decides to attack the known constraints on magic, starting with the idea that you can only transfigure a whole object, and not part of an object (a constraint I think was introduced just for this chapter?).</p> <p>So Hariezer reasons: things are made out of atoms. There isn’t REALLY a whole object there, so why can’t I do part of an object? This prompted me to wonder- if you do transform part of an object, what happens at the interface? Does this whole-object constraint have something to do with the interface? I mentioned in the chapter 15 section that magicking large molecules in place of water could cause steric mismatches (really just volume constraints) with huge energy differences, hence explosions. What happens at the micro level when you take some uniform crystalline solid and try to patch on some organic material like rubber at some boundary?
If you deform the (now rubbery) material, what happens when it changes back and the crystal spacing is now messed up? Could you partially transform something if you carefully worked out the interface?</p> <p>It will not surprise someone who has read this far that none of these questions are asked or answered.</p> <p>Instead, Hariezer thinks really hard about how atoms are real, and in the process we get ruminations on the map and the territory:</p> <blockquote> <p>But that was all in the map, the true territory wasn’t like that, reality itself had only a single level of organization, the quarks, it was a unified low-level process obeying mathematically simple rules.</p> </blockquote> <p>This seems innocuous enough, but a fundamental mistake is being made here. For better or for worse, physics is limited in what it can tell you about the territory; it can just provide you with more accurate maps. Often it provides you with multiple, equivalent maps for the same situation with no way to choose between them.</p> <p>For instance, quarks (and gluons) have this weird property- they are well-defined excitations at very high energies, but not at all well-defined at low energies, where bound states become the fundamental excitations. There is no such thing as a free quark at low energy. For some problems the quark map is useful; for many, many more problems the meson/hadron (proton, neutron, kaon, etc.) map is much more useful. The same theory at a different energy scale provides a radically different map (renormalization is a bitch, and a weak coupling becomes strong).</p> <p>Continuing in this vein, he keeps being unable to transform only part of an object, so he keeps trying different maps, making the same map/territory confusion, culminating in:</p> <blockquote> <p>If he wanted power, he had to abandon his humanity, and force his thoughts to conform to the true math of quantum mechanics.</p> <p>There were no particles, there were just clouds of amplitude in a multiparticle configuration space and what his brain fondly imagined to be an eraser was nothing except a gigantic factor in a wavefunction that happened to factorize,</p> </blockquote> <p>(Side note: for Hariezer it’s all about power, not about curiosity, as I’ve said dozens of times now. Also, I know as much physics as anyone, and I don’t think I’ve abandoned my humanity.)</p> <p>This is another example of the same problem I’m getting at above. There is no “true math of quantum mechanics.” In non-relativistic, textbook quantum mechanics, I can formulate one version of quantum mechanics in 3 space dimensions and 1 time dimension, and calculate things via path integrals. I can also build a large configuration space (Hilbert space) with 3 position and 3 momentum dimensions per particle (and one overall time dimension), and calculate things via operators on that space. These are different mathematical formulations, over different spaces, that are completely equivalent. Neither map is more appropriate than the other. Hariezer arbitrarily thinks of configuration space as the RIGHT one.</p> <p>This isn’t unique to quantum mechanics, most theories have several radically different formulations. Good old Newtonian mechanics has a formulation on the exact same configuration space Hariezer is thinking of.</p> <p>The big point here is that the same theory has different mathematical formulations. We don’t know which is “the territory”; we just have a bunch of different, but equivalent, maps.
Each map has its own strong suits, and it’s not clear that any one of them is the best way to think about all problems. Is quantum mechanics 3+1 dimensional (3 space, 1 time) or is it 6N+1 dimensional (3 position and 3 momentum dimensions per particle, plus 1 time dimension)? It’s both and neither (more appropriately, it’s just not a question that physics can answer for us).</p> <p>What Hariezer is doing here isn’t separating the map and the territory, it’s reifying one particular map (configuration space)!</p> <p>(Less important: I also find it amusing, in a physics-elitist sort of way (sorry for the condescension), that Yudkowsky picks non-relativistic quantum mechanics as the final, ultimate reality. Instead of describing or even mentioning quantum field theory, which is the most low-level theory we (we being science) know of, Yudkowsky picks non-relativistic quantum mechanics, the most low-level theory HE knows.)</p> <p>Anyway, despite obviously reifying a map, in-story it must be the “right” map, because suddenly he manages to transform part of an object, although he tells Hermione:</p> <blockquote> <p>Quantum mechanics wasn’t enough,” Harry said. “I had to go all the way down to timeless physics before it took.</p> </blockquote> <p>So this is more bad pedagogy: timeless physics isn’t even a map, it’s the idea of a map. No one has made a decent formulation of quantum mechanics without a specified time direction (technical aside: it’s very hard to impose unitarity sensibly if you are trying to make time emerge from your theory, instead of being inbuilt). It’s pretty far away from mainstream theory attempts, but here it’s presented as the ultimate idea in physics. It seems very odd to just toss in a somewhat obscure idea as the pinnacle of physics.</p> <p>Anyway, Hariezer shows Dumbledore and McGonagall his newfound ability to transfigure part of an object, and the chapter ends.</p> <h3 id="hpmor-29-not-much-here">HPMOR 29: not much here</h3> <p>Someone called Yudkowsky out on the questionable decision to include his pet theories as established science, so chapter 29 opens with this (why didn’t he stick this disclaimer on the chapters where the mistakes were made?):</p> <blockquote> <p>Science disclaimers: Luosha points out that the theory of empathy in Ch. 27 (you use your own brain to simulate others) isn’t quite a known scientific fact. The evidence so far points in that direction, but we haven’t analyzed the brain circuitry and proven it. Similarly, timeless formulations of quantum mechanics (alluded to in Ch. 28) are so elegant that I’d be shocked to find the final theory had time in it, but they’re not established yet either.</p> </blockquote> <p>He is still wrong about timeless formulations of quantum mechanics, though: they aren’t more elegant, they don’t exist.</p> <p>The rest of this chapter seems like it’s just setup for something coming later- Hariezer, Draco and Hermione are all named as heads of Quirrell’s armies and are all trying to manipulate each other. Some complaints from Hermione that broomstick riding is jock-like and stupid, old hat by now.</p> <p>There is, however, one exceptionally strange bit- apparently in this version of the world, the core plot of Prisoner of Azkaban (Scabbers the rat was really Peter Pettigrew) was just a delusion that a schizophrenic Weasley brother had. Just a stupid swipe at the original book for no real reason.</p> <h3 id="hpmor-30-31-credit-where-credit-is-due">HPMOR 30-31: credit where credit is due</h3> <p>So credit where credit is due -- these two chapters are pretty decent.
We finally get some action in a chapter, there is only one bit of wrongish science, and the overall moral of the episode is a good one.</p> <p>In these chapters, three teams, “armies” led by Draco, Hariezer and Hermione, compete in a mock-battle, Quirrell’s version of a team sport. The action is more or less competently written (despite things like having Neville yell “special attack”), and it’s more-or-less fun and quick to read. It feels a bit like a lighter-hearted version of the beginning competitions of Ender’s Game (which no doubt inspired these chapters).</p> <p>The overall “point” of the chapters is even pretty valuable- Hermione, who is written off as an idiot by both Draco and Hariezer, splits her army and has half attack Draco and half attack Hariezer. She is seemingly wiped out almost instantly. Draco and Hariezer then fight each other nearly to death, and out pops Hermione’s army- turns out they only faked defeat. Hermione wins, and we learn that unlike Draco and Hariezer, Hermione delegated and collaborated with the rest of her team to develop strategies to win the fight. There is a (very unexpected given the tone of everything thus far) lesson about teamwork and collaboration here.</p> <p>That said -- I still have nits to pick. Hariezer’s army is organized in quite possibly the dumbest possible way:</p> <blockquote> <p>Harry had divided the army into 6 squads of 4 soldiers each, each squad commanded by a Squad Suggester. All troops were under strict orders to disobey any orders they were given if it seemed like a good idea at the time, including that one… unless Harry or the Squad Suggester prefixed the order with “Merlin says”, in which case you were supposed to actually obey.</p> </blockquote> <p>This might seem like a good idea, but anyone who has played team sports can testify- there is a reason that you work out plays in advance, and generally have delineated roles. I assume the military has a chain of command for similar reasons, though I have never been a soldier. I was hoping to see this idea for a creatively-disorganized army bite Hariezer, but it does not. There seems to be no confusion at all over orders, etc. Basically, none of what you’d expect to happen from telling an army “do what you want, disobey all orders” happens.</p> <p>And it wouldn’t be HPMOR without potentially bad social science; here is today’s reference:</p> <blockquote> <p>There was a legendary episode in social psychology called the Robbers Cave experiment. It had been set up in the bewildered aftermath of World War II, with the intent of investigating the causes and remedies of conflicts between groups. The scientists had set up a summer camp for 22 boys from 22 different schools, selecting them to all be from stable middle-class families. The first phase of the experiment had been intended to investigate what it took to start a conflict between groups. The 22 boys had been divided into two groups of 11 -</p> <ul> <li>and this had been quite sufficient.</li> </ul> <p>The hostility had started from the moment the two groups had become aware of each others’ existences in the state park, insults being hurled on the first meeting.
They’d named themselves the Eagles and the Rattlers (they hadn’t needed names for themselves when they thought they were the only ones in the park) and had proceeded to develop contrasting group stereotypes, the Rattlers thinking of themselves as rough-and-tough and swearing heavily, the Eagles correspondingly deciding to think of themselves as upright-and-proper.</p> <p>The other part of the experiment had been testing how to resolve group conflicts. Bringing the boys together to watch fireworks hadn’t worked at all. They’d just shouted at each other and stayed apart. What had worked was warning them that there might be vandals in the park, and the two groups needing to work together to solve a failure of the park’s water system. A common task, a common enemy.</p> </blockquote> <p>Now, I readily admit to not having read the original Robbers Cave book, but I do have two textbooks that reference it, and Yudkowsky gets the overall shape of the study right, but fails to mention some important details. (If my books are wrong, and Yudkowsky is right, which seems highly unlikely given his track record, please let me know.)</p> <p>Both descriptions I have suggest the experiment had three stages, not two. The first stage was to build up the in-groups, then the second stage was to introduce them to each other and build conflict, and then the third stage was to try and resolve the conflict. In particular, this aside from Yudkowsky originally struck me as surprisingly insightful:</p> <blockquote> <p>They’d named themselves the Eagles and the Rattlers (they hadn’t needed names for themselves when they thought they were the only ones in the park)</p> </blockquote> <p>Unfortunately, it’s simply not true- during phase 1 the researchers asked the groups to come up with names for themselves, and let the social norms for the groups develop on their own. The “in-group” behavior developed before they met their rival groups.</p> <p>While tensions existed from the first meeting, real conflicts didn’t develop until the two groups competed in teams for valuable prizes.</p> <p>This stuff matters- Yudkowsky paints a picture of humans dividing so easily into tribes that simply setting two groups of boys loose in the same park will cause trouble. In reality, taking two groups of boys, encouraging them to develop group habits, group names, group customs, and then setting the groups to compete directly for scarce prizes (while researchers encourage the growth of conflicts) will cause conflicts. This isn’t just a subtlety.</p> <h3 id="hpmor-32-interlude">HPMOR 32: interlude</h3> <p>Chapter 32 is just a brief interlude, nothing here really; I just felt the need to put this in for completeness.</p> <h3 id="hpmor-33-it-worked-so-well-the-first-time-might-as-well-try-it-again">HPMOR 33: it worked so well the first time, might as well try it again</h3> <p>This chapter has left me incredibly frustrated. After a decent chapter, we get a terrible retread of the same thing. For me, this chapter failed so hard that I’m actually feeling sort of dejected; it undid any goodwill the previous battle chapter had built up.</p> <p>This section of chapters is basically a retread of the dueling armies just a brief section back. Unfortunately, this second battle section completely flubs a lot of the things that worked pretty well in the first battle section. There is a lot to talk about here that I think failed, so this might be long.</p> <p>There is an obvious huge pacing problem here.
The first battle game happens just a brief interlude before the second battle game. Instead of spreading this game out over the course of the Hogwarts school year (or at least putting a few of the other classroom episodes in between), these just get slammed together. First battle, one interlude, last battle. That means that a lot of the evolution of the game over time, how people are reacting to it, etc. is left as a tell rather than a show. A lot of this chapter is spent dealing with big changes to Hogwarts that have been developing as students get super-involved in this battle game, but we never see any of that.</p> <p>Imagine if Ender’s Game (a book fresh on my mind because of the incredibly specific references in this chapter) were structured so that you get the first battle game, and then a flash-forward to his final battle against the aliens, with Ender explaining all the strategy he learned over the rest of that year. This chapter is about as effective as that last Ender’s Game battle would be.</p> <p>The chapter opens with Dumbledore and McGonagall worried about the school-</p> <blockquote> <p>Students were wearing armbands with insignia of fire or smile or upraised hand, and hexing each other in the corridors.</p> </blockquote> <p>Loyalty to armies over house or school is tearing the school apart!</p> <p>But then we turn to the army generals- apparently the new rules of the game allowed soldiers in armies to turn traitor, and it’s caused the whole game to spiral out of control- Draco complains:</p> <blockquote> <p>You can’t possibly do any real plots with all this stuff going on. Last battle, one of my soldiers faked his own suicide.</p> </blockquote> <p>Hermione agrees; everyone is losing control of their armies because of all the traitors.</p> <p>&quot;But.. wait…&quot; I can hear you asking, &quot;how can that make sense? Loyalty to the armies is so absolute people are hexing each other in the corridors? But at the same time, almost all the students in the armies are turning traitor and plotting against their generals? Both of those can’t be true?&quot; I agree, you smart reader you, both of these things don’t work together. NOT ONLY IS YUDKOWSKY TELLING INSTEAD OF SHOWING, WE ARE BEING TOLD CONTRADICTORY THINGS. Yudkowsky wanted to be able to follow through on the Robbers Cave idea he developed earlier, but he also needed all these traitors for his plot, so he tried to run in both directions at once.</p> <p>That’s not the only problem with this chapter (it wouldn’t be HPMOR without misapplied science/math concepts)- it turns out Hermione is winning, so the only way for Draco and Hariezer to try to catch up is to temporarily team up, which leads to a long explanation where Hariezer explains the prisoner’s dilemma and Yudkowsky’s pet decision theory.</p> <p>Here is the big problem- in the classic prisoner’s dilemma:</p> <p>If my partner cooperates, I can either:</p> <p>-cooperate, in which case I spend a short time in jail, and my partner spends a short time in jail</p> <p>-defect, in which case I spend no time in jail, and my partner serves a long time in jail</p> <p>If my partner defects, I can either:</p> <p>-cooperate, in which case I spend a long time in jail, and my partner goes free</p> <p>-defect, in which case I spend a long time in jail, as does my partner.</p> <p>The key insight of the prisoner’s dilemma is that no matter what my partner does, defecting improves my situation.
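<p>To make the dominance argument concrete, here is a minimal sketch you can run yourself. The numeric payoffs are my own illustrative assumptions (higher means a better outcome), not anything from the chapter; the second table is a rough encoding of the Draco/Hariezer situation described below, where the dominance check fails.</p> <pre><code class="language-python">
# Minimal sketch: does "defect" strictly dominate "cooperate" for one player
# of a 2x2 game? Payoff numbers are illustrative assumptions only.

CLASSIC_PD = {
    # (my_move, partner_move): my_payoff
    ("cooperate", "cooperate"): 2,    # short jail term for both
    ("cooperate", "defect"):    0,    # I serve a long term, partner goes free
    ("defect",    "cooperate"): 3,    # I go free
    ("defect",    "defect"):    1,    # long term for both
}

# Rough utilities for the army-game situation (my guess at the ordering):
# both cooperate -> a real shot at first or second (best),
# defect on a cooperator -> guaranteed second,
# both defect -> scrap over second and third,
# cooperate against a defector -> guaranteed third.
ARMY_GAME = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):    1,
    ("defect",    "cooperate"): 2,
    ("defect",    "defect"):    1.5,
}

def defect_dominates(payoffs):
    """True if defecting is strictly better no matter what the partner does."""
    return all(
        payoffs[("defect", other)] > payoffs[("cooperate", other)]
        for other in ("cooperate", "defect")
    )

print(defect_dominates(CLASSIC_PD))  # True: a genuine prisoner's dilemma
print(defect_dominates(ARMY_GAME))   # False: no dominant strategy here
</code></pre> <p>Any numbers respecting those orderings give the same verdict for the classic dilemma: defection wins regardless of what the partner does.</p>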
This leads to a dominant strategy where everyone defects, even though the both-defect outcome is worse than the both-cooperate outcome.</p> <p>In the situation between Draco and Hariezer:</p> <p>If Draco cooperates, Hariezer can either:</p> <p>-cooperate, in which case Hariezer and Draco both have a shot at getting first or second</p> <p>-defect, in which case Hariezer is guaranteed second, Draco guaranteed third place</p> <p>If Draco defects, Hariezer can either:</p> <p>-cooperate, in which case Hariezer is guaranteed third, and Draco gets second.</p> <p>-defect, in which case Hariezer and Draco are fighting it out for second and third.</p> <p>Can you see the difference here? If Draco is expected to cooperate, Hariezer has no incentive to defect- both cooperate is STRICTLY BETTER than the situation where Hariezer defects against Draco. This is not at all a prisoner’s dilemma, it’s just cooperating against a bigger threat. All the pontificating about decision theories that Hariezer does is just wasted breath, because no one is in a prisoner’s dilemma.</p> <p>So much for the pointless digression about the non-prisoner’s dilemma (seriously, this is getting absurd and frustrating- I’m hard pressed to find a single science reference in this whole thing that’s unambiguously applied correctly).</p> <p>After these preliminaries, the battle begins. Unlike the light-hearted, winking reference to Ender’s Game of the previous chapter, Yudkowsky feels the need to make it totally explicit- they fight in the lake, so that Hariezer can use exactly the stuff he learned from Ender’s Game to give him an edge. It turns the light homage of the last battle into just the setup for the beat-you-over-the-head reference this time. There is a benefit to subtlety, and to assuming your reader isn’t an idiot.</p> <p>Anyway, during the battle, everyone betrays everyone and the overall competition ends in a tie.</p> <h3 id="hpmor-34-35-aftermath-of-the-war-game">HPMOR 34-35: aftermath of the war game</h3> <p>These chapters contain a lot of speechifying, but in this case it actually fits, as a resolution to the battle game. It’s expected and isn’t overly long.</p> <p>The language, as throughout, is still horribly stilted, but I think I’m getting used to it (when Hariezer referred to Hermione as “General of Sunshine” I almost went right past it without a mental complaint). Basically, I’m likely to stop complaining about the stilted language but it’s still there, it’s always there.</p> <p>Angry at the ridiculousness of the traitors, Hermione and Draco insist that if Hariezer uses traitors in his army, they will team up and destroy him. He insists he will keep using them.</p> <p>Next, Quirrell gives a long speech about how much chaos the traitors were able to create, and makes an analogy to the Death Eaters. He insists that the only way to guard against such chaos is essentially fascism.</p> <p>Hariezer then speaks up, and says that you can do just as much damage in the hunt for traitors as traitors can do themselves, and stands up for a more open society. The themes of these speeches can be found in probably hundreds of books, but they work well enough here.</p> <p>Every army leader gets a wish; Draco and Hermione decide to wish for their houses to win the house cup. In an attempt to demonstrate his argument for truth, justice and the American way, Hariezer wishes for Quidditch to no longer contain the Snitch.
I guess nothing will rally students around an open society like one person fucking with the sport they love.</p> <p>We also find out that Dumbledore helped the tie happen by aiding in the plotting (the plot was “too complicated” for any student, according to Quirrell, so it must have been Dumbledore- apparently, ‘betray everyone to keep the score close’ is a genius master plan), but we are also introduced to a mysterious cloaked stranger who was also involved but wiped all memory of his passing.</p> <p>These are ok chapters, as HPMOR chapters go.</p> <h3 id="hariezer-s-arrogance">Hariezer's Arrogance</h3> <p>In response to some things that kai-skai has said, I started thinking about how we should view Hariezer’s arrogance. Should we view it as a character flaw? Something Hariezer will grow and overcome? I don’t think it’s being presented that way.</p> <p>My problems with the arrogance are several:</p> <p>-the author intends for Hariezer to be a teacher. He is supposed to be the master rationalist that the reader (and other characters) learn from. His arrogance makes that off-putting. If you aren’t familiar at all with the topics Hariezer happens to be discussing, you are being condescended to along with the characters in the story (although if you know the material you get to condescend to the simpletons along with Hariezer). It’s just a bad pedagogical choice. You don’t teach people by putting them on the defensive.</p> <p>-The arrogance is not presented by the author as a character flaw. In the story, it’s not a flaw to overcome, it’s part of what makes him “awesome.” His arrogance has not harmed him, he hasn’t felt the need to revisit it. When he thinks he knows better than everyone else, the story invariably proves him right. He hasn’t grown or been presented with a reason to grow. I would bet a great deal of money that Hariezer ends HPMOR exactly the same arrogant twerp he starts as.</p> <p>-This last one is a bit of a personal reaction. Hariezer gets a lot of science wrong (I think all of it is wrong, actually, up to where I am now), and is incredibly arrogant while doing so. I’ve taught a number of classes at the college level, and I’ve had a lot of confidently, arrogantly wrong students. Hariezer’s attitude and lack of knowledge repeatedly remind me of the worst students I ever had- smart kids too arrogant to learn (and these were physics classes, where wrong or right is totally objective).</p> <h3 id="hpmor-36-37-christmas-break-not-much-happens">HPMOR 36/37: Christmas break/not much happens</h3> <p>Like all the Harry Potter books, HPMOR includes a Christmas break. I note that a Christmas break would make a lot of sense toward the middle of the book, not less than a third of the way through. Like the original books, this is just a light bit of relaxation.</p> <p>Not a lot happens over break. Hariezer is a twerp who thinks his parents don’t respect him enough; they go to Hermione’s house for Christmas; Hariezer yells at Hermione’s parents for not respecting her intelligence enough; the parents say Hermione and Hariezer are like an old married couple (it would have been nice to see the little bonding moments in the earlier chapters). Quirrell visits Hariezer on Christmas Eve.</p> <h3 id="hpmor-38-a-cryptic-conversation-not-much-here">HPMOR 38: a cryptic conversation, not much here</h3> <p>This whole chapter is just a conversation between Malfoy and Hariezer.
It fits squarely into the “one party doesn’t really know what the conversation is about” mold, with Hariezer being the ignorant party. Malfoy is convinced Hariezer is working with someone other than Quirrell or Dumbledore.</p> <h3 id="hpmor-39-your-transhumanism-is-showing">HPMOR 39: your transhumanism is showing</h3> <p>This was a rough chapter, in which Hariezer and Dumbledore primarily have an argument about death. Hariezer takes up the transhumanist position. If you aren’t familiar with the transhumanist position on death, it’s basically that death is bad (duh!) and that the world is full of deathists who have convinced themselves that death is good. This usually leads into the idea that some technology will save us from death (nanotech, SENS, etc.), and even if they don’t, we can all just freeze our corpses to be reanimated when that whole death thing gets solved. I find this position somewhat childish, as I’ll try to get to.</p> <p>So, as a word of advice to future transhumanist authors who want to write literary screeds arguing against the evil deathists, FANTASY LITERATURE IS A UNIQUELY BAD CHOICE FOR ARGUING YOUR POINT. To be fair, Yudkowsky noticed this and lampshaded it: when Hariezer says there is no afterlife, Dumbledore argues back with:</p> <blockquote> <p>“How can you not believe it? ” said the Headmaster, looking completely flabbergasted. “Harry, you’re a wizard! You’ve seen ghosts! ” …And if not ghosts, then what of the Veil? What of the Resurrection Stone?”</p> </blockquote> <p>i.e. how can you not believe in an afterlife when there is a <em>literal gateway</em> to the fucking afterlife sitting in the Ministry of Magic basement. Hariezer attempts to argue his way out of this; we get this story, for instance:</p> <blockquote> <p>You know, when I got here, when I got off the train from King’s Cross…I wasn’t expecting ghosts. So when I saw them, Headmaster, I did something really dumb. I jumped to conclusions. I, I thought there was an afterlife… I thought I could meet my parents who died for me, and tell them that I’d heard about their sacrifice and that I’d begun to call them my mother and father -</p> <p>&quot;And then… asked Hermione and she said that they were just afterimages… And I should have known! I should have known without even having to ask! I shouldn’t have believed it even for all of thirty seconds!… And that was when I knew that my parents were really dead and gone forever and ever, that there wasn’t anything left of them, that I’d never get a chance to meet them and, and, and the other children thought I was crying because I was scared of ghosts</p> </blockquote> <p>So, first point- <em>this could have been a pretty powerful moment if Yudkowsky had actually structured the story to relate this WHEN HARIEZER FIRST MET A GHOST. Instead, the first we hear of it is this speech.</em> Again, tell, tell, tell, never show.</p> <p>Second point- what exactly does Hariezer assume is being “afterimaged?” Clearly some sort of personality, something not physical is surviving in the wizarding world after death. If fighting death is this important to Hariezer, why hasn’t he even attempted to study ghosts yet? (full disclosure, I am an atheist personally.
However, if I lived in a world WITH ACTUAL MAGIC, LITERAL GHOSTS, a stone that resurrects the dead, and a FUCKING GATEWAY TO THE AFTERLIFE, I might revisit that position).</p> <p>Here is Hariezer’s response to the gateway to the afterlife:</p> <blockquote> <p>That doesn’t even sound like an interesting fraud,” Harry said, his voice calmer now that there was nothing there to make him hope, or make him angry for having hopes dashed. “Someone built a stone archway, made a little black rippling surface between it that Vanished anything it touched, and enchanted it to whisper to people and hypnotize them.”</p> </blockquote> <p>Do you see how incurious Hariezer is? If someone told me there was a LITERAL GATEWAY TO THE AFTERLIFE I’d want to see it. I’d want to test it. Can we try to record and amplify the whispers? Are things being said?</p> <p>Why do they think it’s a gateway to the afterlife? Who built it? Minimally, this could have led to a chapter where Hariezer debunks wizarding spiritualists like a wizard-world Houdini. (Houdini spent a great deal of his time exposing mediums and psychics who ‘contacted the dead’ as frauds.) I’m pretty sure I would have even enjoyed a chapter like that.</p> <p>In the context of the wizarding world, there is all sorts of non-trivial evidence for an afterlife that simply doesn’t exist in the real world. It’s just a bad choice to present these ideas in the context of this story.</p> <p>Anyway, ignoring what a bad choice it is to argue against an afterlife in the context of fantasy fiction, let’s move on:</p> <p>Dumbledore presents some dumb arguments so that Hariezer can seem wise. Hariezer tells us death is the most frightening thing imaginable, it’s not good, etc. Basically, death is scary, no one should have to die. If we had all the time imaginable we would actually use it. Pretty standard stuff; Dumbledore drops the ball on presenting any real arguments.</p> <p>So I’ll take up Dumbledore’s side of the argument. I have some bad news for Hariezer’s philosophy. You are going to die. I’m going to die. Everyone is going to die. It sucks, and it’s unfortunate, sure, but there is no way around it. It’s not a choice! We aren’t CHOOSING death. Even if medicine can replace your body (which doesn’t seem likely in my lifetime), the sun will burn out some day. Even if we get away from the solar system, eventually we’ll run out of free energy in the universe.</p> <p>But you do have one choice regarding death- you can accept that you’ll die someday, or you can convince yourself there is some way out. Convince yourself that if you say the right prayers, or, in the Less Wrong case, work on the right decision theory to power an AI, you’ll get to live forever. Convince yourself that if you give a life insurance policy to the amateur biologists that run cryonics organizations, you’ll be reanimated.</p> <p>The problem with the second choice is that there is an opportunity cost- time spent praying or working on silly decision theories is time that you aren’t doing things that might matter to other humans. We accept death to be more productive in life. Stories about accepting death aren’t saying death is good; they are saying death is inevitable.</p> <p>Edit: I take back a bit about cognitive dissonance that was here.</p> <h3 id="hpmor-40-short-follow-up-to-39">HPMOR 40: short follow up to 39</h3> <p>Instead of Dumbledore’s views, in this chapter we get Quirrell’s view of death.
He agrees with Hariezer, unsurprisingly.</p> <h3 id="hpmor-41-another-round-of-quirrell-s-battle-game">HPMOR 41: another round of Quirrell's battle game</h3> <p>It seems odd that AFTER the culminating scene, the award being handed out, and the big fascist vs. freedom speechifying, we have yet another round of Quirrell’s battle game.</p> <p>Draco and Hermione are now working together against Hariezer. Through a series of circumstances, Draco has to drop Hermione off a roof to win.</p> <p>Edit: I also point out that we don’t actually get details of the battle this time; it opens with</p> <blockquote> <p>Only a single soldier now stood between them and Harry, a Slytherin boy named Samuel Clamons, whose hand was clenched white around his wand, held upward to sustain his Prismatic Wall.</p> </blockquote> <p>We then get a narrator summary of the battle that had led up to that moment. Again, tell, tell, tell, never show.</p> <h3 id="hpmor-42-is-there-a-point-to-this">HPMOR 42: is there a point to this?</h3> <p>Basically an extraneous chapter, but there is one strange detail at the end.</p> <p>So in this chapter, Hariezer is worried that it’s his fault that, in the battle last chapter, Hermione got dropped off a roof. Hermione agrees to forgive him as long as he lets Draco drop him off the same roof.</p> <p>He takes a potion to help him fall slowly and is dropped, but so many young girls try to summon him to their arms (yes, this IS what happens) that he ends up falling; luckily, Remus Lupin is there to catch him.</p> <p>Afterwards, Remus and Hariezer talk. Hariezer learns that his father was something of a bully. And, for some reason, that Peter Pettigrew and Sirius Black were lovers. Does anyone know what the point of making Pettigrew and Black lovers would be?</p> <h3 id="conversations">Conversations</h3> <p>My girlfriend: “What have you been working on over there?”</p> <p>Me: “Uhhhh… so…. there is this horrible Harry Potter fan fiction… you know, when people on the internet write more stories about Harry Potter? Yea, that. Anyway, this one is pretty terrible so I thought I’d read it and complain about it on the internet…. So I’m listening to me say this out loud and it sounds ridiculous, but.. well, it IS ridiculous… but…”</p> <h3 id="hpmor-43-46-subtle-metaphors">HPMOR 43-46: Subtle Metaphors</h3> <p>These chapters actually moved pretty decently. When Yudkowsky isn’t writing dialogue, his prose style can actually be pretty workmanlike. Nothing that would get you to stop and marvel at the wordplay, but it keeps the pace brisk and moving.</p> <p>Now, in JK Rowling’s original books, it always seemed to me that the dementors were a (not-so-subtle) nod to depression. They leave people wallowing in their worst memories, low energy, unable to remember the happy thoughts, etc.</p> <p>In HPMOR, however, Hariezer (after initially failing to summon a patronus) decides that the dementors really represent death. You see, in HPMOR, instead of reliving their saddest, most depressing memories, the characters just see a bunch of rotting corpses when the dementors get near.</p> <p>This does, of course, introduce new questions. What does it mean that the dementors guard Azkaban? Why don’t the prisoners instantly die?
Why doesn’t a dementor attack just flat-out kill you?</p> <p>Anyway, apparently the way to kill death is to just imagine that someday humans will defeat death, in appropriately Carl Sagan-esque language:</p> <blockquote> <p>The Earth, blazing blue and white with reflected sunlight as it hung in space, amid the black void and the brilliant points of light. It belonged there, within that image, because it was what gave everything else its meaning. The Earth was what made the stars significant, made them more than uncontrolled fusion reactions, because it was Earth that would someday colonize the galaxy, and fulfill the promise of the night sky.</p> <p>Would they still be plagued by Dementors, the children’s children’s children, the distant descendants of humankind as they strode from star to star? No. Of course not. The Dementors were only little nuisances, paling into nothingness in the light of that promise; not unkillable, not invincible, not even close.</p> </blockquote> <p>Once you know this, your patronus becomes a human, and kills the dementor. Get it? THE PATRONUS IS HUMANS (represented in this case by a human) and THE DEMENTOR IS DEATH. Humans defeat death. Very subtle.</p> <p>Another large block of chapters with no science.</p> <h3 id="hpmor-47-racism-is-bad">HPMOR 47: Racism is Bad</h3> <p>Nothing really objectionable here, just more conversations and plotting.</p> <p>Hariezer spends much of this chapter explaining to Draco that racism is bad, and that a lot of purebloods probably hate mudbloods because it gives them a chance to feel superior. Hariezer suggests these racist ideas are poisoning Slytherin.</p> <p>We also find out that Draco and his father seem to believe that Dumbledore burned Draco’s mother alive. This is clearly a departure from the original books. Hariezer agrees to take as an enemy whoever killed Draco’s mother. Feels like it’ll end up being more plots-within-plots stuff.</p> <p>Another chapter with no science explored. We do find out Hariezer speaks snake language.</p> <h3 id="hpmor-48-utilitarianism">HPMOR 48: Utilitarianism</h3> <p>This chapter is actually solid as far as these things go. After learning he can talk to snakes, Hariezer begins to wonder if all animals are sentient- after all, snakes can talk. This has obvious implications for meat eating.</p> <p>From there, he begins to wonder if plants might be sentient, in which case he wouldn’t be able to eat anything at all. This leads him to the library for research.</p> <p>He also introduces scope insensitivity and utilitarianism, even though neither is really required to explain his point to Hermione. Hermione asks why he is freaking out, and instead of answering “I don’t want to eat anything that thinks and talks,” he says stuff like</p> <blockquote> <p>&quot;Look, it’s a question of multiplication, okay? There’s a lot of plants in the world, if they’re not sentient then they’re not important, but if plants are people then they’ve got more moral weight than all the human beings in the world put together. Now, of course your brain doesn’t realize that on an intuitive level, but that’s because the brain can’t multiply. Like if you ask three separate groups of Canadian households how much they’ll pay to save two thousand, twenty thousand, or two hundred thousand birds from dying in oil ponds, the three groups will respectively state that they’re willing to pay seventy-eight, eighty-eight, and eighty dollars. No difference, in other words.
It’s called scope insensitivity.</p> </blockquote> <p>Is that really the best way to describe his thinking? Why say something in 10 words when several hundred will do? What does scope insensitivity have to do with the idea “I don’t want to eat things that talk and think?”</p> <p><strong>Everything below here is unrelated to HPMOR and has more to do with scope insensitivity as a concept:</strong></p> <p>Now, because I have taught undergraduates intro physics, I do wonder (and have in the past)- is Kahneman’s scope insensitivity related to the general innumeracy of most people? i.e. how many people who hear that question just mentally replace literally any number with “a big number”?</p> <p>The first time I taught undergraduates I was surprised to learn that most of the students had no ability to judge if their answers seemed plausible. I began adding a question “does this answer seem order of magnitude correct?” I’d also take off more points for answers that were the wrong order of magnitude, unless the student put a note saying something like “I know this is way too big, but I can’t find my mistake.”</p> <p>You could ask a question about a guy throwing a football, and answers would range from 1 meter/second all the way to 5000 meters/second. You could ask a question about how far someone can hit a baseball and answers would similarly range from a few meters to a few kilometers. No one would notice when answers were wildly wrong. Lest someone think this is a units problem (Americans aren’t used to metric units), even if I forced them to convert to miles per hour, miles, or feet, students couldn’t figure out if the numbers were the right order of magnitude.</p> <p>So I began to give a few short talks on what I thought of as basic numeracy. Create mental yardsticks (the distance from your apartment to campus might be around a few miles, the distance between this shitty college town and the nearest actual city might be around a few hundred miles, etc.). When you encounter unfamiliar problems, try to relate them back to familiar ones. Scale the parameters in equations so you have dimensionless quantities * yardsticks you understand. And after being explicitly taught, most of the students got better at understanding the size of numbers.</p> <p>Since I began working in the business world I’ve noticed that most people never develop that skill. Stick a number in a sentence and people just mentally run right over it; you might as well have inserted some Klingon phrases. Some of the better actuaries do have some nice numerical intuition, but a surprising number don’t. They can calculate, but they don’t understand what the calculations are really telling them, like Searle’s Chinese room but with numbers.</p> <p>In Kahneman’s scope neglect questions, there are big problems with innumeracy- if you ask people how much they’d spend on X where X is any charity that seems importantish, you are likely to get an answer of around $100. In some sense, it is scope neglect; in another sense, you just max out people’s generosity/spending cash really quickly.</p> <p>If you rephrase it to “how much should the government spend” you hit general innumeracy problems, and you also hit general innumeracy problems when you specify large, specific numbers of birds.</p> <p>I suspect Kahneman would have gotten different results had he asked varying questions such as: “what percentage of the federal government’s wildlife budget should be spent preventing disease for birds in your city?” vs.
&quot;what percentage of the federal government’s wildlife budget should be spent preventing disease for birds in your state?&quot; vs. &quot;what percentage of the federal government’s wildlife budget should be spent preventing disease for birds in the whole country?&quot; (I actually ran this experiment on a convenience sample of students in a 300-level physics class several years ago and got 5%, 8%, and 10% respectively, but the differences weren’t significant, though the trend was suggestive.)</p> <p>I suspect the problem isn’t that “brains can’t multiply” so much as “most people are never taught how to think about numbers.”</p> <p>If anyone knows of further literature on this, feel free to pass it my way.</p> <h3 id="hpmor-49-not-much-here">HPMOR 49: not much here</h3> <p>I thought I posted something about this last weekend, but I think tumblr ate it. So this will be particularly light. Hariezer notices that Quirrell <em>knows too much</em> (phrased as “his priors are too good”) but hasn’t yet put together what that implies about Quirrell.</p> <p>There is also (credit where credit is due) a clever working in of the second book into Yudkowsky’s world. The “Interdict of Merlin” Yudkowsky invented prevents wizards from writing spells down, so Slytherin’s basilisk was placed in Hogwarts to pass spells on to “the heir of Slytherin.” Voldemort learned those secrets and then killed the basilisk, so Hariezer has no shortcut to powerful spells.</p> <h3 id="hpmor-50-in-need-of-an-editor">HPMOR 50: In Need of an Editor</h3> <p>So this is basically a complete rehash again- it fits into the “Hariezer uses the time turner and the invisibility cloak to solve bullying” mold we’ve already seen a few times. The time turner + invisibility cloak is the solution to all problems, and when Yudkowsky needs a conflict, he throws in bullying. I think we’ve seen this exact conflict with this exact solution at least three other times.</p> <p>In this chapter, it’s Hermione being bullied; he protects her by creating an alibi with his time turner, dressing in an invisibility cloak, and whispering some wisdom in the bully’s ear. Because most bullies just need the wisdom of an eleven-year-old whispered into their ear.</p> <h3 id="hpmor-51-63-economy-of-language">HPMOR 51-63: Economy of Language</h3> <p>So this block of chapters is roughly the length of The Maltese Falcon or the first Harry Potter book, probably 2/3 the length of The Hobbit. This one relatively straightforward episode of this story is the length of The Maltese Falcon. Basically, the ratio of things-happening/words-written is terrible.</p> <p>This block amounts to a prison break- Quirrell tells Hariezer that Bellatrix Black was innocent, so they are going to break her out. It’s a weird section, given how Black escaped in the original novels (i.e. the dementors sided with the dark lord, so all he had to do was go to the dementors and say “you are on my side, please let out Bellatrix Black, and everyone else while you are at it.”)</p> <p>The plan is to have Hariezer use his patronus while Quirrell travels in snake form in his pouch. They’ll replace Bellatrix with a corpse, so everyone will just think she is dead. It becomes incredibly clear upon meeting Bellatrix that she wasn’t “innocent” at all, though she might be not guilty in the by-reason-of-insanity sense.</p> <p>This doesn’t faze Hariezer; they just keep moving forward with the plan, which goes awry pretty quickly when an auror stumbles on them. Quirrell tries to kill the auror.
Hariezer tries to block the killing spell and ends up knocking out Quirrell and turning his patronus off, and the plan goes to hell.</p> <p>To escape, Hariezer first scares the dementors off by threatening to blast them with his uber-patronus (even death is apparently scared of death in this story). Then Quirrell wakes up, and with Quirrell’s help he transfigures a hole in the wall, and transfigures a rocket, which he straps to his broomstick, and out they fly. The rocket goes so fast the aurors can’t keep up.</p> <p>It’s a decent bit of action in a story desperately in need of action, but it’s marred by excessive verbosity. We have huge expanses of Hariezer talking with Quirrell, Hariezer talking to himself, Hariezer thinking about dementors, etc. Instead of a tense, taut 50 pages we get a turgid 300.</p> <p>After they get to safety, Quirrell and Hariezer discuss the horror that is Azkaban. Quirrell tells Hariezer that only a democracy could produce such a torturous prison. A dark lord like Voldemort would have no use for it once he got bored:</p> <blockquote> <p>You know, Mr. Potter, if He-Who-Must-Not-Be-Named had come to rule over magical Britain, and built such a place as Azkaban, he would have built it because he enjoyed seeing his enemies suffer. And if instead he began to find their suffering distasteful, why, he would order Azkaban torn down the next day.</p> </blockquote> <p>Hariezer doesn’t take up the pro-democracy side, and only time will tell if he goes full-on reactionary like Quirrell by the end of our story. By the end of the chapter, Hariezer is ruminating on the Milgram experiment, although I don’t think it’s really applicable to the horror of Azkaban (it’s not like the dementors are “just following orders”- they live to kill).</p> <p>Hariezer then uses his time turner to go back to right before the prison breakout, the perfect alibi for the perfect crime.</p> <p>Dumbledore and McGonagall suspect Hariezer played a part in the escape, because of the use of the rocket. They ask Hariezer to use his time turner to send a message back in time (which he wouldn’t be able to do if he had already used his turner to hide his crime).</p> <p>Hariezer solves this through the time-turner-ex-machina of Quirrell knowing someone else with a time turner, because when Yudkowsky can’t solve a problem with a time turner, he solves it with two time turners.</p> <h3 id="hpmor-64-65-respite">HPMOR 64/65: respite</h3> <p>Chapter 64 is again “omake” so I didn’t read it.</p> <p>Chapter 65 appears to be a pit-stop before another long block of chapters. Hariezer is chafing that he has been confined to Hogwarts in order to protect him from the dark lord, so he and Quirrell are thinking of hiring a play-actor to pretend to be Voldemort, so that Quirrell can vanquish him.</p> <p>These were a brief respite between the huge 12-chapter block I just got through and <em>another giant 12-chapter block</em>. It’s looking like the science ideas are slowing down in these long chapter blocks, as the focus shifts to action. youzicha has suggested a lot of the rest will be Hariezer cleverly “hacking” his way out of situations, like the rocket in the previous 12-chapter block.
The sweet spot for me has been discussing the science presented in these chapters, so between the expected lack of science and the increasing length of chapter blocks, expect slower updates.</p> <h3 id="hpmor-66-77-absolutely-appallingly-awful">HPMOR 66-77: absolutely, appallingly awful</h3> <p>There is a general problem with fanfiction (although usually not in serial fiction, where things tend to stay a bit more focused for whatever reason), where the side/B-plots are written entirely in one pass instead of intertwined alongside the main plot. Instead of being a pleasant diversion, the side-plot piles up in one big chunk. This is one such side-plot.</p> <p>Also worth noting these chapters combine basically everything I dislike about HPMOR into one book-length bundle of horror. It was honest-to-god work to continue to power through this section. So this will be just a sketch of this awful block of chapters.</p> <p>We open with another superfluous round of the army game, in which nothing notable really happens other than some character named Daphne challenging Neville to “a most ancient duel” WHICH IS APPARENTLY A BATTLE WITH LIGHTSABERS. My eyes rolled so hard I almost had a migraine, and this was the first chapter of the block.</p> <p>After the battle, Hermione becomes concerned that women are underrepresented among heroes of the wizarding world, and starts a “Society for the Promotion of Heroic Equality for Witches” or SPHEW. They start with a protest in front of Dumbledore’s office and then decide to heroine it up and put an end to bullying. You see, in the HPMOR world, bullying isn’t a question of social dynamics, or ostracizing kids. Bullying is coordinated ambushes of kids in hallways by groups of older kids, and an opportunity for “leveling-up.” The way to fight bullies in this strange world is to engage in pitched wizard-battles in the hallways (having fought an actual bully in reality as a middle schooler, I can tell you that at least for me “fight back” doesn’t really solve the problem in any way). In this world, the victims of the bullying are barely mentioned and don’t have names.</p> <p>And of course, <em>the authority figures like McGonagall don’t even really show up during all of this</em>. Students are constantly attacking each other in the hallways and no one is doing anything about it. Because the way to make your characters seem “rational” is to make sure the entire world is insane.</p> <p>Things quickly escalate until 44 bullies get together to ambush the eight girls in SPHEW. A back-of-the-envelope calculation suggests Hogwarts has maybe 300 students. So we are to expect that nearly 15% of the student population are the sort of “get together and plot an ambush” bullies that maybe you find in 90s high school TV shows. Luckily, Hariezer had asked Quirrell to protect the girls, so a disguised Quirrell takes down the 44 bullies.</p> <p>We get a “lesson” (lesson in this context means ‘series of insanely terrible ideas’) on “heroic responsibility” in the form of Hariezer lecturing Hermione.</p> <blockquote> <p>The boy didn’t blink. “You could call it heroic responsibility, maybe,” Harry Potter said. “Not like the usual sort. It means that whatever happens, no matter what, it’s always your fault… Following the school rules isn’t an excuse, someone else being in charge isn’t an excuse, even trying your best isn’t an excuse.
There just aren’t any excuses, you’ve got to get the job done no matter what.”… Being a heroine means your job isn’t finished until you’ve done whatever it takes to protect the other girls, permanently.”</p> </blockquote> <p>You know a good way to solve bullying? Expel the bullies. You know who has the power to do that? McGonagall and Dumbledore. A school is a system and has procedures in place to deal with problems. The proper response is almost always “tell an authority figure you trust.” Being “rational” is knowing when to trust the system to do its job.</p> <p>In this case, Yudkowsky hasn’t even pulled his usual trick of writing the system as failing- no one even attempts to tell an authority figure about the bullying and no authority figure engages with it, besides Quirrell who engages by disguising himself and attacking students, and Snape who secretly (unknown even to SPHEW) directs SPHEW to where the bullies will be. The system of school discipline stops existing for this entire series of chapters.</p> <p>We get a final denouement between Hariezer and Dumbledore where the bullying situation is discussed by reference to Gandhi’s passive resistance in India, WW2 and Churchill, and the larger wizarding war, all of which feels largely overwrought because <em>it was bullying</em>. Big speeches about how Hermione has been put in danger, etc. ring empty because <em>it was bullying</em>. Yes, being bullied is traumatic (sometimes life-long traumatic), but it’s not WORLD WAR traumatic.</p> <p>I also can’t help but note the irony that the block’s action largely started with Hermione’s attempt to “self-actualize” by behaving more heroically, and ends with Dumbledore and Hariezer discussing whether it was the right thing to do to <em>let Hermione play her silly little game</em>.</p> <p>Terrible things in HPMOR:</p> <ul> <li>Lack of science: I have no wrong science to complain about, because these chapters have no science references at all really.</li> <li>the world/characters behave in silly ways as a foil for the characters: the authority figures don’t do anything to prevent the escalating bullying/conflict, aside from Snape and Quirrell who get actively involved. The bullying itself isn’t an actual social dynamic, it’s just general “conflict” to throw at the characters.</li> <li>Time turner/invisibility cloak solves all problems: in a slight twist, Snape gives a time turner to a random student and uses her to pass messages to SPHEW so they can find and attack bullies.</li> <li>Superfluous retreads of previous chapters: the army battle that starts it off; much of the bullying is a retread. There are several separate bully-fights in this block of chapters.</li> <li>Horrible pacing: this whole block of chapters is a B-plot roughly the length of an entire book.</li> <li>Stilted language: everyone refers to the magic lightsabers as “the most ancient blade” every time they reference them.</li> </ul> <h3 id="munchkinism">Munchkinism</h3> <p>I’ve been meaning to make a post like this for several weeks, since yxoque reminded me of the idea of the munchkin. jadagul mentioned me in a post today that reminded me I had never made it. Anyway:</p> <p>I grew up playing Dungeons and Dragons, which was always an extremely fun way to waste a middle school afternoon. The beauty of Dungeons and Dragons is that it provides structure for a group of kids to sit around and tell a shared story as a group.
The rules of the game are flexible, and one of the players acts as a living rule-interpreter to guide the action and keep the story flowing.</p> <p>Somehow, every Dungeons and Dragons community I’ve ever been part of (middle school, high school and college) had the same word for a particularly common failure mode of the game, and that word was munchkin, or munchkining (does anyone know if there was a gaming magazine that used this phrase?). The failure is simple - people get wrapped up in the letter of the rules, instead of the spirit, and start building the most powerful character possible instead of a character that makes sense as a role. Instead of story flow, the game gets bogged down in dice rolls and checks so that the munchkins can demonstrate how powerful they are. Particularly egregious munchkins have been known to cheat on their character creation rolls to boost all their abilities. With one particular group in high school, I witnessed one particularly hot-headed munchkin yell at everyone else playing the game when the dungeon master (the human rule interpreter) slightly modified a rule and ended up weakening the munchkin’s character.</p> <p>The frustrating thing about HPMOR is that Hariezer is designed, as yxoque pointed out, to be a munchkin- using science to exploit the rules of the magical world (which could be an interesting question), but because Yudkowsky is writing the rules of magic as he goes, Hariezer is essentially cheating at a game he is making up on the fly.</p> <p>All of the cleverness isn’t really cleverness- it’s easy to find loopholes in the rules you yourself create as you go, especially if you created them to have giant loopholes.</p> <p>In Azkaban, Hariezer uses science to escape by transfiguring himself a rocket. This only makes sense because for some unknown reason magic brooms aren’t as fast as rockets.</p> <p>In one of his army games, Hariezer uses gloves with gecko setae to climb down a wall, because for some reason broomsticks aren’t allowed. For some reason, there is no ‘grip a wall’ spell.</p> <p>Yudkowsky isn’t bound by the handful of constraints in Rowling’s world (where Dementors represent depression, not death); hell, he doesn’t even stick to his own constraints. In Hariezer’s escape from Azkaban, he violates literally the only constraint he had laid down (don’t transfigure objects into something you plan to burn).</p> <p>Every other problem in the story is solved by using the time turner as a deus ex machina. Even when plot constraints mean Hariezer’s time turner can’t be used, Yudkowsky just introduces another time turner rather than come up with a novel and clever solution for his characters.</p> <p>Hariezer’s plans in HPMOR work only because the other characters become temporarily dumb to accommodate his “rationality” and because the magic is written around the idea of him succeeding.</p> <h3 id="genre-savvy">&quot;Genre savvy&quot;</h3> <p>So a lot of people have asked me to take a look at the Yudkowsky writing guide, and I will eventually (first I have to finish HPMOR, which is taking forever because I’m incredibly bored with it, but I HAVE MADE A COMMITMENT- hopefully more HPMOR live blogging after Thanksgiving).</p> <p>But I did hit something that also applies to HPMOR, and a lot of other stories. Yudkowsky advocated that characters “have read the books you’ve read” so they can solve those problems. One of my anonymous askers used the phrase “genre savvy” for this- and Google led me to the TV Tropes page.
The problem with this idea is that as soon as you insert a genre savvy character, your themes shift, much like having a character break the fourth wall. Suddenly your story is about stories. Your story is now a commentary on the genre/genre conventions.</p> <p>Now, there are places where this can work fairly well- those Scream movies, for instance, were supposed to be (at least in part) ABOUT horror movies as much as they WERE horror movies. Similarly, every fan-fiction is (on some level) a commentary on the original works, so “genre savvy” fan fiction self-inserts aren’t nearly as bad an idea as they could be.</p> <p>HOWEVER (and this is really important)- MOST STORIES SHOULD NOT BE ABOUT STORIES IN THE ABSTRACT/GENRE/GENRE CONVENTIONS, and this means it is a terrible idea having characters that constantly approach things on a meta level (“this is like in this fiction book I read”). If you don’t have anything interesting to say about the actual genre conventions, then adding a genre savvy character is almost certainly going to do you more harm than good. If you are bored with a genre convention, you’ll almost certainly get more leverage out of subverting it (if you lead both the character AND the reader to expect a zig, and instead they get a zag, it can liven things up a bit) than by sticking in a genre-savvy character.</p> <p>Sticking in a genre-savvy character just says “look at this silly convention!” and then when that convention is used anyway, it just feels like the writer being a lazy hipster. Sure, your reader might get a brief burst of smugness (“he/she’s right, all those genre books ARE stupid! Look how smart I am!”), but you aren’t really moving your story forward. You are critiquing lazy conventions while also trying to use them.</p> <p>If you don’t like the conventions of a genre, don’t write in that genre, or subvert them to make things more interesting. Or simply refuse to use those conventions altogether; go your own way.</p> <h3 id="hpmor-78-action-without-consequences">HPMOR 78: action without consequences</h3> <p>A change of tactics- this chapter is part of another block of chapters, but I’m having trouble getting through it, so I’m going to write in installments chapter by chapter, instead of a dump on a 12-chapter block again.</p> <p>This chapter is another installment of Quirrell’s battle game. This time, the parents are in the stands, which becomes important when Hermione out-magics Draco.</p> <p>Afterwards, Draco is upset because his father saw him getting out-magicked by a mudblood. This causes Draco, in an effort to save face or get revenge or something, to send a note to lure Hermione to meet him alone. Then, cut to the next morning- Hermione is arrested for the attempted murder of Draco. So that’s it for the chapter summary.</p> <p>But I want to use this chapter to touch on something that has bothered me about this story- most of the action is totally without stakes or consequences for the characters. As readers we don’t care what happens. In the case of the Quirrell battle game, the prize for victory was already handed out at the Christmas break, none of the characters have anything on the line, and the story doesn’t really act like winning or losing has real consequences for anyone involved. A lot is happening, but it’s ultimately boring.</p> <p>The same thing happened in the anti-bullying chapters. Most of the characters being victimized lack names or personalities.
Hermione and team aren’t defending characters we care about and like; they are fighting the abstract concept of bullying (and the same is largely true of Hariezer’s forays into fighting bullies).</p> <p>Part of this is because of the obvious homage to Ender’s Game, without understanding that Ender’s Game was doing something very different- the whole point of Ender’s Game is that the series of games absolutely do feel low-stakes. Even when Ender kills another kid, it’s largely shrugged off as Ender continuing to win (which is the first sign something a bit deeper is happening). It’s supposed to feel game-y so the reader rides along with Ender and doesn’t viscerally notice the genocide happening. The contrast between the real-world stakes and the games being played is the point of the story. Where Ender’s Game failed for me is after the battles- we don’t feel Ender’s horror at learning what happened. Sure, Ender becomes Speaker for the Dead, but the book doesn’t make us feel Ender’s horror the same way we ride along with the game stuff. I think this is why so many people I know largely missed the point of the book and walked away with “War games are awesome!” (SCROLL DOWN FOR Fight Club FOOTNOTE THAT WAS MAKING THIS PARAGRAPH TOO LONG) But I digress- if your theme isn’t something to do with the connection between war and games and the way people perceive violence vs. games, etc., turning down the emotional stakes and the consequences for the characters makes your story feel like reading a video game play-by-play, which is horribly boring.</p> <p>If you cut out all the Quirrell game chapters after chapter 35, no one would notice- there is nothing at stake.</p> <p>ALSO- this chapter has an example of what I’ll call “DM munchkining,” i.e. it’s easy to munchkin when you write the rules. Hariezer is looking for powerful magic to aid him in battle, and starts reading up on potion making. He needs a way to make potions in the woods without magical ingredients, so he deduces by reading books that you don’t really need a magical ingredient; you get out of a potion ingredient what went into making it. So Hariezer makes a potion with acorns that gets back all the light that went into creating the acorn via photosynthesis. My point here is that this rule was created in this chapter entirely to be exploited by Hariezer in this battle. In a previous battle chapter, Hariezer exploits the fact that metal armor can block spells, a rule created specifically for that chapter to be exploited. <em>It’s not munchkining, it’s Calvinball.</em></p> <p>FOOTNOTE: This same problem happens with Fight Club. The tone of the movie builds up Tyler Durden as this awesome dude and the tone doesn’t shift when Ed Norton’s narrator character starts to realize how fucked everything is. So you end up with this movie that’s supposed to be satirical but no one notices. They rebel against a society they find dehumanizing, BY CREATING A SOCIETY WHERE THEY LITERALLY HAVE NO NAMES, but the tone is strong enough that people are like “GO PROJECT MAYHEM! WE SHOULD START A FIGHT CLUB!”</p> <h3 id="hpmor-79">HPMOR 79</h3> <p>This chapter continues on from 78. Hermione has been arrested for murder, but Hariezer now realizes in a sudden insight that she has been given a false memory.</p> <p>Hariezer also realizes this is how the Weasley twins planted Rita Skeeter’s false news story- they simply memory charmed Rita.
Of course, this opens up more questions than it solves- if false memory charming can be done with such precision, wouldn’t there be a rash of manipulations of this type? It’s such an obvious manipulation technique that chapters 24-26, the Fred and George “caper,” were written in a weirdly non-linear style just to make it seem more mysterious.</p> <p>Anyway, Hariezer tells the adults, who start investigating who might have memory charmed Hermione (you’d think wizard police would do some sort of investigation, but it’s HPMOR, so the world needs to be maximally silly as a foil to Hariezer).</p> <p>And then he has a discussion with the other kids who are bad-mouthing Hermione:</p> <blockquote> <p>Professor Quirrell isn’t here to explain to me how stupid people are, but I bet this time I can get it on my own. People do something dumb and get caught and are given Veritaserum. Not romantic master criminals, because they wouldn’t get caught, they would have learned Occlumency. Sad, pathetic, incompetent criminals get caught, and confess under Veritaserum, and they’re desperate to stay out of Azkaban so they say they were False-Memory-Charmed. Right? So your brain, by sheer Pavlovian association, links the idea of False Memory Charms to pathetic criminals with unbelievable excuses. You don’t have to consider the specific details, your brain just pattern-matches the hypothesis into a bucket of things you don’t believe, and you’re done. Just like my father thought that magical hypotheses could never be believed, because he’d heard so many stupid people talking about magic. Believing a hypothesis that involves False Memory Charms is low-status.</p> </blockquote> <p>This sort of thing bothers the hell out of me. Not only is cloying elitism creeping in, but in HPMOR as in the real world, arguments regarding “status” are just thinly disguised ad-hominems. True or not true, they aren’t really attacking an argument, just the people making them.</p> <p>After all, if we fall back on the “Bayesian conspiracy,” confessing to a crime/having a memory of the crime is (roughly) equal evidence for having done the crime and for having been false memory charmed, so all the action here is in the prior. CLAIMING a false memory charm is evidence of nothing at all (there’s a quick sketch of this arithmetic at the end of this section).</p> <p>So, if the base rate of false memory charms is so low that it’s laughable and “low status,” then the students are correctly using Bayesian reasoning.</p> <p>Hariezer might point out that they aren’t taking into account evidence about what sort of person Hermione is, but if the base rate of false memory charms is really so low, that is unlikely to matter much- after all, Hariezer doesn’t have any specific positive evidence she was false memory charmed, and she has been behaving strangely toward Draco for a while (which Hariezer suggests is a symptom of the way the perpetrator went about the false memory charm, but could just as easily be evidence she did it- the action is still in the prior).</p> <p>Similarly, his father didn’t believe in magic because it SHOULDN’T have been believed- until the story begins he has supposedly lived his whole life in our world- where magic is quite obviously not a real thing, regardless of “status.”</p> <p>OF COURSE- if the world were written as a non-silly place, the base rate for false memory charms would be through the roof and everyone would say “yea, she was probably false memory charmed! Who just blurts out a confession?” and the wizard cops would just do their job.</p>
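<p>(To make the “all the action is in the prior” arithmetic concrete, here is a minimal sketch in Python- the numbers and variable names are mine, not the story’s. The only point is that when the evidence is about equally likely under both hypotheses, the likelihood ratio is ~1 and the posterior odds are just whatever the prior odds were.)</p> <pre><code># Minimal sketch of the odds-form Bayes argument above (illustrative numbers only).

def posterior_odds(prior_odds, p_evidence_if_guilty, p_evidence_if_charmed):
    """Odds-form Bayes: posterior odds = prior odds * likelihood ratio."""
    likelihood_ratio = p_evidence_if_guilty / p_evidence_if_charmed
    return prior_odds * likelihood_ratio

# Prior odds of "actually guilty" vs "genuinely false-memory-charmed", assuming
# (as the students implicitly do) that real charmings are very rare: 1000 to 1.
prior = 1000.0

# The observed evidence -- a vivid memory of the crime plus a claim of having
# been charmed -- is about equally likely under either hypothesis.
print(posterior_odds(prior, 0.9, 0.9))  # 1000.0: the claim itself moved nothing
</code></pre> <p>Crank the prior- the base rate of genuine charmings- way up, as in the non-silly world described just above, and the same two lines of arithmetic point the other way.</p>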
<h3 id="remember-when">Remember when</h3> <p>Way back in chapter 20 something, Quirrell gave Hariezer Roger Bacon’s magic diary, and it was going to jump-start his investigation of the rules of magic? And then it was literally never mentioned again? The aptly named Chekhov’s Roger Bacon’s Magi-science Diary probably applies here.</p> <h3 id="hpmor-80">HPMOR 80</h3> <p>Apparently in the wizarding world, the way a trial is conducted involves a bunch of politicians voting on whether someone is guilty or innocent, so in this chapter the elder Malfoy uses his influence to convict Hermione. Not much to this chapter, really.</p> <p>BUT in some asides, we do get some flirting with neoreaction:</p> <blockquote> <p>At that podium stands an old man, with care-lined face and a silver beard that stretches down below his waist; this is Albus Percival Wulfric Brian Dumbledore… Karen Dutton bequeathed the Line to Albus Dumbledore… each wizard passing it to their chosen successor, back and back in unbroken chain to the day Merlin laid down his life. That (if you were wondering) is how the country of magical Britain managed to elect Cornelius Fudge for its Minister, and yet end up with Albus Dumbledore for its Chief Warlock. Not by law (for written law can be rewritten) but by most ancient tradition, the Wizengamot does not choose who shall preside over its follies. Since the day of Merlin’s sacrifice, the most important duty of any Chief Warlock has been to exercise the highest caution in their choice of people who are both good and able to discern good successors.</p> </blockquote> <p>And we get the PC/NPC distinction used by Hariezer to separate himself from the sheeple:</p> <blockquote> <p>The wealthy elites of magical Britain have collective force, but not individual agency; their goals are too alien and trivial for them to have personal roles in the tale. As of now, this present time, the boy neither likes nor dislikes the plum-colored robes, because his brain does not assign them enough agenthood to be the subjects of moral judgment. He is a PC, and they are wallpaper.</p> </blockquote> <p>Hermione is convicted and Hariezer is sad he couldn’t figure out something to do about it (he did try to threaten the elder Malfoy to no avail).</p> <h3 id="hpmor-81">HPMOR 81</h3> <p>Our last chapter ended with Hermione in peril- she was found guilty of the attempted murder of Draco! How will Hariezer get around this one?</p> <p>Luckily, the way the wizard world’s justice system works is fucking insane- being found guilty puts Hermione in the Malfoys’ “blood debt.” So Hariezer tells Malfoy:</p> <blockquote> <p>By the debt owed from House Malfoy to House Potter!…I’m surprised you’ve forgotten…surely it was a cruel and painful period of your life, laboring under the Imperius curse of He-Who-Must-Not-Be-Named, until you were freed of it by the efforts of House Potter. By my mother, Lily Potter, who died for it, and by my father, James Potter, who died for it, and by me, of course.</p> </blockquote> <p>So Hariezer wants the blood debt transferred to him so he can decide Hermione’s fate (what a convenient and ridiculous way to handle a system of law and order).</p> <p>But blood debts don’t transfer in this stupid world- instead, you also have to pay money. So Malfoy demands something like twice the money in Hariezer’s vault. Hariezer waffles a bit, but decides to pay. 
Because the demand is such a large sum, this will involve going into debt to the Malfoys.</p> <p>And then things get really stupid- Dumbledore says, as guardian of Hariezer’s vault, he won’t let the transaction happen.</p> <blockquote> <p>I’m - sorry, Harry - but this choice is not yours - for I am still the guardian of your vault.”</p> <p>“What? ” said Harry, too shocked to compose his reply.</p> <p>&quot;I cannot let you go into debt to Lucius Malfoy, Harry! I cannot! You do not know - you do not realize -&quot;</p> </blockquote> <p>So… here is a question- if Hariezer is going to go into a lot of debt to pay Malfoy, how does blocking his access to his money help avoid the debt? Wouldn’t Hariezer just take out a bigger loan from Malfoy?</p> <p>Anyway, despite super rationality, Hariezer doesn’t think through how stupid Dumbledore’s threat is. Hariezer instead threatens to destroy Azkaban if Dumbledore won’t let him pay Malfoy, so Dumbledore relents.</p> <p>Malfoy tries to weasel out of this nebulous blood debt arrangement because the rules of wizard justice change on the fly, but Hermione swears allegiance to House Potter and that prevents Malfoy’s weasel attempt.</p> <blockquote> <p>I acknowledge the debt, but the law does not strictly oblige me to accept it in cancellation,” said Lord Malfoy with a grim smile. “The girl is no part of House Potter; the debt I owe House Potter is no debt to her…</p> <p>And Hermione, without waiting for any further instructions, said, the words spilling out of her in a rush, “I swear service to the House of Potter, to obey its Master or Mistress, and stand at their right hand, and fight at their command, and follow where they go, until the day I die.”</p> </blockquote> <p>The implications here are obvious- if you saved all of magical Britain from a dark lord, and literally everyone owes you a “blood debt,” you are totally above the law. Hariezer should just steal the money he owes Malfoy from some other magical families.</p> <h3 id="hpmor-82">HPMOR 82</h3> <p>So the trial is wrapped up, but to finish off the section we get a long discussion between Dumbledore and Hariezer.</p> <p>First, credit where credit is due: there is an atypical subversion here- now it’s Dumbledore attempting to give a rationality lesson to Hariezer, and Hariezer agrees that he is right. It’s an attempt to mix up the formula a bit, and I appreciate it even if the rest of this chapter is profoundly stupid.</p> <p>So what is the rationality lesson here?</p> <blockquote> <p>“Yes,&quot; Harry said, &quot;I flinched away from the pain of losing all the money in my vault. But I did it! That’s what counts! And you -” The indignation that had faltered out of Harry’s voice returned. “You actually put a price on Hermione Granger’s life, and you put it below a hundred thousand Galleons!”</p> <p>&quot;Oh?&quot; the old wizard said softly. &quot;And what price do you put on her life, then? A million Galleons?&quot;</p> <p>&quot;Are you familiar with the economic concept of ‘replacement value’?&quot; The words were spilling from Harry’s lips almost faster than he could consider them. &quot;Hermione’s replacement value is infinite! There’s nowhere I can go to buy another one!”</p> <p>Now you’re just talking mathematical nonsense, said Slytherin. Ravenclaw, back me up here?</p> <p>&quot;Is Minerva’s life also of infinite worth?&quot; the old wizard said harshly. &quot;Would you sacrifice Minerva to save Hermione?&quot;</p> <p>&quot;Yes and yes,&quot; Harry snapped. 
&quot;That’s part of Professor McGonagall’s job and she knows it.&quot;</p> <p>&quot;Then Minerva’s value is not infinite,&quot; said the old wizard, &quot;for all that she is loved. There can only be one king upon a chessboard, Harry Potter, only one piece that you will sacrifice any other piece to save. And Hermione Granger is not that piece. Make no mistake, Harry Potter, this day you may well have lost your war.&quot;</p> </blockquote> <p>Basically, the lesson is this- you have to be willing to put a value on human life, even if it seems profane. It’s actually a good lesson and very important to learn. If everyone were more familiar with this, the semi-frequent GOVERNMENT HEALTHCARE IS DEATH PANELS panic would never happen. Although I’d add a caveat- anyone who has worked in healthcare does this so often that we start to make a mistake the other way (forgetting that underneath the numbers are actual people).</p> <p>Anyway, to justify the rationality lesson further we get a reference to some of Tetlock’s work (note: I’m unfamiliar with the work cited here, so I’m taking Yudkowsky at his word- if you’ve read the rest of my HPMOR stuff, you know this is dangerous).</p> <blockquote> <p>You’d already read about Philip Tetlock’s experiments on people asked to trade off a sacred value against a secular one, like a hospital administrator who has to choose between spending a million dollars on a liver to save a five-year-old, and spending the million dollars to buy other hospital equipment or pay physician salaries. And the subjects in the experiment became indignant and wanted to punish the hospital administrator for even thinking about the choice. Do you remember reading about that, Harry Potter? Do you remember thinking how very stupid that was, since if hospital equipment and doctor salaries didn’t also save lives, there would be no point in having hospitals or doctors? Should the hospital administrator have paid a billion pounds for that liver, even if it meant the hospital going bankrupt the next day?</p> </blockquote> <p>To bring it home, we find out that Voldemort captured Dumbledore’s brother and demanded ransom, and Mad-Eye counseled thusly:</p> <blockquote> <p>&quot;You ransom Aberforth, you lose the war,&quot; the man said sharply. &quot;That simple. One hundred thousand Galleons is nearly all we’ve got in the war-chest, and if you use it like this, it won’t be refilled. What’ll you do, try to convince the Potters to empty their vault like the Longbottoms already did? Voldie’s just going to kidnap someone else and make another demand. Alice, Minerva, anyone you care about, they’ll all be targets if you pay off the Death Eaters. That’s not the lesson you should be trying to teach them.&quot;</p> </blockquote> <p>So instead of ransoming Aberforth, he burned Lucius Malfoy’s wife alive (or at least convinced the Death Eaters that he did). That way they would think twice about targeting him.</p> <p>I think the rationality lesson is fine and dandy, just one problem- this situation is not at all like the hospital administrator in the example given. The problem here is that the idea of putting a price on a human life is only a useful concept in day-to-day reality where money has some real meaning. 
In an actual war, even one against a sort of weird guerrilla army of dark wizards money only becomes useful if you can exchange it for more resources, and in the wizard war resources means wizards.</p> <p>Ask yourself this- would a death eater target someone close to Dumbledore even if there was no possibility of ransom? OF COURSE THEY WOULD- the whole point is defeating Dumbledore, the person standing against them. Voldemort wouldn’t ask for ransom, because its a stupid thing to do- he would kill Aberforth and send pieces of him to Dumbledore by owl. This idea that ransoming makes targets of all of Dumbledore’s allies is just idiotic- they are already targets.</p> <p>Next, ask yourself this- does Voldemort have any use for money? Money is an abstract, useful because we can exchange for useful things. But its pretty apparent that Voldemort doesn’t really need money- he has no problem killing, taking and stealing. The parts of magical Britain that are willing to stand up to him won’t sell his death eaters goods at any price, and the rest are so scared they’ll give him anything for free.</p> <p>Meanwhile, Dumbledore is leading a dedicated resistance- basically a volunteer army. He doesn’t need to buy people’s time, they are giving it freely! Mad Eye himself notes that he could ask the Longbottoms or the Potters to empty their vaults and they would. What the resistance needs isn’t money, its people willing to fight. So in the context of this sort of war, and able fighting man like Aberforth is worth basically infinite money- money is common and useless and people willing to stand up to Voldemort are in extremely tight supply.</p> <p>It would have made a lot more sense to have Voldemort ask for prisoner exchange or something like that. Aberforth in exchange for Bellatrix Black. Then both sides would be trading value for value. But then the Tetlock reference wouldn’t be nearly as on-the-nose.</p> <p>At least this chapter makes clear the reason for the profoundly stupid wizard justice system and the utterly absurd blood-debt-punishment system. The whole idea was to arrange things so Hariezer could be asked to pay a ransom to Luscious Malfoy, so the reader can learn about Tetlock’s research/putting a price on lives,etc.</p> <p>At least I only have like 20 chapters of this thing left.</p> <h3 id="name-of-the-wind-bitching">Name of the Wind bitching</h3> <p>Whelp, Kvothe’s “awesomeness” has totally overwhelmed the narrative. Kvothe now has several “awesome” skills- he plays music so well that he was spontaneously given 7 talons (which is 7 times the standard Rothfuss unit for “a lot of money”). He plays music for money in a low-end tavern.</p> <p>He is a journeyman artificer, which means he can make magic lamps and what not and sell them for cash. He is brilliant enough that he could easily tutor students. He has two very wealthy friends who he could borrow from.</p> <p>AND YET he is constantly low on cash. To make this seem plausible, the book is weighed down by super long exposition in which Kvothe explains to the reader why all these obvious ways to make money aren’t working for him. When Kvothe isn’t explaining it to the reader directly, we cut to the framing story where Kvothe is explaining it to his two listeners. 
The book is chock-full of these paragraphs that are like “I know this is really stupid but here is why it actually makes sense.” Removing all this justification/exposition would probably cut the length of the book by at least 1/4.</p> <p>I could look past all of this if we were meeting other interesting characters at wizard school, but that isn’t happening. Kvothe has two good friends among the students, Wil and Sim. I’ve read countless paragraphs of banter between them and Kvothe, but I don’t know what they study, or really anything about them other than one has an accent.</p> <p>Another character Auri, a homeless girl who is that fantasy version of “mentally ill” that just makes people extra quirky, became friends with Kvothe off screen. Literally, we find out she exists after she has already listened to Kvothe playing music for days. She shows up for a scene then vanishes again for awhile.</p> <p>And we get a love interest who mostly just banters wittily with Kvothe and then vanishes. After pages of witty banter, Kvothe will then remind the reader he is shy around women (despite, you know, having just wittily bantered for pages, because that’s how characterization works in this novel).</p> <h3 id="hpmor-bitching">HPMOR bitching</h3> <p>Much like my previous name of the wind complaints, HPMOR is heavy with exposition- and for a similar reason. Hariezer is too “awesome” which leads to heavy-handed exposition (if for slightly different reasons than name of the wind).</p> <p>The standard rule of show,don’t tell implies that the best way to teach your audience something in a narrative is to have your characters learn from experience. Your characters need to make a mistake, or have something backfire. That way they can come out the other side stronger, having learned something. If you don’t trust your audience to have gotten the lesson, you can cap it off with some exposition outlining exactly what you want to learn, but the core of the lesson should be taught by the characters experience.</p> <p>But Yudkowsky inserted a Hariezer fully-equipped with the “methods of rationality.” So we get lots of episodes that set-up a conflict, and then Hariezer has a huge dump of exposition that explains why its not really a problem because rationality-judo, and the tension drains away. It would be far better to have Hariezer learn over time, so the audience can learn along with him.</p> <p>So Hariezer isn’t going to grow, he is just going to exposition dump most of his problems away. We can at least watch him explore the world, right? After all, Yudkowsky has put a “real scientist” into Hogwarts so we can finally see what material you actually learn at wizard school! All that academic stuff missing from the original novels! NOPE- we haven’t had a single class in the last 60 chapters. Hariezer isn’t even learning magic in a systematic way.</p> <p>I really, really don’t see what people see in this. The handful of chapters I found amusing feel like an eternity ago, it ran off the rails dozens of chapters ago! People sell the story as “using the scientific method in JK Rowling’s universe” but a more accurate description would be “it starts as using the scientific method in JK rowling’s universe, but stops doing that around chapter 25 or so. Then mostly its just about a war game, with some political maneuvering.”</p> <h3 id="hpmor-83-84">HPMOR 83-84</h3> <p>These are just rehashes of things we’ve already been presented with (so many words, so little content). 
The other students still think Hermione did it (although this is written in an awkward tell-rather-than-show style- Hariezer tells Hermione what is going on, rather than Hermione or the reader experiencing it). We get gems of cloying elitism like this:</p> <blockquote> <p>Hermione, you’ve told me a lot of times that I look down too much on other people. But if I expected too much of them - if I expected people to get things right - I really would hate them, then. Idealism aside, Hogwarts students don’t actually know enough cognitive science to take responsibility for how their own minds work. It’s not their fault they’re crazy.</p> </blockquote> <p>There is one bit of new info- as part of this investigation of the attempted murder of Draco, I guess Quirrell was investigated, and the aurors seem to think he is some missing wizard lord or something. This is totally superfluous- I assume we all know Quirrell is Voldemort. I’m hoping this doesn’t turn into a plot line.</p> <p>And finally, Quirrell tries to convince Hermione to leave and go to a wizard school where people don’t think she tried to kill someone. This is fine, but in part of it, Quirrell gives us this gem on being a hero:</p> <blockquote> <p>Long ago, long before your time or Harry Potter’s, there was a man who was hailed as a savior. The destined scion, such a one as anyone would recognize from tales, wielding justice and vengeance like twin wands against his dreadful nemesis…</p> <p>&quot;In all honesty…I still don’t understand it. They should have known that their lives depended on that man’s success. And yet it was as if they tried to do everything they could to make his life unpleasant. To throw every possible obstacle into his way. I was not naive, Miss Granger, I did not expect the power-holders to align themselves with me so quickly - not without something in it for themselves. But their power, too, was threatened; and so I was shocked how they seemed content to step back, and leave to that man all burdens of responsibility. They sneered at his performance, remarking among themselves how they would do better in his place, though they did not condescend to step forward.”</p> </blockquote> <p>So… the people seem mostly to rally around Dumbledore. He has a position of power and influence because of his dark-wizard vanquishing deeds. There aren’t a lot of indications people are actively attempting to make Dumbledore’s life unpleasant- he has the position he wants, turned down the position of Minister of Magic, etc. People are mostly in awe of Dumbledore.</p> <p>But there is some other hero, we are supposed to believe, who society mocked? I can’t help but draw parallels to Friendly AI research here…</p> <h3 id="hpmor-85">HPMOR 85</h3> <p>A return to my blogging obsession of old (which has been a slog for at least 20 chapters now, but if there is one thing that is true of all PhDs- we finish what we fucking start, even if it’s an awful idea).</p> <p>This chapter is actually not so bad, mostly Hariezer just reflecting on the difficulty of weighing his friends’ lives against “the cause,” as Dumbledore suggested he failed to do with Hermione in her trial a few chapters ago.</p> <p>There are some good bits. For instance, this interesting bit about bows and arrows in Australia:</p> <blockquote> <p>A year ago, Dad had gone to the Australian National University in Canberra for a conference where he’d been an invited speaker, and he’d taken Mum and Harry along. 
And they’d all visited the National Museum of Australia, because, it had turned out, there was basically nothing else to do in Canberra. The glass display cases had shown rock-throwers crafted by the Australian aborigines - like giant wooden shoehorns, they’d looked, but smoothed and carved and ornamented with painstaking care. In the 40,000 years since anatomically modern humans had migrated to Australia from Asia, nobody had invented the bow-and-arrow. It really made you appreciate how non-obvious was the idea of Progress.</p> </blockquote> <p>I always thought the fact that Australians (and a lot of small islanders) lost the bow and arrow (which is interesting! They had it and then they forgot about it!) was an interesting observation about the power of sharing ideas and the importance of large groups for creativity. Small, isolated populations seem to lose the ability to innovate. Granted, almost all of my knowledge about this comes from one anthropology course I only half remember.</p> <p>And of course there are always some sections that filled me with rage-</p> <blockquote> <p>Even though Muggle physics explicitly permitted possibilities like molecular nanotechnology or the Penrose process for extracting energy from black holes, most people filed that away in the same section of their brain that stored fairy tales and history books, well away from their personal realities:</p> </blockquote> <p>Molecular nanotechnology is just the term that sci-fi authors (and Eric Drexler) use for magic. And the nearest black hole is probably something like 2000 light years away. The reason people treat this stuff as far from their personal reality is exactly the same reason Yudkowsky treats it as far from his personal reality- IT IS. Black holes are neat, and GR is a ton of fun, but we aren’t going to be engineering with black holes in my lifetime.</p> <blockquote> <p>No surprise, then, that the wizarding world lived in a conceptual universe bounded - not by fundamental laws of magic that nobody even knew - but just by the surface rules of known Charms and enchantments…Even if Harry’s first guess had been mistaken, one way or another it was still inconceivable that the fundamental laws of the universe contained a special case for human lips shaping the phrase ‘Wingardium Leviosa’. …What were the ultimate possibilities of invention, if the underlying laws of the universe permitted an eleven-year-old with a stick to violate almost every constraint in the Muggle version of physics?</p> </blockquote> <p>You know what would be awesome? IF YOU GOT AROUND TO DOING SOME EXPERIMENTS AND EXPLORING THIS IDEA. The absolute essence of science is NOT asking these questions, it’s deciding to try to find out the fucking answers! You can’t be content to just wonder about things, you have to put the work in! Hariezer’s wonderment never gets past the stoned-college-kid wondering aloud and into ACTUAL exploration, and it’s getting really frustrating. YOU PROMISED ME YOU WERE GOING TO USE THE SCIENTIFIC METHOD TO LEARN THE SECRETS OF MAGIC. WAY BACK IN THE EARLY CHAPTERS.</p> <p>Anyway, towards the end of the ruminations, Fawkes visits Hariezer and basically offers to take him to Azkaban to try to take out the evil place. Hariezer (probably wisely) decides not to go. And the chapter ends.</p> <h3 id="hpmor-86">HPMOR 86</h3> <p>I just realized I have like 145 followers (HOLY SHIT!) and they probably came for the HPMOR thing. 
So I better keep the updates rolling!</p> <p>Anyway, this chapter is basically Hariezer and friends (Dumbledore, Snape, McGonagall, Mad-eye Moody) all trying to guess who might have been responsible for trying to frame Hermione. No real conclusions are drawn, not much to see here.</p> <p>A few notable things here- magic apparently works by the letter of the law, rather than the spirit:</p> <blockquote> <p>You say someone with the Dark Mark can’t reveal its secrets to anyone who doesn’t already know them. So to find out how the Dark Mark operates, write down every way you can imagine the Dark Mark might work, then watch Professor Snape try to tell each of those things to a confederate - maybe one who doesn’t know what the experiment is about - I’ll explain binary search later so that you can play Twenty Questions to narrow things down - and whatever he can’t say out loud is true. His silence would be something that behaves differently in the presence of true statements about the Mark, versus false statements, you see.</p> </blockquote> <p>Luckily, Voldemort thought of the test, thus freeing Snape to tell how the mark actually works:</p> <blockquote> <p>The Dark Lord was no fool, despite Potter’s delusions. The moment such a test is suspected, the Mark ceases to bind our tongues. Yet I could not hint at the possibility, but only wait for another to deduce it.</p> </blockquote> <p>Why not just make sure the Death Eaters don’t actually know the secrets of the mark? Seems like memory spells are everywhere already, and it would be way easier than this silly logic puzzle.</p> <p>Finding out the secrets of the dark mark prompts Hariezer to try a Bayesian estimate of whether Voldemort is actually dangerous. I repeat that for emphasis:</p> <blockquote> <p>Harry Potter, first-year at Hogwarts, who has only really succeeded at 1 thing in his learn-the-science-of-magic plan (partial transfiguration), and who knows he is not the most dangerous wizard at Hogwarts (Quirrell, Dumbledore), wonders whether Voldemort could possibly be a threat.</p> </blockquote> <p>Here are some of the things he considers:</p> <p>Harry had been to a convocation of the Wizengamot. He’d seen the laughable ‘security precautions’, if you could call them that, guarding the deepest levels of the Ministry of Magic. They didn’t even have the Thief’s Downfall which goblins used to wash away Polyjuice and Imperius Curses on people entering Gringotts … [if it] took you more than ten years to fail to overthrow the government of magical Britain, it meant you were stupid. But might they have some other precautions? Maybe they use some sort of secret precautions Harry himself doesn’t know about yet? Or might the wizards of the Wizengamot be pretty powerful in their own right?</p> <blockquote> <p>There were hypotheses where the Dark Lord was smart and the Order of the Phoenix didn’t just instantly die, but those hypotheses were more complicated and ought to get complexity penalties. After the complexity penalties of the further excuses were factored in, there would be a large likelihood ratio from the hypotheses ‘The Dark Lord is smart’ versus ‘The Dark Lord was stupid’ to the observation, ‘The Dark Lord did not instantly win the war’. That was probably worth a 10:1 likelihood ratio in favor of the Dark Lord being stupid… but maybe not 100:1. 
You couldn’t actually say that ‘The Dark Lord instantly wins’ had a probability of more than 99 percent, assuming the Dark Lord started out smart; the sum over all possible excuses would be more than .01.</p> </blockquote> <p>Dude, do you even Bayesian? Probability the dark mark still works if Voldemort is dead: ~0 (everyone who knows magic thinks that the mark still existing is proof he is still out there). Given that Voldemort is alive, probability he successfully completed some sort of immortality ritual: ~1. Probability someone who completed an immortality ritual knows more magic than (and therefore is a threat to) Hariezer Yudotter: ~1.</p> <p>So given that the dark mark is still around, Voldie is crazy dangerous, regardless of priors or base rates.</p> <p>It’s helpful to look at where the information is, instead of trying to estimate the probability Voldemort could have instantly killed some of the most powerful wizards on the fucking planet.</p> <p>Anyway….</p> <p>OH, another thing that happens- Hariezer challenges Mad-eye to a little mini duel. Guess how he solves the problem of winning against Mad-eye? Any ideas? What could he use? I’ll give you a hint- it rhymes with time turner. This story really should be called Harry Potter And The Method of Time Turners. Seriously- time turners solve basically all the problems in this book. Anyway, he goes to Flitwick, learns a bending spell, and then time turners back into the room to pop Moody.</p> <p>It’s not actually a bad scene- there is a bit of action and it moves pretty quickly. The problem is that the time turner solution is so damn boring at this point.</p> <p>Also, we find out in this chapter that everyone believes Quirrell is really somebody named David Monroe, whose family was killed by Voldemort and who was a leader during the war against Voldemort.</p> <p>So we have some possibilities-</p> <ol> <li><p>Voldemort was impersonating/half-invented the personality of David Monroe in order to play both sides during the war. After all, all of Monroe’s family was killed but him. Maybe all of Monroe’s family was killed, including him, and Voldemort started impersonating the dead guy. This could be a neat dynamic, I guess. Could “Voldemort” have been a Watchmen-style plan to unite magical Britain against a common threat that went awry for Monroe/Riddle? Quirrell really did get body snatched in this scenario. We could imagine an ending here where Monroe/Riddle are training Potter to be the leader of magical Britain that Monroe/Riddle wanted to be.</p></li> <li><p>Monroe was a real dude, Voldemort body-snatched him, and now you’ve got Monroe brain fighting Voldemort brain inside. For some reason, they are impersonating Quirrell?</p></li> </ol> <p>If it’s not the first scenario, I’m going to be sort of annoyed, because scenario 2 doesn’t provide us with much reason for the weird Monroe bit- you could just give Quirrell all of Monroe’s backstory.</p> <p>Anyway, 86 chapters in, I think this damn thing is going to clock in around 120 when all is said and done. ::sigh:: Time for a scotch.</p> <h3 id="hpmor-87-skinner-box-your-friends">HPMOR 87: skinner box your friends</h3> <p>Hariezer is worried Hermione will be uncomfortable around him after the trial. 
So what is his solution?</p> <blockquote> <p>&quot;I was going to give you more space,&quot; said Harry Potter, &quot;only I was reading up on Critch’s theories about hedonics and how to train your inner pigeon and how small immediate positive and negative feedbacks secretly control most of what we actually do, and it occurred to me that you might be avoiding me because seeing me made you think of things that felt like negative associations, and I really didn’t want to let that run any longer without doing something about it, so I got ahold of a bag of chocolates from the Weasley twins and I’m just going to give you one every time you see me as a positive reinforcement if that’s all right with you -&quot;</p> </blockquote> <p>Now, this idea of positive/negative reinforcement is an old one, and probably goes back to the psychologists associated with behaviorism (BF Skinner, Pavlov, etc.).</p> <p>The weird thing is, there is no “Critch” I can find associated with the behaviorists, or really any of the stuff attributed above. I also emailed my psych friend, who also has never heard of it (but “it’s not really my field at all”). I’m thinking there is like a 90% chance that Yudkowsky just invented a scientist here? Why not just say BF Skinner, or Pavlov here? WHAT IS GOING ON HERE?</p> <p>Anyway, Hermione and Hariezer are brainstorming ways to make money when they get into an argument because Hariezer has been sciencing with Draco:</p> <blockquote> <p>&quot;You were doing SCIENCE with him? &quot; &quot;Well -&quot; &quot;You were doing SCIENCE with him? You were supposed to be doing science with ME! &quot;</p> </blockquote> <p>Hermione, I get it. You wanted to figure out how magic works- you’ve got some curiosity about the world. And now you think Hariezer kept his program going, but cut you out of the big discoveries, and will leave you off the publications. But I’ve got news for you, girl: he hasn’t been doing science WITH ANYONE for like 60 chapters now. HE JUST FORGOT ABOUT IT.</p> <p>Anyway, this argument blows up, and Hariezer explains puberty:</p> <blockquote> <p>But even with all that weird magical stuff letting me be more adult than I should be, I haven’t gone through puberty yet and there’s no hormones in my bloodstream and my brain is physically incapable of falling in love with anyone. So I’m not in love with you! I couldn’t possibly be in love with you!</p> </blockquote> <p>And then he drops some evopsych:</p> <blockquote> <p>and besides I’ve been reading about evolutionary psychology, and, well, there are all these suggestions that one man and one woman living together happily ever afterward may be more the exception rather than the rule, and in hunter-gatherer tribes it was more often just staying together for two or three years to raise a child during its most vulnerable stages - and, I mean, considering how many people end up horribly unhappy in traditional marriages, it seems like it might be the sort of thing that needs some clever reworking - especially if we actually do solve immortality…</p> </blockquote> <p>To the story’s credit, this works about as well as you’d expect and Hermione storms off.</p> <p>I think the evopsych dropping could have been sort of funny if it were played more for laughs (Hariezer’s inept way of calming Hermione down), but here it just seems like a way to shoehorn this bit of evopsych into the story.</p> <p>The final scene in the chapter is played for laughs, with another student coming over after seeing Hermione storm off and saying “Witches! 
go figure, huh?”</p> <h3 id="hpmor-88-in-which-i-complain-about-a-lack-of-time-turners">HPMOR 88: in which I complain about a lack of time turners</h3> <p>The problem with solving every problem in your story with time turners is that it becomes incredibly conspicuous when you don’t solve a problem with time turners.</p> <p>In this chapter, the bit of canon from book 1 with the troll in the dungeon is introduced- someone comes running into the dining hall yelling troll. Luckily, Quirrell has the students well prepared:</p> <blockquote> <p>Without any hesitation, the Defense Professor swung smoothly on the Gryffindor table and clapped his hands with a sound like a floor cracking through. &quot;Michelle Morgan of House Gryffindor, second in command of Pinnini’s Army,&quot; the Defense Professor said calmly into the resulting quiet. &quot;Please advise your Head of House.&quot; Michelle Morgan climbed up onto her bench and spoke, the tiny witch sounding far more confident than Minerva remembered her being at the start of the year. “Students walking through the hallways would be spread out and impossible to defend. All students are to remain in the Great Hall and form a cluster in the center… not surrounded by tables, a troll would jump right over tables… with the perimeter defended by seventh-year students. From the armies only, no matter how good they are at duelling, so they don’t get in each other’s lines of fire.”</p> </blockquote> <p>So everyone will be safe from troll, but WAIT- Hariezer realizes Hermione is missing. What does he do? Does he commit himself to time turning himself a message telling him where Hermione is (to be fair, the time is noon, and the earliest he can reach with a time turner is 3pm. However he knows of another student who uses a time turner and is willing to send messages with it, from the post Azkaban escape. He also knows other powerful wizards use time turners, so he could ask one of them to pass the message,etc).</p> <p>I suspect we are approaching an important plot moment that time turnering would somehow break. Maybe we finally get a Quirrell reveal? Anywho, it’s jarring to not see Hariezer go immediately for the time turner. Instead he tries to enlist the aid of other students (and not ask if anyone has a time turner).</p> <p>Anyway, Hariezer decides they need to go look for her as fast as possible- but then</p> <blockquote> <p>The witch he’d named turned from where she’d been staring steadily out at the perimeter, her expression aghast for the one second before her face closed up. &quot;The Deputy Headmistress ordered us all to stay here, Mr. Potter.&quot; It took an effort for Harry to unclench his teeth. “Professor Quirrell didn’t say that and neither did you. Professor McGonagall isn’t a tactician, she didn’t think to check if we had missing students and she thought it was a good idea to start marching students through the hallways. 
But Professor McGonagall understands after her mistakes are pointed out to her, you saw how she listened to you and Professor Quirrell, and I’m certain that she wouldn’t want us to just ignore the fact that Hermione Granger is out there, alone -“</p> </blockquote> <p>So Hariezer flags this as:</p> <blockquote> <p>Harry’s brain flagged this as I’m talking to NPCs again and he spun on his heel and dashed back for the broomstick.</p> </blockquote> <p>Yes, Hariezer, in this world you are talking to NPCs- characters Yudkowsky wrote in, entirely to be stupid so that you can appear brilliant.</p> <p>Anyway, he rushes off with the Weasley twins to go find Hermione, and just as he finds her the chapter ends. I look forward to tuning in next time for the thrilling conclusion.</p> <h3 id="hpmor-89-grimdark">HPMOR 89: grimdark</h3> <p>There will be spoilers ahead. Although if you cared about spoilers, why are you reading this?</p> <p>So I thought the plot moment we were leading up to was a Quirrell reveal and I was dead wrong (a pun, because Hermione dies). By the time Hariezer arrives, Hermione has already been troll-smashed (should have used the time turner, bro).</p> <p>A brief battle scene ensues in which the Weasleys fail to be very effective, and Hariezer kills the troll by floating his father’s rock (which he has been wearing in a ring) into the troll’s mouth and then letting it go back to its original size, which pops the troll’s head.</p> <p>Hermione then utters her final words “not your fault” and then dies. Hariezer is obviously upset by this.</p> <p>Not a bad chapter really, even though it required a sort of “rationality failure” involving the time turners to get here. Normally I wouldn’t care about this sort of thing, but the fact that <em>basically every problem thus far was solved with time turners</em> makes it very hard to suspend my disbelief here. It feels a touch too much like characters are doing things just to make the plot happen (and not following their ‘natural’ actions).</p> <p>I fear the next ten chapters will be just reflections on this (instead of things happening).</p> <h3 id="hpmor-90-hariezer-s-lack-of-self-reflection">HPMOR 90: Hariezer's lack of self reflection</h3> <p>Brief note- it’s Mardi Gras, and I’m about as over served as I ever have been. I LIKE HOW OVER SERVED AS A PHRASE BLAMES THE BARTENDER AND NOT ME. THIS IS A THEME FOR THIS CHAPTER. Anyway, hopefully this will not lack my usual (non) eloquence.</p> <p>This chapter begins what appears to be a 9-part section on Hariezer trying to cope with the death of his friend.</p> <p>As the chapter opens, Hariezer cools Hermione’s body to try to preserve it. I guess that will slow decay, but probably not by enough to matter.</p> <p>And then Hariezer gets understandably mopey. Everyone is concerned he is withdrawing from the world, so McGonagall goes to talk to him and we get this bit:</p> <blockquote> <p>&quot;Nothing I could have done? &quot; Harry’s voice rose on the last word. &quot;Nothing I could have…Or if I’d just approached the whole problem from a different angle - <strong>if I’d looked for a student with a Time-Turner to send a message back in time</strong>…</p> </blockquote> <p>It’s the one in bold that is especially troubling, because the time turner is seriously what Hariezer always turns to (TURNS TO! GET IT! IT’S AN AWFUL PUN). When your character is defined by his munchkining ability to solve problems via time turner, and the one time he doesn’t go for the time turner a major plot point happens, it’s jarring to the reader. 
Almost as if characters are behaving entirely to make the plot happen…</p> <p>Anyway,</p> <blockquote> <p>She was aware now that tears were sliding down her cheeks, again. “Harry - Harry, you have to believe that this isn’t your fault!” &quot;Of course it’s my fault. There’s no one else here who could be responsible for anything.&quot; &quot;No! You-Know-Who killed Hermione!&quot; She was hardly aware of what she was saying, that she hadn’t screened the room against who might be listening. &quot;Not you! No matter what else you could’ve done, it’s not you who killed her, it was Voldemort! If you can’t believe that you’ll go mad, Harry!&quot; &quot;That’s not how responsibility works, Professor.&quot; Harry’s voice was patient, like he was explaining things to a child who was certain not to understand. He wasn’t looking at her anymore, just staring off at the wall to her right side. &quot;When you do a fault analysis, there’s no point in assigning fault to a part of the system you can’t change afterward</p> </blockquote> <p>So keep this in mind- Hariezer says it’s no use blaming anyone but himself, because he can’t change their actions. This seems like a silly NPC/PC distinction- no one can change their past actions, but everyone can learn how they could have improved things.</p> <blockquote> <p>&quot;All right, then,&quot; Harry said in a monotone. &quot;I tried to do the sensible thing, when I saw Hermione was missing and that none of the Professors knew. I asked for a seventh-year student to go with me on a broomstick and protect me while we looked for Hermione. I asked for help. I begged for help. And nobody helped me. Because you gave everyone an absolute order to stay in one place or they’d be expelled, no excuses…. So when something you didn’t foresee happened and it would’ve made perfect sense to send out a seventh-year student on a fast broom to look for Hermione Granger, the students knew you wouldn’t understand or forgive. They weren’t afraid of the troll, they were afraid of you. The discipline, the conformity, the cowardice that you instilled in them delayed me just long enough for Hermione to die. Not that I should’ve tried asking for help from normal people, of course, and I will change and be less stupid next time. But if I were dumb enough to allocate responsibility to someone who isn’t me, that’s what I’d say.&quot;</p> </blockquote> <p>What exactly does Hariezer think she should have said here? If a fire had broken out in the meal hall, does Hariezer think that everyone would have stayed in the cafeteria and burned to death out of fear of McGonagall? Also, it certainly sounds as if Hariezer has plenty of blame for people other than himself. ”I only blame me, but also you suck in the following ways…”</p> <blockquote> <p>But normal people don’t choose on the basis of consequences, they just play roles. There’s a picture in your head of a stern disciplinarian and you do whatever that picture would do, whether or not it makes any sense….People like you aren’t responsible for anything, people like me are, and when we fail there’s no one else to blame.”</p> </blockquote> <p>I AM THE ONLY PC, YOU ARE ALL NPC. I AM THE ONLY FULL HUMAN. TREMBLE BEFORE MY AGENTYNESS. I get that Hariezer is mourning, but is there any more condescending way to mourn? ”Everything is my fault because you aren’t all even fully human?” You are a fucking twerp, Hariezer, even when you mourn.</p> <blockquote> <p>His hand darted beneath his robes, brought forth the golden sphere that was the Ministry-issued protective shell of his Time Turner. 
He spoke in a dead, level voice without any emphasis. “This could’ve saved Hermione, if I’d been able to use it. But you thought it was your role to shut me down and get in my way.”</p> </blockquote> <p>No, Hariezer, you were told THERE WERE RULES and you violated them. You yourself have said that time travel can be dangerous, and you were using it because Snape asked questions you didn’t know the answer to, and really to solve any trivial problem. You broke the rules, and that got your time turner locked down right when you might have really wanted it. Total boy-who-cried-wolf situation, and yet it’s conspicuously absent from your discussion above- you blame yourself in lots of ways, but not in this way.</p> <blockquote> <p>Unable to speak, she brought forth her wand and did so, releasing the time-keyed enchantment she’d laced into the shell’s lock.</p> </blockquote> <p>The only lessons learned from this are other characters “updating towards” the idea that Hariezer Yudotter is always right. And he fails when other people have prevented his natural PC-based awesomeness.</p> <p>Anyway, McGonagall sends in the big guns (Quirrell) to try to talk to Hariezer, which leads Hariezer to say to him:</p> <blockquote> <p>The boy’s voice was razor-sharp. “I’d forgotten there was someone else in Hogwarts who could be responsible for things.”</p> </blockquote> <p>And later in the conversation:</p> <blockquote> <p>&quot;You did want to save her. You wanted it so strongly that you made some sort of actual effort. I suppose your mind, if not theirs, would be capable of that.&quot;</p> </blockquote> <p>So you see- it’s clearly not about assigning himself all the blame (because he can only change his own actions), it’s about separating the world into ‘real people’ and ‘NPCs.’ Only real people can get any blame for anything, everyone else is just window dressing. Maybe it’s a pet peeve, but I react in abhorrence to this “you aren’t even human enough to share some blame” schtick.</p> <p>8 more chapters in this fucking section.</p> <h3 id="hpmor-91">HPMOR 91</h3> <p>Total retread of the last chapter. Hariezer is still blaming himself, Snape tries to talk to him. They bring his parents in to try to talk to him. Nothing here really.</p> <h3 id="hpmor-92">HPMOR 92</h3> <p>Really, still nothing here. Quirrell is also concerned about Hariezer, but as before his concern seems less than totally genuine. I fear this arc is basically just a lot of retreads.</p> <h3 id="hpmor-93">HPMOR 93</h3> <p>Still very little going on in these chapters…</p> <p>So McGonagall completes the transformation she began two chapters ago, and realizes rules are for suckers and Hariezer is always right:</p> <blockquote> <p>&quot;I am ashamed,&quot; said Minerva McGonagall, &quot;of the events of this day. I am ashamed that there were only two of you. Ashamed of what I have done to Gryffindor. Of all the Houses, it should have been Gryffindor to help when Hermione Granger was in need, when Harry Potter called for the brave to aid him. It was true, a seventh-year could have held back a mountain troll while searching for Miss Granger. And you should have believed that the Head of House Gryffindor,&quot; her voice broke, &quot;would have believed in you. If you disobeyed her to do what was right, in events she had not foreseen. And the reason you did not believe this, is that I have never shown it to you. I did not believe in you. I did not believe in the virtues of Gryffindor itself. 
I tried to stamp out your defiance, instead of training your courage to wisdom.</p> </blockquote> <p>Maybe I’m projecting too much of canon McGonagall onto my reading of this one in this fanfic, but has she really been stamping out all defiance and being overly stern? Would any student really have believed they would have been expelled for trying to help find a missing student in a dire situation?</p> <p>Hariezer certainly wasn’t expelled (or punished in any way) for his experimenting with transfiguration/discovering partial transfiguration. He was punished for flaunting his time turner despite explicit instructions not to… But in a school for magic users, that is probably a necessity.</p> <p>Also, Hermione’s body has gone missing. I suspect Hariezer is cryonicsing it.</p> <h3 id="hpmor-94">HPMOR 94</h3> <p>This is the best chapter of this “reflect on what just happened” giant block of chapters, but that’s not saying much.</p> <p>Hariezer might not have taken Hermione’s body, but seems unconcerned that it’s missing (maybe he took it to freeze the brain, maybe Voldie took it to resurrect Hermione or brain upload her or something). That’s the only real thing of merit that happens in this chapter (a conversation between Dumbledore and Hariezer, a conversation between Neville and Hariezer).</p> <p>Hariezer has finally convinced himself that Voldemort is smart, which leads to this rumination:</p> <blockquote> <p>Okay, serious question. If the enemy is that smart, why the heck am I still alive? Is it seriously that hard to poison someone, are there Charms and Potions and bezoars which can cure me of literally anything that could be slipped into my breakfast? Would the wards record it, trace the magic of the murderer? Could my scar contain the fragment of soul that’s keeping the Dark Lord anchored to the world, so he doesn’t want to kill me? Instead he’s trying to drive off all my friends to weaken my spirit so he can take over my body? It’d explain the Parselmouth thing. The Sorting Hat might not be able to detect a lich-phylactery-thingy. Obvious problem 1, the Dark Lord is supposed to have made his lich-phylactery-thingy in 1943 by killing whatshername and framing Mr. Hagrid. Obvious problem 2, there’s no such thing as souls.</p> </blockquote> <p>All the readers are already on board this train, because they’ve read the canon novel, so I guess it’s nice that the “super rationalist” is considering it (although “Voldemort is smart, therefore I have a Voldemort fragment trying to possess me” is a huge leap. You didn’t even Bayes that shit, bro).</p> <p>But seriously, “there’s no such thing as souls?” SO DON’T CALL IT A SOUL, CALL IT A MAGIC RESURRECTION FRAGMENT. Are we really getting hung up on semantics?</p> <p>These chapters are intensely frustrating because any “rising action” in this story (we are nearing the conclusion after all) is blunted- after anything happens, we need 10 chapters for everyone to talk about everything and digest the events. The ratio of words/plot is ridiculously huge.</p> <p>We do maybe get a bit of self-reflection when Neville tries to blame himself for Hermione’s death:</p> <blockquote> <p>&quot;Wow,&quot; the empty air finally said. &quot;Wow. That puts a pretty different perspective on things, I have to say. I’m going to remember this the next time I feel an impulse to blame myself for something. 
Neville, the term in the literature for this is ‘egocentric bias’, it means that you experience everything about your own life but you don’t get to experience everything else that happens in the world. There was way, way more going on than you running in front of me. You’re going to spend weeks remembering that thing you did there for six seconds, I can tell, but nobody else is going to bother thinking about it. Other people spend a lot less time thinking about your past mistakes than you do, just because you’re not the center of their worlds. I guarantee to you that nobody except you has even considered blaming Neville Longbottom for what happened to Hermione. Not for a fraction of a second. You are being, if you will pardon the phrase, a silly-dilly. Now shut up and say goodbye.&quot;</p> </blockquote> <p>It would be nice for Hariezer to more explicitly use this to come to terms with his own grieving (instead of insisting on “heroic responsibility” for himself a few sections back, and also insisting it’s McGonagall’s fault for trying to enforce rules, and now insisting that blaming yourself is egocentric bias). I hope this is Hariezer realizing that he shouldn’t blame himself, and growing a bit, but fear this is Hariezer suggesting that Neville isn’t important enough to blame.</p> <p>Anyway, Hariezer insists that Neville leave for a while to help keep him safe.</p> <h3 id="hpmor-95">HPMOR 95</h3> <p>So the chapter opens with more incuriousness, which is the rest of the chapter in miniature:</p> <blockquote> <p>Harry had set the alarm upon his mechanical watch to tell him when it was lunchtime, since he couldn’t actually look at his wrist, being invisible and all that. It raised the question of how his eyeglasses worked while he was wearing the Cloak. For that matter the Law of the Excluded Middle seemed to imply that either the rhodopsin complexes in his retina were absorbing photons and transducing them to neural spikes, or alternatively, those photons were going straight through his body and out the other side, but not both. It really did seem increasingly likely that invisibility cloaks let you see outward while being invisible yourself because, on some fundamental level, that was how the caster had - not wanted - but implicitly believed - that invisibility should work.</p> </blockquote> <p>This would be an excellent fucking question to explore, maybe via some experiments. But no. I’ve totally given up on this story exploring the magic world in any detail at all. Anyway, Hariezer skips straight from “I wonder how this works” to “it must work this way, how could we exploit it”:</p> <blockquote> <p>Whereupon you had to wonder whether anyone had tried Confunding or Legilimizing someone into implicitly and matter-of-factly believing that Fixus Everythingus ought to be an easy first-year Charm, and then trying to invent it. Or maybe find a worthy Muggleborn in a country that didn’t identify Muggleborn children, and tell them some extensive lies, fake up a surrounding story and corresponding evidence, so that, from the very beginning, they’d have a different idea of what magic could do.</p> </blockquote> <p>This skips all the interesting hard work of science.</p> <p>The majority of the chapter is a long discussion between Quirrell and Hariezer where Quirrell tries to convince Hariezer not to try to raise the dead. It’s too dangerous, may end the universe, etc.</p> <p>Lots of discussion about how special Quirrell and Hariezer are because only they would even think to fight death, etc. 
It’s all a boring retread of ideas already explored in earlier chapters,etc.</p> <p>It reads a lot like any discussion of cryonics with a cryonics true believer:</p> <blockquote> <p>The Defense Professor’s voice was also rising. “The Transfiguration Professor is reading from a script, Mr. Potter! That script calls for her to mourn and grieve, that all may know how much she cared. Ordinary people react poorly if you suggest that they go off-script. As you already knew!”</p> </blockquote> <p>Also, it’s sloppy world building- do we really think no wizards in the HPMOR universe have spent time investigating death/spells to reverse aging/spells to deal with head injuries,etc?</p> <p>THERE IS A RESURRECTION STONE AND A LITERAL GATEWAY TO THE AFTERLIFE IN THE BASEMENT OF THE MINISTRY OF MAGIC. Maybe Hariezer’s FIRST FUCKING STOP if he wanted to investigate bringing back the dead SHOULD BE THAT GATE. Maybe some scientific experiments?</p> <p>It’s like the above incuriousness with the invisibility cloak (and the typical transhumanist approach to science)- assume all the problems are solved and imagine what the world be like, how dangerous that power might be. This is no way to explore a question. It’s not even producing a very interesting story.</p> <p>Quirrell assumes Hariezer might end the world even though he has shown 0 aptitude with any magic even approaching dangerous…</p> <h3 id="hpmor-96-more-of-the-same">HPMOR 96: more of the same</h3> <p>Remus takes Hariezer to Godric’s Hollow to try to cheer him up or whatever.</p> <p>Hariezer discovers the Potter’s family motto is apparently the passage from Corinthians:</p> <blockquote> <p>The Last Enemy to Be Destroyed is Death</p> </blockquote> <p>Hariezer is super glad that his family has a long history of trying to end death, and (at least) realizes that other wizards have tried. Of course, the idea of actually looking at their research doesn’t fucking occur to him because this story is very silly.</p> <p>We get this rumination from Hariezer on the Peverell’s ‘deathly hallows’ from the books:</p> <blockquote> <p>Hiding from Death’s shadow is not defeating Death itself. The Resurrection Stone couldn’t really bring anyone back. The Elder Wand couldn’t protect you from old age.</p> </blockquote> <p>HOW THE FUCK DO YOU KNOW THE RESURRECTION STONE CAN’T BRING ANYONE BACK? HAVE YOU EVEN SEEN IT?</p> <p>Step 1- assume that the resurrection stone doesn’t work because you can’t magically bring back the dead</p> <p>Step 2- decide you want to magically resurrect the dead</p> <p>Step 3- never revisit step 1.</p> <p>SCIENCE!</p> <p>GO INVESTIGATE THE DOORWAY TO THE AFTERLIFE! GO TALK TO PEOPLE ABOUT THE RESURRECTION STONE! DO SOME FUCKING RESEARCH! ”I’m going to resurrect the dead by thinking really hard about how much death sucks and doing nothing else.”</p> <h3 id="hpmor-97-plot-points-resolved-arbitrarily">HPMOR 97: plot points resolved arbitrarily</h3> <p>Next on the list to talk with Hariezer regarding Hermione’s death? The Malfoys who call Hariezer to Gringotts under the pretense of talking about Hariezer’s debt.</p> <p>On the way in he passes a goblin, which prompts this</p> <blockquote> <p>If I understand human nature correctly - and if I’m right that all the humanoid magical species are genetically human plus a heritable magical effect -</p> </blockquote> <p>How did you come to that conclusion Hariezer? What did you do to study it? Did you just make it up with no justification whatsoever? 
This story confuses science jargon for science.</p> <p>Anyway, Lucius is worried that he’ll be blamed for Hermione’s death (although given that it has already been established that the wizard court votes exactly as he wants it to, I’m not sure why he is worried about it), so he agrees to cancel Hariezer’s debt and return all his money if Hariezer swears Lucius didn’t have anything to do with the troll that killed Hermione.</p> <p>This makes very little sense- why would anyone listen to Hariezer on this? Hariezer doesn’t actually know that the Malfoys weren’t involved. If he is asked “how do you know?” he’ll have to say “I don’t.” If he Bayesed that shit, the Malfoys should be near the fucking top of the suspect list…</p> <p>Anyway, the Malfoys try to convince Hariezer that Dumbledore killed Hermione as some sort of multi-level plot.</p> <p>I’m so bored.</p> <h3 id="hpmor-98-this-block-is-nearly-over">HPMOR 98: this block is nearly over!</h3> <p>The agreement put in place in the previous chapter is enacted.</p> <p>Malfoy announces to Hogwarts that Hermione was innocent. Hariezer says there is no ill will between the Potters and the Malfoys. Why did we even need this fucking scene?</p> <p>Through backstage maneuvering by Hariezer and Malfoy, the Hogwarts board of governors enacts some rules for safety of students (travel in packs, work together, etc.). Why they needed the maneuvering I don’t know (just ask McGonagall to implement whatever rules you want. No effort required).</p> <p>Also, Neville was sent away from Hogwarts like... three chapters ago. But now he is in Hogwarts and stands up to read some of the rules? And Draco, who was closer to Hariezer, returns to Hogwarts? This makes no sense given Hariezer’s fear for his friends? ”No one is safe! Wait, I changed my mind even though nothing has happened.”</p> <p>There was also a surreal moment where the second worst thing I’ve ever read referenced the first:</p> <blockquote> <p>&quot;Remind me to buy you a copy of the Muggle novel Atlas Shrugged,&quot;</p> </blockquote> <h3 id="hpmor-99">HPMOR 99</h3> <p>This chapter is literally one sentence long. A unicorn died at Hogwarts. Why not just slap it into the previous chapter?</p> <h3 id="hpmor-100">HPMOR 100</h3> <p>Remember that mysterious bit about the unicorns dying? That merited a whole one-sentence chapter? Luckily, it’s totally resolved in this chapter.</p> <p>Borrowing a scene from canon, we have Draco and some Slytherin pals (working to fix the school) investigating the forest with Hagrid as part of a detention. This leads to a variant of an old game theory/CS joke:</p> <blockquote> <p>“Meself,” Hagrid continued, “I think we might ‘ave a Parisian hydra on our ‘ands. They’re no threat to a wizard, yeh’ve just got to keep holdin’ ‘em off long enough, and there’s no way yeh can lose. I mean literally no way yeh can lose so long’s yeh keep fightin’. Trouble is, against a Parisian hydra, most creatures give up long before. Takes a while to cut down all the heads, yeh see.” &quot;Bah,&quot; said the foreign boy. &quot;In Durmstrang we learn to fight Buchholz hydra. Unimaginably more tedious to fight! I mean literally, cannot imagine. First-years not believe us when we tell them winning is possible! 
Instructor must give second order, iterate until they comprehend.&quot;</p> </blockquote> <p>This time, it’s just Draco and friends in detention, no Hariezer/</p> <p>When Draco encounters the unicorn killer, all of a sudden Hariezer and aurors come riding in to save the day:</p> <blockquote> <p>After Captain Brodski had learned that Draco Malfoy was in the Forbidden Forest, seemingly in the company of Rubeus Hagrid, Brodski had begun inquiring to find out who had authorized this, and had still been unable to find out when Draco Malfoy had missed check-in. Despite Harry’s protests, the Auror Captain, who was authorized to know about Time-Turners, had refused to allow deployment to before the time of the missed check-in; there were standard procedures involving Time. But Brodski had given Harry written orders allowing him to go back and deploy an Auror trio to arrive one second after the missed check-in time.</p> </blockquote> <p>So… why does Hariezer come with the aurors? For what purpose? He is always talking about avoiding danger,etc so why ride into danger when the battle wizards will probably be enough?</p> <p>Anyway, we all know its Quirrell killing unicorns, so I’ll skip to the Hariezer/Quirrell interaction:</p> <blockquote> <p>The use of unicorn’s blood is too well-known.” &quot;I don’t know it,&quot; Harry said. &quot;I know you do not,&quot; the Defense Professor said sharply. &quot;Or you would not be pestering me about it. The power of unicorn’s blood is to preserve your life for a time, even if you are on the very verge of death.&quot;</p> </blockquote> <p>And then</p> <blockquote> <p>&quot;And why -&quot; Harry’s breath hitched again. &quot;Why isn’t unicorn’s blood standard in healer’s kits, then? To keep someone alive, even if they’re on the very verge of dying from their legs being eaten?&quot; &quot;Because there are permanent side effects,&quot; Professor Quirrell said quietly. &quot;Side effects? Side effects? What kind of side effect is medically worse than DEATH? &quot; Harry’s voice rose on the last word until he was shouting. &quot;Not everyone thinks the same way we do, Mr. Potter. Though, to be fair, the blood must come from a live unicorn and the unicorn must die in the drinking. Would I be here otherwise?&quot; Harry turned, stared at the surrounding trees. “Have a herd of unicorns at St. Mungos. Floo the patients there, or use portkeys.” &quot;Yes, that would work.&quot;</p> </blockquote> <p>So do you remember a few chapters back when Hariezer was worried about eating plants or animals that might be conscious (after he learned snake speech)?</p> <p>He knows literally nothing about unicorns here, nothing about what the side effects are,etc. I know lots of doctors who have living wills because they aren’t ok with the side effects of certain life-preserving treatments.</p> <p>This feels again like canon is fighting the transhumanist message the author wants to insert.</p> <h3 id="hpmor-101">HPMOR 101</h3> <p>Still in the woods, Hariezer encounters a centaur who tries to kill him, because he divines that Hariezer is going to make all the stars die.</p> <p>There are some standard anti-astrology arguments, which again seems to be fighting the actual situation because the centaurs successfully use astrology to divine things.</p> <p>We get this:</p> <blockquote> <p>&quot;Cometary orbits are also set thousands of years in advance so they shouldn’t correlate much to current events. 
And the light of the stars takes years to travel from the stars to Earth, and the stars don’t move much at all, not visibly. So the obvious hypothesis is that centaurs have a native magical talent for Divination which you just, well, project onto the night sky.&quot;</p> </blockquote> <p>There are so, so many other hypotheses, Hariezer. Maybe starlight has a magical component that waxes and wanes as stars align into different magical symbols, or some such. The HPMOR scientific method:</p> <p>observation -&gt; generate 1 hypothesis -&gt; assume you are right -&gt; it turns out that you are right.</p> <p>Quirrell saves Hariezer and I guess in the aftermath Filch and Hagrid both get sacked (we aren’t actually shown this, instead Dumbledore and Hariezer have a discussion about it, because why show when you can have characters talk about it! So much more interesting!)</p> <p>Anyway, Dumbledore is a bit sad about the loss of Filch and especially Hagrid, but Hariezer says:</p> <blockquote> <p>&quot;Your mistake,&quot; Harry said, looking down at his knees, feeling at least ten percent as exhausted as he’d ever been, &quot;is a cognitive bias we would call, in the trade, scope insensitivity. Failure to multiply. You’re thinking about how happy Mr. Hagrid would be when he heard the news. Consider the next ten years and a thousand students taking Magical Creatures and ten percent of them being scalded by Ashwinders. No one student would be hurt as much as Mr. Hagrid would be happy, but there’d be a hundred students being hurt and only one happy teacher.&quot;</p> </blockquote> <p>First, “in the trade”? Really?</p> <p>Anyway, Hariezer isn’t multiplying in the obvious tangible benefits of an enthusiastic teacher who really knows his shit regarding magical creatures. Yes, more students will be scalded, but it’s because there will be SUPER AWESOME LESSONS WHERE KIDS COULD BE SCALDED!</p> <p>On balance, I think Hariezer was right about Filch and Dumbledore was right about Hagrid.</p> <p>Anyway, that’s it for this chapter, it’s a standard “chapter where people do nothing but talk.”</p> <p>Harry Potter and the Methods of Expository Dialogue.</p> <h3 id="hpmor-102-open-borders-and-death-spells">HPMOR 102: open borders and death spells</h3> <p>Quirrell is still dying, Hariezer brings him a unicorn he turned into a stone.</p> <p>We learn how horcruxes work in this world:</p> <blockquote> <p>Only one who doess not believe in common liess will reasson further, ssee beneath obsscuration, realisse how to casst sspell. Required murder iss not ssacrificial ritual at all. Ssudden death ssometimes makess ghosst, if magic burssts and imprintss on nearby thing. Horcrux sspell channelss death-bursst through casster, createss your own ghosst insstead of victim’ss, imprintss ghosst in sspecial device. Ssecond victim pickss up horcrux device, device imprintss your memoriess into them. But only memoriess from time horcrux device wass made. You ssee flaw?”</p> </blockquote> <p>Wait, a ghost has all the memories of the person who died? Why isn’t Hariezer reading everything he can about how these imprints work? If the Horcrux can transfer ghost-like stuff into a person, could you return any ghost to a new body? I feel like Hariezer just says “I’m going to end death! Humanity should end death! I can’t believe no one is trying to end death!” But he isn’t actually doing anything about it himself.</p> <p>Also, if that is how a horcrux works WHY THE FUCK WOULD VOLDEMORT PUT ONE ON A PIONEER PROBE? 
The odds of that encountering people again are pretty much nill. At least we’ve learned horcruxes aren’t conscious- I had assumed Voldemort had condemned one of his copies to an eternity of isolation.</p> <p>We also learn that in HPMOR world</p> <blockquote> <p>There is a second level to the Killing Curse. Harry’s brain had solved the riddle instantly, in the moment of first hearing it; as though the knowledge had always been inside him, waiting to make itself known. Harry had read once, somewhere, that the opposite of happiness wasn’t sadness, but boredom; and the author had gone on to say that to find happiness in life you asked yourself not what would make you happy, but what would excite you. And by the same reasoning, hatred wasn’t the true opposite of love. Even hatred was a kind of respect that you could give to someone’s existence. If you cared about someone enough to prefer their dying to their living, it meant you were thinking about them. It had come up much earlier, before the Trial, in conversation with Hermione; when she’d said something about magical Britain being Prejudiced, with considerable and recent justification. And Harry had thought - but not said - that at least she’d been let into Hogwarts to be spat upon. Not like certain people living in certain countries, who were, it was said, as human as anyone else; who were said to be sapient beings, worth more than any mere unicorn. But who nonetheless wouldn’t be allowed to live in Muggle Britain. On that score, at least, no Muggle had the right to look a wizard in the eye. Magical Britain might discriminate against Muggleborns, but at least it allowed them inside so they could be spat upon in person. What is deadlier than hate, and flows without limit? &quot;Indifference,&quot; Harry whispered aloud, the secret of a spell he would never be able to cast; and kept striding toward the library to read anything he could find, anything at all, about the Philosopher’s Stone.</p> </blockquote> <p>So standard open borders stuff, not worth spending time with.</p> <p>But I want to talk about the magic here- apparently you can only cast the killing curse at people you hate, and toward people you are indifferent toward. So you can’t kill your loved ones! Big limitation!</p> <p>Also, Hariezer “99% of the fucking planet is NPCs” Yudotter isn’t indifferent to anyone? I call BS.</p> <h3 id="hpmor-103-very-punny">HPMOR 103: very punny</h3> <p>Credit where credit is due, this whole chapter sets up a pretty clever pun.</p> <p>The students take an exam, and then receive their final “battle magic” grades. Hermione is failed because she made the mistake of dying. Hariezer gets an exceeds expectations, which Quirrell informs Hariezer “It is the same grade… that I received in my own first year.”</p> <p>Get it? 
He marked him as an equal.</p> <h3 id="hpmor-104-plot-threads-hastily-tied-up-also-some-nonsense">HPMOR 104: plot threads hastily tied up/also some nonsense</h3> <p>So this chapter opens with a quidditch game, in an attempt to wrap up an earlier plot thread- Quirrell’s reward for his battle game (a reward given out back in chapter 34 or so, and literally never mentioned again until this chapter) was that Slytherin and Ravenclaw would tie for the House Cup and Hogwarts would stop playing quidditch with the snitch.</p> <p>Going into this game, Hufflepuff is in the lead for the House Cup “by something like five hundred points.” Quirrell is out of commission with his sickness, but the students have taken matters into their own hands- it appears the plan is just to not catch the snitch?</p> <blockquote> <p>It was at eight pm and six minutes, according to Harry’s watch, when Slytherin had just scored another 10 points bringing the score to 170-140, when Cedric Diggory leapt out of his seat and shouted “Those bastards!” &quot;Yeah!&quot; cried a young boy beside him, leaping to his own feet. &quot;Who do they think they are, scoring points?&quot; &quot;Not that!&quot; cried Cedric Diggory. &quot;They’re - they’re trying to steal the Cup from us! &quot; &quot;But we’re not in the running any more for -&quot; &quot;Not the Quidditch Cup! The House Cup!&quot;</p> </blockquote> <p>What? It’s totally unclear to me how this is supposed to work. In the books, as I remember it, points were awarded for winning quidditch games, NOT for simply scoring points within a quidditch game. Winning 500 to 500 will just result in some fixed amount of points going to the winner.</p> <p>Also, there appears to be a misunderstanding of quidditch:</p> <blockquote> <p>The game had started at six o’ clock in the afternoon. A typical game would have gone until seven or so, at which point it would have been time for dinner.</p> </blockquote> <p>No, as I recall, games go on for days, not one hour. I think the books mention a game lasting longer than a month. No one would be upset at a game where the snitch hasn’t been caught in a few hours.</p> <p>Basically, this whole thing feels really ill-conceived.</p> <p>Luckily, the chapter pivots away from the quidditch game pretty quickly: Hariezer gets a letter from himself.</p> <blockquote> <p>Beware the constellation, and help the watcher of stars and by the wise and the well-meaning. in the place that is prohibited and bloody stupid. Pass unseen by the life-eaters’ confederates, Six, and seven in a square,</p> </blockquote> <p>I note that Hariezer established way back when somewhere that he has a system in place to communicate with himself, with secret codes for his notes to make sure they really are for him. I’m too lazy to dig this back up, but I definitely remember reading it. Probably in chapter 13 with the time travel game?</p> <p>Anyway, apparently Hariezer has forgotten this (I hope this comes up and it’s not just a weird problem introduced for no reason?) because this turns out to be a decoy note from Quirrell to lure him to the forbidden corridor. 
After a whole bunch of people all show up at the forbidden corridor at the same time, and some chaos breaks out, Hariezer and Quirrell are the last men standing, which leads to this:</p> <blockquote> <p>An awful intuition had come over Harry, something separate from all the reasoning he’d done so far, an intuition that Harry couldn’t put into words; except that he and the Defense Professor were very much alike in certain ways, and faking a Time-Turned message was just the sort of creative method that Harry himself might have tried to bypass all of a target’s protections - … And Professor Quirrell had known a password that Bellatrix Black had thought identified the Dark Lord and his presence gave the Boy-Who-Lived a sense of doom and his magic interacted destructively with Harry’s and his favorite spell was Avada Kedavra and and and … Harry’s mouth was dry, even his lips were trembling with adrenaline, but he managed to speak. “Hello, Lord Voldemort.” Professor Quirrell inclined his head in acknowledgement, and said, “Hello, Tom Riddle.”</p> </blockquote> <p>We also indirectly find out that Quirrell killed Hermione (but we already knew that), although he did it by controlling professor Sprout (I guess to throw off the scent if he got caught?)</p> <p>Anyway, this pivotal plot moment seems to rely entirely on the fact that Hariezer forgot his own coded note system?</p> <h3 id="hpmor-105">HPMOR 105</h3> <p>So Quirrell gets Hariezer to cooperate with him, by threatening students, and offering to resurrect Hermione if he gets the philosopher’s stone</p> <blockquote> <p>And know this, I have taken hostages. I have already set in motion a spell that will kill hundreds of Hogwarts students, including many you called friends. I can stop that spell using the Stone, if I obtain it successfully. If I am interrupted before then, or if I choose not to stop the spell, hundreds of students will die.</p> </blockquote> <p>Hariezer does manage to extract a concession:</p> <blockquote> <p>Agreed,” hissed Professor Quirrell. “Help me, and you sshall have ansswerss to your quesstions, sso long ass they are about passt eventss, and not my planss for the future. I do not intend to raisse my hand or magic againsst you in future, sso long ass you do not raisse your hand or magic againsst me. Sshall kill none within sschool groundss for a week, unlesss I musst. Now promisse that you will not attempt to warn againsst me or esscape. Promisse to put forth your own besst efforts toward helping me to obtain the Sstone. And your girl-child friend sshall be revived by me, to true life and health; nor sshall me or mine ever sseek to harm her.” A twisted smile. “Promisse, boy, and the bargain will be sstruck.”</p> </blockquote> <p>So coming up we’ll get one of those chapters where the villain explains everything. Always a good sign when the villain does apparently nothing for 90 or so out of 100 chapters, and then explains the significance of everything at the very end.</p> <h3 id="hpmor-106">HPMOR 106</h3> <p>Not much happens here, Quirrell kills the 3 headed cerberus to get past the first puzzle. When Hariezer points out that might have alerted someone, Quirrell is all “eh, I fucked all the wards up.”</p> <p>So I guess more time to go before we get the villain monologue chapter.</p> <h3 id="hpmor-107">HPMOR 107</h3> <p>Still no villain monologue. 
Quirrell and Hariezer encounter the other puzzles from the book, and Quirrell blasts them to death with magic fire rather than actually solving them.</p> <p>However, Quirrell has some random reasons to not blast apart the potion room (he respects Snape or something, blah, blah). Anyway, apparently this means he’ll have to make a long and complicated potion, which will give Quirrell and Hariezer some time to talk.</p> <p>Side note: credit where credit is due, I again notice these chapters flow much better, and have a much smoother writing style. There is some wit in the way that Quirrell just hulk-smashes all the puzzles (although stopping at Snape’s puzzle seems like a contrived way to drop the monologue we know is coming next chapter or so into the story). When things are happening, HPMOR can be decently written.</p> <h3 id="hpmor-108-monologue">HPMOR 108: monologue</h3> <p>So we get the big “explain everything” monologue, and it’s kind of a letdown?</p> <p>The first secret we get- Hariezer is indeed a copy of Voldemort (which was just resolving some dramatic irony, we all knew this because we read the books). In a slight twist, we find out that he was intentionally a horcrux-</p> <blockquote> <p>It occurred to me how I might fulfill the Prophecy my own way, to my own benefit. I would mark the baby as my equal by casting the old horcrux spell in such fashion as to imprint my own spirit onto the baby’s blank slate… I would arrange with the other Tom Riddle that he should appear to vanquish me, and he would rule over the Britain he had saved. We would play the game against each other forever, keeping our lives interesting amid a world of fools.</p> </blockquote> <p>But apparently creating the horcrux created some sort of magic resonance and killed his body. But he had somehow made true-immortal horcruxes. Unfortunately, he had put them in stupid places like on the Pioneer probe or in volcanoes where people would never touch them, so he never managed to find a host (remember when I complained about that a few chapters back?).</p> <p>Hariezer does point out that Voldemort should have tested the new horcrux spell. He suggests Voldie failed to do so because he doesn’t think about doing nice things, but Voldie could have just horcruxed someone, killed them to test it, then destroyed the horcrux, then killed them for real. Not nice, pretty straightforward. It feels like this is going to be Voldemort’s weakness that gets exploited.</p> <p>We find out that the philosopher’s stone makes transfigurations permanent, which I guess is a minor twist on the traditional legend? Really, just a specific way of making it work- in the legends it can transmute metals, heal sickness, bring dead plants back to life, let you make homunculi, etc.</p> <p>In HPMOR, powerful magical artifacts can’t have been produced recently, because lore is lost or whatever, so we get a grimdark history of the stone, involving a Hogwarts student seducing Professor Baba Yaga to trick her into taking her virginity so she could steal the stone. Really incidental to anything. Anyway, Flamel, who stole the stone, is both man and woman and uses the stone to transmute back and forth, and apparently gave Dumbledore power to fight Grindelwald.</p> <p>Quirrell killed Hermione (duh) because</p> <blockquote> <p>I killed Miss Granger to improve your position relative to that of Lucius Malfoy, since my plans did not call for him to have so much leverage over you.</p> </blockquote> <p>I don’t think this actually makes much sense at all? 
It’s pretty clear Voldie plans to kill Hariezer as soon as this is over, so why should he care about Malfoy at all in this? I had admittedly assume he killed Hermione to help his dark side take over Hariezer or something.</p> <p>Apparently they raided Azkaban to find out where Black had hidden Quirrell’s wand.</p> <p>Also, as expected, Voldemort was both Monroe and Voldemort and was playing both sides in order to gain political power. He wanted to get political power because he was afraid muggles would destroy the world.</p> <p>Basically, every single reveal is basically what you’d expect from the books. Harry Potter and The Obvious Villain Monologue.</p> <p>The only open question is why the Hariezer-crux, given how that spell is supposed to work, didn’t have any of Voldemort’s memories up until that time? I expect we are supposed to chalk it up to “the spell didn’t quite work because of the resonance that blew everything up” or whatever.</p> <h3 id="hpmor-109-supreme-author-wank">HPMOR 109: supreme author wank</h3> <p>We get to the final mirror, and we get this bit of author wank:</p> <blockquote> <p>Upon a wall of metal in a place where no one had come for centuries, I found written the claim that some Atlanteans foresaw their world’s end, and sought to forge a device of great power to avert the inevitable catastrophe. If that device had been completed, the story claimed, it would have become an absolutely stable existence that could withstand the channeling of unlimited magic in order to grant wishes. And also - this was said to be the vastly harder task - the device would somehow avert the inevitable catastrophes any sane person would expect to follow from that premise. The aspect I found interesting was that, according to the tale writ upon those metal plates, the rest of Atlantis ignored this project and went upon their ways. It was sometimes praised as a noble public endeavor, but nearly all other Atlanteans found more important things to do on any given day than help. Even the Atlantean nobles ignored the prospect of somebody other than themselves obtaining unchallengeable power, which a less experienced cynic might expect to catch their attention. With relatively little support, the tiny handful of would-be makers of this device labored under working conditions that were not so much dramatically arduous, as pointlessly annoying. Eventually time ran out and Atlantis was destroyed with the device still far from complete. I recognise certain echoes of my own experience that one does not usually see invented in mere tales.”</p> </blockquote> <p>Get it? It’s friendly AI and we are all living in Atlantis! And Yud is bravely toiling away in obscurity to save us all! (Note: toiling in obscurity in this context means soliciting donations to continue running one of the least productive research organizations in existence.)</p> <p>Anyway, after this bit of wankery, Voldie and Hariezer return to the problem of how to get the stone. The answer turns out to be by confounding himself into thinking that he is Dumbledore wanting the stone back after Voldemort has been defeated.</p> <p>I point out that the book’s original condition where the way to get the stone was to not want the stone was vastly more clever. Don’t think of elephants, and all that.</p> <p>Anyway, after Voldemort gets the stone, Dumbledore shows up.</p> <h3 id="hpmor-110">HPMOR 110</h3> <p>Apparently Dumbledore was going to use the mirror as a trap to banish Voldemort. 
But when he saw Hariezer was with him, Dumbledore sacrificed himself to save Hariezer. So now Dumbledore is banished somewhere.</p> <h3 id="so-i-read-chapter-114">So I read chapter 114</h3> <p>And I kind of don’t get how Hariezer’s solution worked? He turned the ground near his wand into spider silk, but how did he get it around everyone’s neck? Why did he make it spider silk before turning it into nanowire?</p> <h3 id="hpmor-111">HPMOR 111</h3> <p>So in this chapter, Hariezer is stripped of wand, pouch and time turner, and Voldemort has the philosopher’s stone. It’s looking pretty bad for our hero.</p> <p>Voldemort walks Hariezer to an altar he has prepared,does some magic stuff, and a new shiny body appears for him. So now he has fully resurrected.</p> <p>Voldemort then brings back Hermione (Hariezer HAD stolen the remains). I note that I don’t think Hariezer has actually done anything in this entire story- his first agenda- study magic using science- was a complete bust, and then his second agenda was a big resolution to bring back Hermione, but Voldemort did it for him. Voldemort also crossed Hermione with a troll and unicorn (creating a class Trermionecorn) so she is now basically indestructible. Why didn’t Voldemort do this to his own body? No idea. Why would someone obsessed with fighting death and pissed as hell about how long it took him to reclaim his old body think to give Hermione wolverine-level indestructibility but not himself? Much like the letter Hariezer got to set this up, it’s always bad when characters have to behave out of character in order to set the plot up.</p> <p>Anyway, to bring Hermione back Voldie gives Hariezer his wand back so he can hit her with his super-patronus. So he now has a wand.</p> <p>Voldemort then asks Hariezer for Roger Bacon’s diary (which he turns into a Hermione horcrux), which prompts Hariezer to say</p> <blockquote> <p>I tried translating a little at the beginning, but it was going slowly -” Actually, it had been excruciatingly slow and Harry had found other priorities.</p> </blockquote> <p>Yep, doing experiments to discover magic, despite being the premise of the first 20ish chapters then immediately stopped being any priority at all. Luckily it shows up here as an excuse for Hariezer to get his pouch back (he retrieved it to get the diary).</p> <p>While Voldemort is distracted making the horcrux, Hariezer whips a gun out of the pouch and shoots Voldemort. Something tells me it didn’t work.</p> <h3 id="hpmor-112">HPMOR 112</h3> <p>Not surprisingly, the shots did nothing to Voldemort who apparently can create a wall of dirt faster than a bullet travels. Apparently it was a trick, because Voldemort had created some system where no Tom Riddle could kill any other Tom Riddle unless the first had attacked him. Somehow, Hariezer hadn’t been bound to it, and now neither of them are.</p> <p>I don’t know why Yudkowsky introduced this curse and then had it broken immediately? It would have been more interesting to force Hariezer to deal with Voldemort without taking away his immortality. 
Actually, given all the focus on memory charms in the story, it’s pretty clear that when he achieves ultimate victory Hariezer will do it with a memory charm- turning Tom Riddle into an immortal amnesiac or implanting other memories in him (so he is an immortal guy who works in a restaurant or something).</p> <p>In retaliation, Voldemort hits him with a spell that takes everything but his glasses and wand, so he is naked in a graveyard, and a bunch of death eaters teleport in (37 of them).</p> <h3 id="hpmor-113">HPMOR 113</h3> <p>This is a short chapter, Hariezer is still in peril. As Voldemort’s death eaters pop in, one of them tries to turn against him but Voldemort kills him dead.</p> <p>And then he forces Hariezer to swear an oath to not try to destroy the world- Voldemort’s plan is apparently to resurrect Hermione, force Hariezer to not destroy the world, and then to kill him. (I note that if he did things in a slightly different order, he’d have already won…)</p> <p>So this chapter ends with Hariezer surrounded by death eaters, all with wands pointed at him, ready to kill him, and we get this (sorry, it’s a long quote).</p> <blockquote> <p>This is your final exam. Your solution must at least allow Harry to evade immediate death, despite being naked, holding only his wand, facing 36 Death Eaters plus the fully resurrected Lord Voldemort. <em>12:01AM Pacific Time</em> (8:01AM UTC) on Tuesday, March 3rd, 2015, the story will continue to Ch. 121. Everyone who might want to help Harry thinks he is at a Quidditch game. he cannot develop wordless wandless Legilimency in the next 60 seconds. the Dark Lord’s utility function cannot be changed by talking to him. the Death Eaters will fire on him immediately. if Harry cannot reach his Time-Turner without Time-Turned help - then the Time-Turner will not come into play. Harry is allowed to attain his full potential as a rationalist, now in this moment or never, regardless of his previous flaws. if you are using the word ‘rational’ correctly, is just a needlessly fancy way of saying ‘the best solution’ or ‘the solution I like’ or ‘the solution I think we should use’, and you should usually say one of the latter instead. (We only need the word ‘rational’ to talk about ways of thinking, considered apart from any particular solutions.) if you know exactly what a smart mind would do, you must be at least that smart yourself….</p> </blockquote> <p>The issue here is that literally any solution, no matter how outlandish will work if you can design the rules to make the munchkin work.</p> <p>Hariezer could use partial transfiguration to make an atomic lattice under the earth with a few atoms of anti-matter under each death eater to blow up each of them.</p> <p>He could use the fact that Voldemort apparently cast broomstick spells on his legs to levitate Voldemort. And after levitating Voldemort away convince the death eaters he is even more powerful than Voldie.</p> <p>He could use the fact that Tom Riddle controls the dark mark to burn all the death eaters where they stand- Voldemort can’t kill Hariezer with magic because of the resonance, and a simple shield should stop bullets.</p> <p>He could partially transfigure a dead man’s switch of sorts, so that if he dies fireworks go up and the quidditch game gets disrupted- he knows it didn’t, so he knows he won’t die.</p> <p>In my own solution I posted earlier, he could talk his way out using logic-ratio-judo. 
Lots of options other than my original posted solution here- convince them that killing Hariezer makes the prophecy come true because of science reasons, etc.</p> <p>He could partially transfigure poison gas or knockout gas.</p> <p>Try to come up with your own outlandish solution. The thing is, all of these work because we don’t really know the rules of HPMOR magic- we have a rough outline, but there is still a lot of author flexibility. So there is that.</p> <h3 id="hpmor-114-solutions-to-the-exam">HPMOR 114: solutions to the exam</h3> <p>Hariezer transfigures part of his wand into carbon nanotubes that he makes into a net around each death eater’s head and around Voldemort’s hands.</p> <p>He uses it to decapitate all the death eaters (killing Malfoy’s father almost certainly) and take Voldemort’s hands off.</p> <p>Voldemort then charges at him but Hariezer hits him with a stunning spell.</p> <p>I note that this solution is as outlandish as any.</p> <p>It WAS foreshadowed in the very first chapter, but that doesn’t make it less outlandish.</p> <h3 id="hpmor-115-116-117">HPMOR 115-116-117</h3> <p>Throughout the story, whenever there is action we then have 5-10 chapters where everyone digests what happened. These chapters all fit in that vein.</p> <p>Hariezer rearranges the scene of his victory to put Voldemort’s hands around Hermione’s neck and then rigs a big explosion. He then time turners back to the quidditch game (why the hell did he have time left on his time turner?).</p> <p>Anyway, when he gets back to the game he does a whole “I CAN FEEL THE DARK LORD COMING” thing, and says that Hermione followed him back. I guess the reason for this plot is just to keep him out of it/explain how Hermione came back? You’d think he could have just hung around the scene of the crime, waited till he was discovered, and explained what happened?</p> <p>Then in chapter 117, McGonagall explains to the school what was found- Malfoy’s father is dead, Hermione is alive, etc.</p> <h3 id="so-after-the-big-hpmor-reveal">So after the big HPMOR reveal</h3> <p>It sort of feels like HPMOR is just the overly wordy gritty Potter reboot with some science stuff slapped onto the first 20 chapters or so.</p> <p>Like, Potter is still a horcrux, Voldemort still wanted to take over the wizard world and kill the muggles, etc.</p> <p>Even the anti-death themes have fallen flat because of “show, don’t tell”- Dumbledore was a “deathist” but he was on the side of not killing all the muggles, Voldemort actually defeated death but that position rides alongside the kill-everyone ethos, and Hariezer’s resolution to end death apparently was going to blow up the entire world. So the characters might argue the positions, but for reasons I don’t comprehend, actually following through is shown as a terrible idea.</p> <h3 id="i-read-the-rest-of-hpmor-113-puzzle">I read the rest of HPMOR/113 puzzle</h3> <p>I read HPMOR, and will put up chapter updates when I have time, but I wanted to put down my version of how Hariezer will get out of the predicament at the end of 113. I fear that if I put this down after the next chapter is released, and if it’s correct, people will say I looked ahead.</p> <p>Anyway, the way that this would normally be solved in HPMOR is simple: the time turner- Hariezer would resolve to go and find McGonagall or Bones or whoever when this was all over and tell them to time turner into this location and bring the heat. Or whatever. 
But that has been disallowed by the rules.</p> <p>But I think Yud is setting this up as a sort of “AI box” experiment, because he has his obsessions and they show up time and time again. So the solution is simply to convince Voldemort to let him go. How? In a version of Roko’s basilisk, he needs to convince Voldemort that they might be in a simulation- i.e. maybe they are still looking in the mirror. Hasn’t everything gone a little too well since they first looked in? Dumbledore was vanquished, bringing back Hermione was practically easy, every little thing has been going perfectly. Maybe the mirror is just simulating what he wants to see?</p> <p>So two ways to go from here- Hariezer is also looking in the mirror (and he has also gotten what he wanted, Hermione being brought back) so he might be able to test this just by wishing for something.</p> <p>Or, Hariezer can convince Voldemort that the only way to know for sure is for Voldemort to fail to get something he wants, and the last thing he wants is for Hariezer to die.</p> <h3 id="hpmor-118">HPMOR 118</h3> <p>The story is still in resolution mode, and I want to point out one thing that this story does right that the original books failed at- which is an actual resolution. The one big failure of the original Harry Potter books, in my mind, was that after Voldemort was defeated, we flashed immediately to the epilogue. No funeral for the departed (no chance to say goodbye to the departed Fred Weasley, etc.).</p> <p>Of course, in the HPMOR style, there is a huge resolution after literally every major event in the story, so it’s at least in part a stopped clock situation.</p> <p>This chapter is Quirrell’s funeral, which is mostly a student giving a long eulogy (remember, Hariezer dressed things up to make it look like Quirrell died fighting Voldemort, which is sort of true, but not the Quirrell anyone knew.)</p> <h3 id="hpmor-119">HPMOR 119</h3> <p>Still in resolution mode.</p> <p>Hariezer comes clean with (essentially) the Order of the Phoenix members, and tells them about how Dumbledore is trapped in the mirror. This leads to him receiving some letters Dumbledore left.</p> <p>We find out that Dumbledore has been acting to fulfill a certain prophecy that Hariezer plays a role in-</p> <blockquote> <p>Yet in your case, Harry, and in your case alone, the prophecies of your apocalypse have loopholes, though those loopholes be ever so slight. Always ‘he will end the world’, not ‘he will end life’.</p> </blockquote> <p>So I guess he’ll bring in the transhumanist future.</p> <p>Hariezer has also been given Dumbledore’s place, which Amelia Bones is annoyed at, so he makes her regent until he is old enough.</p> <p>We also get one last weird pointless rearrangement of the canon books- apparently Peter Pettigrew was one of those shape-shifting wizards and somehow got tricked into looking like Sirius Black. So the wrong person has been locked up in Azkaban. I don’t really “get” this whole Sirius/Peter were lovers/Sirius was evil/the plot of book 3 was a schizophrenic delusion running thread (also, Hariezer deduces, with only the evidence that there is a Black in prison and a dead Black among the death eaters, that Peter Pettigrew was a shapeshifter, that Peter imitated Black, and that Peter is the one in Azkaban.)</p> <p>And Hariezer puts a plan in place to open a hospital using the philosopher’s stone, so BAM, death is defeated, at least in the wizarding world. 
Unless it turns out the stone has limits or something.</p> <h3 id="hpmor-120">HPMOR 120</h3> <p>More resolution.</p> <p>Hariezer comes clean to Draco about how the death eaters died. (why did he go to the effort of the subterfuge, if he was going to come clean to everyone afterwards? It just added a weird layer to all this resolution).</p> <p>Draco is sad his parents are dead. BUT, surprise- as I predicted way back when, Dumbledore only faked Narcissa Malfoy’s death and put her in magic witness protection.</p> <h3 id="i-think-one-of-the-things-i-strongly-dislike-about-hpmor-is-that-there-doesn-t-seem-to-be-any-joy">I think one of the things I strongly dislike about HPMOR is that there doesn’t seem to be any joy...</h3> <p>I think one of the things I strongly dislike about HPMOR is that there doesn’t seem to be any joy purely in the discovery. People have fun playing the battle games, or fighting bullies with time turners, or generally being powerful, but no one seems to have fun just trying to figure things out.</p> <p>For some reason (the reason is that I have a fair amount of scotch in me actually), my brain keeps trying to put together an imprecise metaphor to old SNES rpgs- a friend of mine in grade school loved FF2, but he always went out of his way to find all the powerups and do all the side quests,etc. This meant he was always powerful enough to smash boss fights in one or two punches. And I always hated that- what is the fun in that? What is the challenge? When things got too easy, I started running from all the random encounters and stopped buying equipment so that the boss battles were more fun.</p> <p>And HPMOR feels like playing the game the first way- instead of working hard at the fun part (discovery), you get to just use Aristotle’s method (Harry Potter and Methods of Aristotelian Science) and slap an answer down. And that answer makes you more powerful- you can time turner all your problems away like shooing a misquito with a flamethrower, when a dementor shows up you get to destroy it just by thinking hard- no discovery required. The story repeatedly skips the fun part- the struggle, the learning, the discovery.</p> <h3 id="hpmor-121">HPMOR 121</h3> <p>Snape leaves hogwarts, thus completing an arc I don’t think I ever cared about.</p> <h3 id="hpmor-122-the-end-of-the-beginning">HPMOR 122: the end of the beginning</h3> <p>So unlike the canon books, the end of HPMOR sets it up more as an origin story than a finished adventure. After the canon books, we get the impression Harry, Ron and Hermione settled into peaceful wizard lives. After HPMOR, Hariezer has set up a magical think tank to solve the problem of friendly magic, with Hermione as his super-powered, indestructible lab assistant (tell me again how Hariezer isn't a self insert?) , and we get the impression the real work is just starting. He also has the idea to found CFAR:</p> <blockquote> <p>It would help if Muggles had classes for this sort of thing, but they didn't. 
Maybe Harry could recruit Daniel Kahneman, fake his death, rejuvenate him with the Stone, and put him in charge of inventing better training methods...</p> </blockquote> <p>We also learn that a more open society of idea sharing is an idea so destructive that Hariezer's vow not to end the world wouldn't let him do it:</p> <blockquote> <p>Harry could look back now on the Unbreakable Vow that he'd made, and guess that if not for that Vow, disaster might have already been set in motion yesterday when Harry had wanted to tear down the International Statute of Secrecy.</p> </blockquote> <p>So secretive magiscience led by Hariezer (with Hermione as super-powered &quot;Sparkling Unicorn Princess&quot; sidekick) will save the day, sometime in the future.</p> su3su2u1 physics tumblr archive su3su2u1/physics/ Tue, 01 Mar 2016 00:00:00 +0000 su3su2u1/physics/ <p>These are archived from the now defunct <a href="http://su3su2u1.tumblr.com/">su3su2u1 tumblr</a>.</p> <h3 id="a-roundabout-approach-to-quantum-mechanics">A Roundabout Approach to Quantum Mechanics</h3> <p>This will be the first post in what I hope will be a series that outlines some ideas from quantum mechanics. I will try to keep it light, and not overly math filled- which means I’m not really teaching you physics. I’m teaching you some flavor of the physics. I originally wrote here “you can’t expect to make ice cream just having tasted it,” but I think a better description might be “you can’t expect to make ice cream just having heard someone describe what it tastes like.” AND PLEASE, PLEASE PLEASE ask questions. I’m used to instant feedback on my (attempts at) teaching, so if readers aren’t getting anything out of this, I want to stop or change or something.</p> <p>Now, unfortunately I can’t start with quantum mechanics without talking about classical physics first. Most people think they know classical mechanics, having learned it on their mother’s knee, but there are so, so many ways to formulate classical physics, and most physics majors don’t see some really important ones (in particular Hamiltonian and Lagrangian mechanics) until after quantum mechanics. This is silly, but at the same time university is only 4 years. I can’t possibly teach you all of these huge topics, but I will need to rely on a few properties of particles and of light. And unlike intro Newtonian mechanics, I want to focus on paths. Instead of asking something like “a particle starts here with some velocity, where does it go?” I want to focus on “a particle starts here, and ends there. What path did it take?”</p> <p>So today we start with light, and a topic I rather love. Back in the day, before “nerd-sniping” several generations of mathematicians, Fermat was laying down a beautiful formulation of optics-</p> <p>Light always takes the path of least time</p> <p>I hear an objection “isn’t that just straight lines?” We have to combine this insight with the notion that light travels at different speeds in different materials. For instance, we know light slows down in water by a factor of about 1.3.</p> <p>So let’s look at a practical problem: you see a fish swimming in water (I apologize in advance for these diagrams):</p> <p><img src="images/su3su2u1/image_001.jpg"></p> <p>I drew the (hard to see) dotted straight line between your eye and the fish.</p> <p>But that isn’t what the light does- there is a path that saves the light some time. The light travels faster in air than in water, so it can travel further in the air, and take a shorter route in the water to the fish.</p> <p><img src="images/su3su2u1/image_002.jpg"></p> <p>This is a more realistic path for the light- it bends when it hits the water- it does this in order to take paths of least time between points in the water and points in the air. Exercise for the mathematical reader- you can work this out quantitatively and derive Snell’s law (the law of refraction) just from the principle of least time.</p>
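<p>If you’d rather see that numerically than do the calculus, here’s a rough sketch (the distances are completely made up, and the only physics in it is that light is about 1.3 times slower in water): just scan over where the ray crosses the surface, and keep the crossing point that minimizes the total travel time.</p>

```python
# Rough numerical version of the fish picture: an eye above the water, a fish
# below it, and a ray that crosses the surface at some point (x, 0).
# All distances are made up; water slows light by a factor of about 1.3.
import numpy as np

c_air = 1.0            # speed of light in air (arbitrary units)
c_water = c_air / 1.3  # light is ~1.3x slower in water

eye = (0.0, 1.0)       # eye 1 unit above the surface (the surface is y = 0)
fish = (2.0, -1.0)     # fish 2 units over and 1 unit below the surface

def travel_time(x):
    """Total travel time if the ray crosses the surface at (x, 0)."""
    d_air = np.hypot(x - eye[0], eye[1])      # eye -> crossing point
    d_water = np.hypot(fish[0] - x, fish[1])  # crossing point -> fish
    return d_air / c_air + d_water / c_water

# Scan a bunch of candidate crossing points and keep the fastest one.
xs = np.linspace(0.0, 2.0, 200_001)
best = xs[np.argmin(travel_time(xs))]

print(f"straight-line crossing: x = 1.000, time = {travel_time(1.0):.5f}")
print(f"least-time crossing:    x = {best:.3f}, time = {travel_time(best):.5f}")

# Sanity check against Snell's law: sin(angle in air) / sin(angle in water)
# should come out at about 1.3, the ratio of the speeds.
sin_air = (best - eye[0]) / np.hypot(best - eye[0], eye[1])
sin_water = (fish[0] - best) / np.hypot(fish[0] - best, fish[1])
print(f"sin(air) / sin(water) = {sin_air / sin_water:.3f}")
```

<p>The least-time crossing point is not the straight-line one- the ray bends at the surface, and the ratio of sines comes out at about 1.3, which is exactly Snell’s law.</p>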
<p>And one more realistic example: Lenses. How do they work?</p> <p><img src="images/su3su2u1/image_003.jpg"></p> <p>So that bit in the middle is the lens, and we are looking at light paths that leave 1 and travel to 2 (or vice versa, I guess).</p> <p>The lens is thickest in the middle, so the dotted line gets slowed the most. Path b is longer, but it spends less time in the lens- that means with careful attention to the shape of the lens we can make the time of path b equal to the time of the dotted path.</p> <p>Path a is the longest path, and just barely touches the lens, so it is barely slowed at all- it too can be made to take the same time as the dotted path (and path b).</p> <p>So if we design our lens carefully, all of the shortest-time paths that touch the lens end up focused back to one spot.</p> <p>So that’s the principle of least time for light. When I get around to posting on this again we’ll talk about particles.</p> <p>Now, these sorts of posts take some effort, so PLEASE PLEASE PLEASE tell me if you got something out of this.</p> <p>Edit: And if you didn’t get anything out of this, because it’s confusing, ask questions. Lots of questions, any questions you like.</p> <h3 id="more-classical-physics-of-paths">More classical physics of paths</h3> <p>So after some thought, these posts will probably be structured by first discussing light, and then turning to matter, topic by topic. It might not be the best structure, but it’s at least giving me something to organize my thoughts around.</p> <p>As in all physics posts, please ask questions. I don’t know my audience very well here, so any feedback is appreciated. Also, there is something of an uncertainty principle between clarity and accuracy. I can be really clear or really accurate, but never both. I’m hoping to walk the middle line here.</p> <p>Last time, I mentioned that geometric optics can be formulated by the simple principle that light takes the path of least time. This is a bit different from many of the physics theories you are used to- generally questions are phrased along the lines of “Alice throws a football from position x, with velocity v, where does Bob need to be to catch the football.” i.e. we start with an initial position and velocity. Path-based questions are usually “a particle starts at position x_i, t_i and ends at position x_f, t_f, what path did it take?”</p> <p>For classical non-relativistic mechanics, the path-based formulation is fairly simple: we construct a quantity called the “Lagrangian” which is defined by subtracting potential energy from kinetic energy (KE - PE). Recall that kinetic energy is 1/2 mv^2, where m is the mass of the particle and v is the velocity, and potential energy depends on the problem. If we add up the Lagrangian at every instant along a path we get a quantity called the action (S is the usual symbol for action, for some reason) and particles take the path of least action. 
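<p>In other words, if you chop the path into little time slices of length (\Delta t) and add up (KE - PE) for each slice, you’re computing something like</p> <p>[S \approx \sum_i \left( KE_i - PE_i \right) \Delta t ]</p>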
If you know calculus, we can put this as</p> <p>[S = \int \left( KE - PE \right) dt ]</p> <p>The action has units of energy*time, which will be important in a later post.</p> <p>Believe it or not, all of Newton’s laws are contained in this minimization principle. For instance, consider a particle moving with no outside influences (no potential energy). Such a particle has to minimize its v^2 over the path it takes.</p> <p>Any movement away from the straight line will cause an increase in the length of the path, so the particle will have to travel faster, on average, to arrive at its destination. We want to minimize v^2, so we can deduce right away the particle will take a straight line path.</p> <p>But what about its speed? Should a particle move very slowly to decrease v^2 as it travels, and then “step on the gas” near the end? Or travel at a constant speed? It’s easy to show that the minimum-action path is the constant-speed path (give it a try!). This gives us back Newton’s first law.</p> <p>You can also consider the case of a ball thrown straight up into the air. What path should it take? Now we have potential energy mgh (where h is the height, and g is a gravitational constant). But remember, we subtract the potential energy in the action- so the particle can lower its action by climbing higher.</p> <p>Along the path of least action in a gravitational field, the particle will move slowly at high h to spend more time where the Lagrangian is low, and will speed up as h decreases (it needs to have an average velocity large enough to get to its destination on time). If you know calculus of variations, you can calculate the required relationship, and you’ll find you get back exactly the Newtonian relationship (acceleration of the particle = g).</p> <p>Why bother with this formulation? It makes a lot of problems easier. Sometimes specifying all the forces is tricky (imagine a bead sliding on a metal hoop. The hoop constrains the bead to move along the circular hoop, so the forces are just whatever happens to be required to keep the bead from leaving the hoop. But the energy can be written very easily if we use the right coordinates). And with certain symmetries it’s a lot more elegant (a topic I’ll leave for another post).</p> <p>So to wrap up both posts- light takes the path of least time, particles take the path of least action. (One way to think about this is that light has a Lagrangian that is constant. This means that the only way to lower the action is to find the path that takes the least time). This is the takeaway point I need for later- in classical physics, particles take the path of least action.</p> <p>I feel like this is a lot more confusing than previous posts because it’s hard to calculate concrete examples. Please ask questions if you have them.</p>
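<p>Here is one concrete toy example you can play with numerically, if that helps (the numbers are made up and the discretization is crude- this is just a sketch of the idea): take the Newtonian parabola for a ball thrown straight up and caught at the same height, wiggle it while keeping the endpoints pinned, and watch what the action does.</p>

```python
# Rough numerical check that the Newtonian path minimizes the action
# S = (sum over time slices of KE - PE) * dt, for a ball thrown straight up.
import numpy as np

m, g, T = 1.0, 9.8, 2.0        # mass, gravity, total flight time (toy numbers)
t = np.linspace(0.0, T, 2001)  # time slices
dt = t[1] - t[0]

def action(h):
    """Add up (kinetic - potential) energy along the whole path h(t)."""
    v = np.gradient(h, dt)                    # velocity at each time slice
    lagrangian = 0.5 * m * v**2 - m * g * h   # KE - PE
    return np.sum(lagrangian) * dt

newton_path = 0.5 * g * t * (T - t)  # the usual parabola, with h(0) = h(T) = 0

# Wiggle the path by various amounts. The wiggle vanishes at the endpoints,
# so every candidate path still starts and ends at the same place and time.
wiggle = np.sin(np.pi * t / T)
for eps in (-1.0, -0.3, 0.0, 0.3, 1.0):
    print(f"wiggle size {eps:+.1f}:  S = {action(newton_path + eps * wiggle):.3f}")
```

<p>The smallest action shows up at zero wiggle, i.e. on the Newtonian parabola, which is the whole point.</p>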
<h3 id="semi-classical-light">Semi classical light</h3> <p>As always math will not render properly on tumblr dash, but will on the blog. This post contains the crux of this whole series of posts, so it’s really important to try to understand this argument.</p> <p>Recall from the first post I wrote that one particularly elegant formulation of geometric optics is Fermat’s principle:</p> <p>light takes the path of least time</p> <p>But, says a young experimentalist (pun very much intended!), look what happens when I shine light through two slits, I get a pattern like this:</p> <p><img src="images/su3su2u1/image_004.jpg"></p> <p>Light must be a wave.</p> <p>&quot;Wait, wait, wait!&quot; I can hear you saying. Why does this two slit thing mean that light is a wave?</p> <p>Let us talk about the key feature of waves- when waves come together they can combine in different ways:</p> <p>TODO: broken link.</p> <p>So when physicists want to represent waves, we need to take into account not just the height of the wave, but also the phase of the wave. The wave can be at “full hump” or “full trough” or anywhere in between.</p> <p>The technique we use is called “phasors” (not to be confused with phasers). We represent waves as little arrows, spinning around in a circle:</p> <p>TODO: broken link.</p> <p>The length of the arrow, A, is called the amplitude and represents the height of the wave. The angle, (\theta), represents the phase of the wave. (The mathematical sophisticates among us will recognize these as complex numbers of the form (Ae^{i\theta}).) With these arrows, we can capture all the add/subtract/partially-add features of waves:</p> <p>TODO: broken link.</p> <p>So how do we use this to explain the double slit experiment? First, we assume all the light that leaves the same source has the same amplitude. And the light has a characteristic period, T. It takes T seconds for the light to go from “full trough” back to “full trough” again.</p> <p>In our phasor diagram, this means we can represent the phase of our light after t seconds as:</p> <p>[\theta = \frac{2\pi t}{T} ]</p> <p>Note, we are taking the angle here in radians. 2 pi is a full circle. That way when t = T, we’ve gone a full circle.</p> <p>We also know that light travels at speed c (c being the “speed of light,” after all). So as light travels a path of length L, the time it traveled is easily calculated as (\frac{L}{c}).</p> <p>Now, let’s look at some possible paths:</p> <p><img src="images/su3su2u1/image_005.jpg"></p> <p>The light moves from the dot on the left, through the two slits, and arrives at the point X. Now, for the point X at the center of the screen, both paths will have equal lengths. This means the waves arrive with no difference in phase, and they add together. We expect a bright spot at the center of the screen (and we do get one).</p> <p>Now, let’s look at points further up the screen:</p> <p><img src="images/su3su2u1/image_006.jpg"></p> <p>As we move away from the center, the paths have different lengths, and we get a phase difference in the arriving light:</p> <p>[\theta_1 - \theta_2 = \frac{2\pi}{cT} \left(L_1 - L_2\right) ]</p> <p>So what happens? As we move up the wall, the length difference gets bigger and the phase difference increases. Every time the phase difference is an odd multiple of pi we get cancellation, and a dark spot. Every time it’s a multiple of 2 pi, we get a bright spot. This is exactly Young’s result.</p> <p>But wait a minute, I can hear a bright student piping up (we’ll call him Feynman, but it would be more appropriate to call him Huygens in this case). Feynman says “What if there were 3 slits?”</p> <p>Well, then we’d have to add up the phasors for 3 different slits. It’s more algebra, but when they all line up it’s a bright spot, when they all cancel it’s a dark spot, etc. We could even have places where two cancel out, and one doesn’t.</p> <p>&quot;But, what if I made a 4th hole?&quot; We add up four phasors. &quot;A 5th?&quot; We add up 5 phasors.</p> <p>&quot;What if I drilled infinite holes? Then the screen wouldn’t exist anymore! Shouldn’t we recover geometric optics then?&quot;</p> <p>Ah! Very clever! But we DO recover geometric optics. Think about what happens if we add up infinitely many paths. 
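<p>Before going on, here’s a crude numerical version of the argument that’s coming (toy numbers, nothing physical about them): add up a big pile of little arrows with completely random phases, and compare that to a pile of arrows whose phases come from sampling a smooth function near its minimum, where the phase barely changes from one arrow to the next.</p>

```python
# Crude numerical comparison: arrows with random phases vs arrows whose phases
# come from a smooth function near its minimum. Toy numbers only.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Case 1: phases scattered completely at random (like grabbing arbitrary paths).
random_phases = rng.uniform(0.0, 2.0 * np.pi, n)
random_sum = abs(np.sum(np.exp(1j * random_phases)))

# Case 2: phases from a function that flattens out near its minimum at s = 0,
# so many neighbouring arrows point in nearly the same direction.
s = np.linspace(-0.1, 0.1, n)   # a parameter labelling the paths
near_minimum_phases = 1000.0 * s**2
near_minimum_sum = abs(np.sum(np.exp(1j * near_minimum_phases)))

print(f"|sum| of {n} random-phase arrows: {random_sum:9.0f}")
print(f"|sum| of {n} near-minimum arrows: {near_minimum_sum:9.0f}")
print(f"(perfectly lined-up arrows would give {n})")
```

<p>The random-phase arrows mostly eat each other; the near-minimum arrows add up to something enormous by comparison.</p>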
We are essentially adding up infinitely many random phasors of the same amplitude:</p> <p><img src="images/su3su2u1/image_007.jpg"></p> <p>So we expect all these random paths to cancel out.</p> <p>But there is a huge exception.</p> <p>Those random angles are because when we grab an arbitrary path, the time light takes on that path is random.</p> <p>But what happens near a minimum? If we parameterize our random paths, near the minimum the graph of time-of-travel vs parameter looks like this:</p> <p><img src="images/su3su2u1/image_008.jpg"></p> <p>The graph gets flat near the minimum, so all those little Xs have roughly the same phase, which means all those phasors will add together. So the minimum path gets strongly reinforced, and all the other paths cancel out.</p> <p>So now we have one rule for light:</p> <p>To calculate how light moves forward in time, we add up the associated phasors for light traveling every possible path.</p> <p>BUT, when we have many, many paths we can make an approximation. With many, many paths the only one that doesn’t cancel out, the only one that matters, is the path of minimum time.</p> <h3 id="semi-classical-particles">Semi-classical particles</h3> <p>Recall from the previous post that we had improved our understanding of light. Light we suggested, was a wave which means</p> <p>Light takes all possible paths between two points, and the phase of the light depends on the time along the path light takes.</p> <p>Further, this means:</p> <p>In situations where there are many, many paths the contributions of almost all the paths cancel out. Only the path of least time contributes to the result.</p> <p>The astute reader can see where we are going. We already learned that classical particles take the path of least action, so we might guess at a new rule:</p> <p>Particles take all possible paths between two points, and the phase of the particle depends on the action along the path the particle takes.</p> <p>Recall from the previous post that the way we formalized this is that the phase of light could be calculated with the formula</p> <p>[\theta = \frac{2\pi}{T} t]</p> <p>We would like to make a similar formula for particles, but instead of time it must depend on the action, but will we do for the particle equivalent of the “period?” The simplest guess we might take is a constant. Lets call the constant h, planck’s constant (because thats what it is). It has to have the same units of action, which are energy*time.</p> <p>[\theta = \frac{2\pi}{h} * S]</p> <p>Its pretty common in physics to use a slightly different constant (\hbar = \frac{h}{2\pi} ) because it shows up so often.</p> <p>[\theta = \frac{S}{\hbar}]</p> <p>So we have this theory- maybe particles are really waves! We’ll just run a particle through a double slit and we’ll see a pattern just like the light!</p> <p>So we set up our double slit experiment, throw a particle at the screen, and blip. We pick up one point on the other side. Huh? I thought we’d get a wave. So we do the experiment over and over again, and this results</p> <p><img src="images/su3su2u1/image_009.jpg"></p> <p>So we do get the pattern we expected, but only built up over time. What do we make of this?</p> <p>Well, one things seems obvious- the outcome of a large number of experiments fits our prediction very well. So we can interpret the result of our rule as a probability instead of a traditional fully determined prediction. 
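</p> <p>(A quick toy illustration of that blip-by-blip buildup, again mine and not from the original post: the sketch below treats the two-slit intensity, the squared length of the summed phasors, as a probability distribution and draws individual hits from it. With a handful of hits you just see noise; with many, the fringes appear.)</p> <pre><code>import cmath, math, random

random.seed(0)
d, D, lam = 1e-5, 1.0, 5e-7                 # same made-up two-slit geometry as above

def weight(x):
    # two phasors, one per slit; squared length of their sum
    total = sum(cmath.exp(2j * math.pi * math.hypot(D, x - s) / lam)
                for s in (-d / 2, d / 2))
    return abs(total) ** 2

xs = [(i - 50) * 1e-3 for i in range(101)]  # detector positions across the screen
ws = [weight(x) for x in xs]

for n_hits in (20, 200, 20000):
    counts = [0] * len(xs)
    for i in random.choices(range(len(xs)), weights=ws, k=n_hits):
        counts[i] += 1
    bins = [sum(counts[5 * b:5 * b + 5]) for b in range(20)]   # coarse histogram
    top = max(bins)
    print(f'{n_hits:6d} hits  ' + ''.join(' .:-=#'[round(5 * c / top)] for c in bins))
</code></pre> <p>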
But probabilities have to be positive, so we’ll say the probability is proportional to the square of our amplitude.</p> <p>So lets rephrase our rule:</p> <p>To predict the probability that a particle will arrive at a point x at time t, we take a phasor for every possible path the particle can take, with a phase depending on the action along the path, and we add them all up. Squaring the amplitude gives us the probability.</p> <p>Now, believe it or not, this rule is exactly equivalent to the Schroedinger equation that some of us know and love, and pretty much everything you’ll find in an intro quantum book. Its just a different formulation. But you’ll note that I called it “semi-classical” in the title- thats because undergraduate quantum doesn’t really cover fully quantum systems, but thats a discussion for a later post.</p> <p>If you are familiar with Yudkowsky’s sequence on quantum mechanics or with an intro textbook, you might be used to thinking of quantum mechancis as blobs of amplitude in configuration space changing with time. In this formulation, our amplitudes are associated with paths through spacetime.</p> <p>When next I feel like writing again, we’ll talk a bit about how weird this path rule really is, and maybe some advantages to thinking in paths.</p> <h3 id="basic-special-relativity">Basic special relativity</h3> <p>LNo calculus or light required, special relativity using only algebra. Note- I’m basically typing up some lecture notes here, so this is mostly a sketch.</p> <p>This derivation is based on key principle that I believe Galileo first formulated-</p> <p>The laws of physics are the same in any inertial frame OR there is no way to detect absolute motion. Like all relativity derivations, this is going to involve a thought experiment. In our experiment we have a train that moves from one of train platform to the other. At the same time a toy airplane also flies from one of the platform to the other (originally, I had made a Planes,Trains and Automobiles joke here, but kids these days didn’t get the reference… ::sigh::)</p> <p>There are two events, event 1- everything starts at the left side of the platform. Event 2- everything arrives at the right side of the platform. The entire time the train is moving with a constant velocity v from the platform’s perspective (symmetry tells us this also means that the platform is moving with velocity v from the train’s perspective.)</p> <p>We’ll look at these two events from two different perspectives- the perspective of the platform and the perspective of the train. The goal is to figure out a set of equations that let us relate quantities between the different perspectives.</p> <p>HERE COMES A SHITTY DIAGRAM</p> <p><img src="images/su3su2u1/image_010.jpg"></p> <p>The dot is the toy plane, the box is the train. L is the length of the platform from its own perspective. l is the length of the train from it’s own perspective. T is the time it takes the train to cross the platform from the platform’s perspective. And t is the time the platform takes to cross the train from the train’s perspective.</p> <p>From the platform’s perspective, it’s easy to see the train has length l’ = L - vT. 
And the toy plane has speed w = L/T.</p> <p>From the train’s perspective, the platform has length L’ = l + vt and the toy plane has speed u = l/t.</p> <p>So, to summarize:</p> <p>Observer | Time passed between events | Length of train | Speed of plane</p> <p>Platform | T | l’ = L - vT | w = L/T</p> <p>Train | t | L’ = l + vt | u = l/t</p> <p>Now, we again exploit symmetry and our Galilean principle. By symmetry,</p> <p>l’/l = L’/L = R</p> <p>Now, by the Galilean principle, R can only depend on v. If it depended on anything else, we could detect absolute motion. We might be tempted to just assume R is 1, but that wouldn’t be very careful of us.</p> <p>So what we do is this- we want to write a formula for w in terms of u, v, and R (which depends only on v). This will tell us how to relate a velocity measured in the train’s frame to a velocity measured in the platform’s frame.</p> <p>I’ll skip the algebra, but you can use the relations above to work this out for yourself:</p> <p>w = (u+v)/(1+(1-R^2)u/v) = f(u,v)</p> <p>Here f is just a name for the function on the right.</p> <p>I WILL EDIT MORE IN, POSTING NOW SO I DON’T LOSE THIS TYPED UP STUFF.</p> <h3 id="more-special-relativity-and-paths">More Special Relativity and Paths</h3> <p>This won’t make much sense if you haven’t read my last post on relativity. Math won’t render on the tumblr dash; instead, go to the blog.</p> <p>Last time, we worked out formulas for length contraction (and I asked you to work out a formula for time dilation). But what would be more generally useful is a formula relating events between the different frames of reference. Our thought experiment had two events:</p> <p>Event 1- the back end of the train, the back end of the platform, and the toy are all at the same place.</p> <p>Event 2- the front of the train, the front end of the platform, and the toy are all at the same place.</p> <p>From the toy’s frame of reference, these events occur at the same place, so there is only a time difference between the two events. We’ll call that difference (\Delta\tau). We’ll always use this to mean “the time between events that occur at the same position” (only in one frame will the events occur in the same place), and it’s called proper time.</p> <p>Now, the toy sees the platform move with speed -w, and sees the length of the platform as RL. So the time between the events in the toy’s frame is just distance/speed:</p> <p>[\Delta\tau^2 = R^2L^2/w^2 = (1-\frac{w^2}{c^2})L^2/w^2 ]</p> <p>Now, we can manipulate the right hand side by noting that, from the platform’s perspective, L/w is the time between the two events, and the two events are separated by a distance L. We’ll call the time between the events in the platform’s frame of reference (\Delta t), and the distance between the events, L, we’ll call (\Delta x) in general.</p> <p>[\Delta\tau^2 = (1-\frac{w^2}{c^2})L^2/w^2 = \Delta t^2 - \Delta x^2/c^2 ]</p> <p>Note that the speed w has dropped out of the final version of the equation. This would be true for any frame: since proper time is unique (every frame has a different time measurement, but only one measures the proper time), we have a frame-independent measurement.</p> <p>Now, let’s relate this back to the idea of paths that I’ve discussed previously. One advantage of the path approach to mechanics is that if we can create an action that is invariant under special relativity, then the mechanics we get is also invariant. So one way we might do this is by building the action out of the proper time (remember, S is the action).
Note the negative sign- without it there is no minimum only a maximum.</p> <p>[ S \propto -\int d\tau = -C\int d\tau ]</p> <p>Now C has to have units of energy for the action to have the right units.</p> <p>Now, some sketchy physics math</p> <p>[ S = -C\int \sqrt(dt^2 - dx^2/c^2) = -C\int dt \sqrt(1-\frac{dx^2}{dt^2}/c^2) ]</p> <p>[S = -C\int dt \sqrt(1-v^2/c^2) ]</p> <p>So one last step is to note that the approximation we can make for \sqrt(1-v^2/c^2), if v is much smaller than c which is (1-1/2v^2/c^2)</p> <p>So all together, for small v</p> <p>[S = C\int dt (\frac{v^2}{2c^2} - 1)]</p> <p>So if we pick the constant C to be mc^2, then we get</p> <p>[S = \int dt (1/2 mv^2 - mc^2)]</p> <p>We recognize the first term as just the kinetic energy we had before! The second term is just a constant and so won’t effect where the minimum is. This gives us a new understanding of our path rule for particles- particles take the path of maximum proper time (it’s this understanding of mechanics that translates most easily to general relativity)</p> <h3 id="special-relativity-and-free-will">Special relativity and free will</h3> <p>Imagine right now, while you are debating whether or not to post something on tumblr, some aliens in the andromeda galaxy are sitting around a conference table discussing andromeda stuff.</p> <p>So what is the “space time distance” between you right now (deciding what to tumblrize) and those aliens?</p> <p>Well, the distance between andromeda and us is something like 2.5 million light years. So thats a “space time distance” tau (using our formula from last time) of 2.5 million years. So far, so good:</p> <p>Now, imagine an alien, running late to the andromeda meeting, is running in. He is running at maybe 1 meter/second. We know that for him lengths will contract and time will dilate. So for him, time on earth is actually later- using</p> <p>(\Delta \tau^2 = \Delta t^2 - \Delta x^2/ c^2)</p> <p>and using our formula for length contraction, we can calculate that according to our runner in andromeda the current time on Earth is about 9 days later then today.</p> <p>So simultaneous to the committee sitting around on andromeda, you are just now deciding what to tumblrize. According to the runner, it’s 9 days later and you’ve already posted whatever you are thinking about + dozens of other things.</p> <p>So how much free will do you really have about what you post? (This argument is originally due to Rietdijk and Putnam).</p> <h3 id="we-are-doing-taylor-series-in-calculus-and-it-s-really-boring-what-would-you-add-from-physics">We are doing Taylor series in calculus and it's really boring. What would you add from physics?</h3> <p>First, sorry I didn’t get to this for so long.</p> <p>Anyway, there is a phenomenon in physics where almost everything is modeled as a spring (simple harmonic motion is everywhere!). You can see this in discussion of resonances. Wave motion can be understood as springs coupled together,etc, and lots of system exhibit waves- when you speak the tiny air perurbations travel out like waves, same as throwing a pebble in a pond, or wiggling a jump rope. These are all very different systems, so why the hell do we see such similar behavior?</p> <p>Why would this be? Well, think of a system in equilibrium, and nudging it a tiny bit away from equilibrium. 
If the equilibrium is at some parameter a and the nudge is small (so (x - a = \epsilon)), we want to know what the energy (E(a+\epsilon)) looks like.</p> <p>Now, we can Taylor expand- but note that in equilibrium the energy is at a minimum, so the linear term in the Taylor expansion is 0:</p> <p>[E(a+\epsilon)= E(a) + \frac{1}{2}\frac{d^2E}{dx^2} \epsilon^2 + … ]</p> <p>Now, constants in a potential energy don’t matter, so the first term that matters is the one quadratic in the displacement- which is exactly a spring.</p> <p>So Taylor series -&gt; everything is a spring.</p> <h3 id="why-field-theory">Why field theory?</h3> <p>So far, we’ve learned in earlier posts in my quantum category that:</p> <ol> <li><p>Classical theories can be described in terms of paths, with rules where particles take the path of “least action.”</p></li> <li><p>We can turn a classical theory into a quantum one by having the particle take every path, with the phase from each path given by the action along the path (divided by hbar).</p></li> </ol> <p>We’ve also learned that we can make a classical theory comply with special relativity by picking a relativistic action (in particular, an action proportional to the “proper time”).</p> <p>So one obvious way to try to make a special relativistic quantum theory would be to start with a special relativistic action and do the same sum over paths we used before.</p> <p>You can do this- and it almost works! If you do the mathematical transition from our original, non-relativistic paths to standard, textbook quantum mechanics, you’d find that you get the Schroedinger equation (or, if you were more sophisticated, something called the Pauli equation that no one talks about, but which is basically the Schroedinger equation + the fact that electrons have spin).</p> <p>If you try to do it from a relativistic action, you get an equation called the Klein-Gordon equation (or, if you were more sophisticated, the Dirac equation). Unfortunately, this runs into trouble- there can be weird negative probabilities, and general weirdness in the solutions.</p> <p>So we have done something wrong- and the answer is that making the action invariant under special relativity isn’t enough.</p> <p>Let’s look at some paths:</p> <p><img src="images/su3su2u1/image_011.jpg"></p> <p>The dotted line in this picture represents the light cone- the boundary traced out by light leaving the starting point. All of the paths end at a point inside the light cone, but some of them wander outside it along the way. This leads to really strange situations; let’s look at one outside-the-light-cone path from two frames of reference:</p> <p><img src="images/su3su2u1/image_012.jpg"></p> <p>What we see is that a normal-looking path in the first frame (on the left) looks really strange in the second- because the order of events isn’t fixed for events outside the light cone, some frames of reference see the path as moving back in time.</p> <p>So immediately we see the problem. When we switched to the relativistic theory we weren’t including all the paths- to really include all the paths, we need to include paths that also (apparently) move back in time. This is very strange! Notice that if we run time forward, the X’ observer sees, at some points along the path, two particles (one moving back in time, one moving forward).</p> <p>Feynman’s genius was to demonstrate that we can think of these particles moving backward in time as anti-particles moving forward in time.
So the x’ observer can describe what they see as a particle/anti-particle pair rather than as anything moving backwards in time.</p> <p>So really our path set looks like this:</p> <p><img src="images/su3su2u1/image_013.jpg"></p> <p>Notice that not only do we have paths connecting the two points, we also have totally unrelated closed loops that start and end at the same point, unconnected to either endpoint- these are possible now!</p> <p>So to calculate a probability, we can’t just look at the paths connecting the points (x_0) and (x_1)! There can be weird loopy paths that never touch (x_0) or (x_1) that still matter! From Feynman’s perspective, particle/anti-particle pairs can form, travel for a while, and annihilate later.</p> <p>So as a bookkeeping device we introduce a field- at every point in space it has a value. To calculate the action of the field we can’t just look at paths- instead, we have to sum up the values of the field (and some of its derivatives) at every point in space.</p> <p>So our old action was a sum over time alone (S is the action, L is the Lagrangian):</p> <p>[S = \int dt\, L ]</p> <p>Our new action has to be a sum over space and time:</p> <p>[S = \int dt\, d^3x\, \mathcal{L} ]</p> <p>So now our Lagrangian (\mathcal{L}) is a Lagrangian density.</p> <p>And we can’t just restrict ourselves to paths- we have to add up every possible configuration of the field.</p> <p>So that’s why we need field theory to combine relativity with quantum mechanics. Next time, some implications.</p> <h3 id="field-theory-implications">Field theory implications</h3> <p>The first thing is that if we take the Feynman interpretation, our field theory doesn’t have a fixed particle number- depending on the weird loops in a configuration, it could have an almost arbitrary number of particles. So one way to phrase the problem with not including the backwards paths is that we need to allow the particle number to fluctuate.</p> <p>Also, I know some of you are thinking “what are these fields?” Well- that’s not so strange. Think of the electromagnetic fields. If you have no charges around, what are the solutions for the electromagnetic field? They are just light waves. Remember that certain special paths were the most important for the sum over all paths? Similarly, certain field configurations are the most important for the sum over configurations. Those are the solutions to the classical field theory.</p> <p>So if we start with EM field theory, with no charges, then the most important solutions are photons (the light waves). So we can outline levels of approximation:</p> <p>Sum over all configurations -&gt; (semi-classical) photons that travel all paths -&gt; (fully classical) particles that travel just the classical path.</p> <p>Similarly, with any particle:</p> <p>Sum over all configurations -&gt; (semi-classical) particles that travel all paths -&gt; (fully classical) particles that travel just the classical path.</p> <p>This is why most quantum mechanics classes really only cover wave mechanics and don’t ever get fully quantum mechanical.</p> <h3 id="planck-length-time">Planck length/time</h3> <p>Answering somervta's question: what is the significance of Planck units?</p> <p>Let’s start with an easier case where we have some intuition- the simple hydrogen atom (the go-to quantum mechanics problem). But instead of doing physics, let’s just do dimensional analysis- how big do we expect hydrogen energies to be?</p> <p>Actually, let’s start with something even simpler- what sort of distances do we expect a hydrogen atom to have? How big should its radius be?</p> <p>Well, first- what physics is involved?
I model the hydrogen atom as an electron moving in an electric field, and I expect I’ll need quantum mechanics, so I’ll need hbar (planck’s constant), e, the charge of the electron, coulomb’s constant (call it k), and the mass of the electron. Can I turn these into a length?</p> <p>Let’s give it a try- k*e^2 is an energy times a length. hbar is an energy * a time, so if we divide we can get hbar/(k*e^2) which has units of time/length. Multiply in by another hbar, and we get hbar^2/(k*e^2), which has units of mass * length. So divide by the mass of the electron, and we get a quantity hbar^2/(m*k*e^2).</p> <p>This has units of length, so we might guess that the important length scale for the hydrogen atom is our quantity (this has a value of about 53 picometers, which is about the right scale for atomic hyrdogen).</p> <p>We could also estimate the energy of the hydrogen atom by noting that</p> <p>Energy ~ k*e^2/r and use our scale for r.</p> <p>Energy ~ m*k^2*e^4/(hbar^2) ~27 eV.</p> <p>This is about twice as large as the actual ground state, but its definitely the right order of magnitude.</p> <p>Now what Planck noticed is that if you ask “what are the length scales of quantum gravity?” You end up with the constants G, c, and hbar. Turns out, you can make a length scale out of that (sqrt (hbar*G/c^3) ) So just like with hydrogen, we expect that gives us a characteristic length for where quantum effects might start to matter for gravity (or gravity effects might matter for quantum mechanics).</p> <p>The planck energy and planck mass, then, are similarly characteristic mass and energy scales.</p> <p>It’s sort of “how small do my lengths have to be before quantum gravity might matter?” But it’s just a guess, really. Planck energy is the energy you’d need to probe that sort of length scale (higher energies probe smaller lengths),etc.</p> <p>Does that answer your question?</p> <h3 id="more-physics-answers">More Physics Answers</h3> <p>Answering bgaesop's question:</p> <blockquote> <p>How is the whole dark matter/dark energy thing not just proof that the theory of universal gravitation is wrong?</p> </blockquote> <p>So let’s start with dark energy- the first thing to note is that dark energy isn’t really new, as an idea it goes back to Einstein’s cosmological constant. When the cosmological implications of general relativity were first being understood, Einstein hated that it looked like the universe couldn’t be stable. BUT then he noticed that his field equations weren’t totally general- he could add a term, a constant. When Hubble first noticed that the universe was expanding Einstein dropped the constant, but in the fully general equation it was always there. There has never been a good argument why it should be zero (though some theories (like super symmetry) were introduced in part to force the constant to 0, back when everyone thought it was 0).</p> <p>Dark energy really just means that constant has a non-zero value. Now, we don’t know why it should be non-zero. That’s a responsibility for a deeper theory- as far as GR goes it’s just some constant in the equation.</p> <p>As for dark matter, that’s more complicated. The original observations were that you couldn’t make galactic rotation curves work out correctly with just the observable matter. So some people said “maybe there is a new type of non-interacting matter” and other people said “let’s modify gravity! 
Changing the theory a bit could fix the curves, and the scale is so big you might not notice the modifications to the theory.”</p> <p>So we have two competing theories, and we need a good way to tell them apart. Some clever scientists got the idea to look at two galaxies that collided- the idea was the normal matter would smash together and get stuck at the center of the collision, but the dark matter would pass right through. So you would see two big blobs of dark matter moving away from each other (you can infer their presence from the way the heavy matter bends light, gravitational lensing), and a clump of visible matter in between. In the bullet cluster, we see exactly that.</p> <p>Now, you can still try to modify gravitation to match the results, but the theories you get start to look pretty bizarre, and I don’t think any modified theory has worked successfully (though the dark matter interpretation is pretty natural).</p> <h3 id="in-the-standard-model-what-are-the-fundamental-beables-things-that-exist-and-what-are-kinds-of-properties-do-they-have-that-is-not-how-much-mass-do-they-have-but-they-have-mass">In the standard model, what are the fundamental &quot;beables&quot; (things that exist) and what are kinds of properties do they have (that is, not &quot;how much mass do they have&quot; but &quot;they have mass&quot;)?</h3> <p>So this one is pretty tough, because I don’t think we know for sure exactly what the “beables” are (assuming you are using beable like Bell’s term).</p> <p>The issue is that field theory is formulated in terms of potentials- the fields that enter into the action are the electromagnetic potential, not the electromagnetic field. In classical electromagnetic theory, we might say the electromagnetic field is a beable (Bell’s example), but the potential is not.</p> <p>But in field theory we calculate everything in terms of potentials- and we consider certain states of the potential to be “photons.”</p> <p>At the electron level, we have a field configuration that is more general than the wavefunction - different configurations represent different combinations of wavefunctions (one configuration might represent a certain 3 particle wavefunction, another might represent a single particle wavefunction,etc).</p> <p>In Bohm type theories, the beables are the actual particle positions, and we could do something like that for field theory- assume the fields are just book keeping devices. This runs into problems though, because field configurations that don’t look much like particles are possible, and can have an impact on your theory. So you want to give some reality to the fields.</p> <p>Another issue is that the field configurations themselves aren’t unique- symmetries relate different field configurations so that very different configurations imply the same physical state.</p> <p>A lot of this goes back to the fact that we don’t have a realistic axiomatic field theory yet.</p> <p>But for concreteness sake, assume the fields are “real,” then you have fermion fields, which have a spin of 1/2, an electro-weak charge, a strong charge, and a coupling to the higgs field. 
These represent right or left handed electrons,muons,neutrinos,etc.</p> <p>You have gauge-fields (strong field, electro-weak field), these represent your force carrying boson (photons, W,Z bosons, gluons).</p> <p>And you have a Higgs field, which has a coupling to the electroweak field, and it has the property of being non-zero everywhere in space, and that constant value is called its vacuum expectation value.</p> <h3 id="what-s-the-straight-dope-on-dark-matter-candidates">What's the straight dope on dark matter candidates?</h3> <p>So, first off there are two types of potential dark matter. Hot dark matter, and cold dark matter. One obvious form of dark matter would be neutrinos- they only interact weakly and we know they exist! So this seems very obvious and promising until you work it out. Because neutrinos are so light (near massless), most of them will be traveling at very near the speed of light. This is “hot” dark matter and it doesn’t have the right properties.</p> <p>So what we really want is cold dark matter. I think astronomers have some ideas for normal baryonic dark matter (brown dwarfs or something). I don’t know as much about those.</p> <p>Particle physicists instead like to talk about what we call thermal relics. Way back in the early universe, when things were dense and hot, particles would be interconverting between various types (electron-positrons turning into quarks, turning into whatever). As the universe cooled, at some point the electro-weak force would split into the weak and electric force, and some of the weak particles would “freeze out.” We can calculate this and it turns out the density of hypothetical “weak force freeze out” particles would be really close to the density of dark matter. These are called thermal relics. So what we want are particles that interact via the weak force (so the thermal relics have the right density) and are heavier than neutrinos (so they aren’t too hot).</p> <h4 id="from-susy">From SUSY</h4> <p>It turns out it’s basically way too easy to create these sorts of models. There are lots of different super-symmetry models but all of them produce heavy “super partners” for every existing particle. So one thing you can do is assume super symmetry and then add one additional symmetry (they usually pick R-parity) the goal of the additional symmetry is to keep the lightest super partner from decaying. So usually the lightest partner is related to the weak force (generally its a partner to some combination of the Higgs, the Z bosons, and the photons. Since these all have the same quantum numbers they mix into different mass states). These are called neutralinos. Because they are superpartners to weakly interacting particles they will be weakly interacting, and they were forced to be stable by R parity. So BAM, dark matter candidate.</p> <p>Of course, we’ve never seen any super-partners,so…</p> <h4 id="from-guts">From GUTs</h4> <p>Other dark matter candidates can come from grand unified theories. The standard model is a bit strange- the Higgs field ties together two different particles to make the fermions (left handed electron + right handed electron, etc). The exception to this rule are neutrinos. Only left handed neutrinos exist, and their mass is Majorana.</p> <p>But some people have noticed that if you add a right handed neutrino, you can do some interesting things- the first is that with a right handed neutrino in every generation you can embed each generation very cleanly in SO(10). 
Without the extra neutrino, you can embed each generation in SU(5), but it’s a bit uglier. The SO(10) embedding has the added advantage that SO groups generally don’t have gauge anomalies.</p> <p>The other thing is that if this extra neutrino is heavy, you can explain why the other fermion masses are so light via a see-saw mechanism.</p> <p>Now, SO(10) predicts that this right handed neutrino doesn’t interact via the standard model forces, but because the gauge group is larger, we have a lot more forces/bosons from the broken GUT. These extra bosons almost always lead to trouble with proton decay, so you have to figure out some way to arrange things so that protons are stable, but you can still make enough sterile neutrinos in the early universe to account for dark matter. I think there is enough freedom to make this mostly work, although the newer LHC constraints probably make that a bit tougher.</p> <p>Obviously we’ve not seen any of the additional bosons of the GUT, or proton decay, etc.</p> <h4 id="from-axions">From Axions</h4> <p>(Note: the method of axion production is a bit different from that of the other thermal relics.)</p> <p>There is a genuine puzzle in the standard model’s QCD/SU(3) gauge theory. When the theory was first designed, physicists used the most general Lagrangian consistent with CP symmetry. But the weak force violates CP, so CP is clearly not a good symmetry. Why, then, don’t we need to include the CP violating term in QCD?</p> <p>So Peccei and Quinn were like “huh, maybe the term should be there, but look, we can add a new field that couples to the CP violating term, and then add some symmetries to force the field to near 0.” That would be fine, but the symmetry would have an associated goldstone boson, and we’d have spotted a massless particle.</p> <p>So you promote the global Peccei-Quinn symmetry to a gauge symmetry, and then the goldstone boson becomes massive, and you’ve saved the day. But you’ve got this leftover massive “axion” particle. So BAM, dark matter candidate.</p> <p>Like all the other dark matter candidates, this has problems. There are instanton solutions to QCD, and those would break the Peccei-Quinn symmetry. Try to fix it and you ruin the gauge symmetry (and so you’re back to a global symmetry and a massless, ruled-out axion). So it’s not an exact symmetry, and things get a little strained.</p> <p>So these are the large families I can think of offhand. You can combine the different ones (SUSY SU(5) GUT particles, etc).</p> <p>I realize this will be very hard to follow without much background, so if other people are interested, ask specific questions and I can try to clean up the specifics.</p> <p>Also, I have a gauge theory post for my quantum sequence that will be going up soon.</p> <h3 id="if-your-results-are-highly-counterintuitive">If your results are highly counterintuitive...</h3> <p>They are almost certainly wrong.</p> <p>Once, when I was a young, naive data scientist, I embarked on a project to look at individual claims handlers and how effective they were. How many claims did they manage to settle below the expected cost? How many claims were properly reserved? Basically, how well was risk managed?</p> <p>And I discovered something amazing! Several of the most junior people in the department were fantastic, nearly perfect on all metrics. Several of the most senior people had performance all over the map. They were significantly below average on most metrics! Most of the claims money was spent on these underperformers!
Big data had proven that the way a whole department of a company operated was lunacy!</p> <p>Not so fast. Anyone with any insurance experience (or half a brain, or less of an arrogant physics-is-the-best mentality) would have realized something right away- the kinds of claims handled by junior people are going to be different. Everything that a manager thought could be handled easily by someone fresh to the business went to the new guys. Simple cases, no headaches, assess the cost, pay the cost, done.</p> <p>Cases with lots of complications (maybe uncertain liability, weird accidents, etc.) went to the senior people. Of course their outcomes looked worse- more variance per claim makes the risk much harder to manage. I was the idiot here, misinterpreting my own results!</p> <p>A second example occurred at a health insurance company, where an employee I supervised thought he’d upended medicine when he discovered that a standard-of-care chemo regimen led to worse outcomes than a much less common/”lighter” alternative. Having learned from my first experience, I dug into the data with him, and we found out that the only cases where the less common alternative was used were cases where the cancer had been caught early and surgically removed while it was still localized.</p> <p>Since this experience, I’ve talked to startups looking to hire me, and startups looking for investment (and sometimes big-data companies looking to be hired by companies I work for), and I see this mistake over and over. “Look at this amazing counterintuitive big data result!”</p> <p>The latest was in a trade magazine, where some new company claimed that a strip-mall lawyer with 22 wins against some judge was necessarily better than a white-shoe law firm that won less often against the same judge. (Although in most companies I have worked for, if a case even got to trial, something had already gone wrong- everyone pushes for settlement. So judging by trial win record is silly for a second reason).</p> <h3 id="locality-fields-and-the-crown-jewel-of-modern-physics">Locality, fields and the crown jewel of modern physics</h3> <p>Apologies, this post is not finished. I will edit to replace the to-be-continued section soon.</p> <p>Last time, we talked about the need for a field theory, associating a mathematical field with every point in space. Today, we are going to talk about what our fields might look like. And we’ll find something surprising!</p> <p>I also want to emphasize locality, so to do that, let’s consider our spacetime as a lattice instead of the usual continuous space.</p> <p><img src="images/su3su2u1/image_014.jpg"></p> <p>So that is a lattice.
Now imagine that it’s 4 dimensional instead of 2 dimensional.</p> <p>Now, a field configuration involves putting one of our phasors at every point in space.</p> <p>So here is a field configuration:</p> <p><img src="images/su3su2u1/image_015.jpg"></p> <p>To make our action local (and thus consistent with special relativity) we insist that the action at one lattice point only depends on the field at that point, and on the fields of the neighboring points.</p> <p>We also need to make sure we keep the symmetry we know from earlier posts- we know that the amplitude of the phasor is what matters, and we have the symmetry to change the phase angle.</p> <p><img src="images/su3su2u1/image_016.jpg"></p> <p>Neighbors of the central point, indicated by dotted lines.</p> <p>We can compare neighboring points by subtracting (taking a derivative).</p> <p><img src="images/su3su2u1/image_017.jpg"></p> <p>Sorry that is blurry. . Middle phasor - left phasor = some other phasor.</p> <p>And the last thing we need to capture is the symmetry-remember that the angle of our phasor didn’t matter for predictions- the probabilities are all related to amplitudes (the length of the phasor). The simplest way to do this is to insist that we adjust the angle of all the phasors in the field, everywhere:</p> <p><img src="images/su3su2u1/image_018.jpg"></p> <p>Sorry for the shadow of my hand</p> <p>Anyway, this image shows a transformation of all the phasors. This works, but it seems weird- consider a configuration like this:</p> <p><img src="images/su3su2u1/image_019.jpg"></p> <p>This is two separate localized field configurations- we might interpret this as two particles. But should we really have to adjust the phase angle of the all the fields over by the right particle if we are doing experiments only on the left particle?</p> <p>Maybe what we really want is a local symmetry. A symmetry where we can rotate the phase angle of a phasor at any point individually (and all of them differently, if we like).</p> <p><em>To Be Continued</em></p> Sampling v. tracing perf-tracing/ Sun, 24 Jan 2016 00:00:00 +0000 perf-tracing/ <p>Perf is probably the most widely used general purpose performance debugging tool on Linux. There are multiple contenders for the #2 spot, and, like perf, they're sampling profilers. Sampling profilers are great. They tend to be easy-to-use and low-overhead compared to most alternatives. However, there are large classes of performance problems sampling profilers can't debug effectively, and those problems are becoming more important.</p> <p>For example, consider a Google search query. Below, we have a diagram of how a query is carried out. Each of the black boxes is a rack of machines and each line shows a remote procedure call (RPC) from one machine to another.</p> <p></p> <p><img src="images/intel-cat/search_query.png" width="640" height="352"></p> <p>The diagram shows a single search query coming in, which issues RPCs to over a hundred machines (shown in green), each of which delivers another set of requests to the next, lower level (shown in blue). Each request at that lower level also issues a set of RPCs, which aren't shown because there's too much going on to effectively visualize. At that last leaf level, the machines do 1ms-2ms of work, and respond with the result, which gets propagated and merged on the way back, until the search result is assembled. While that's happening, on any leaf machine, 20-100 other search queries will touch the same machine. 
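</p> <p>(To see why this fan-out makes the tail matter so much, here is a hedged back-of-envelope sketch; the 1% figure is an assumption for illustration, not a measured number. If each leaf independently has a 1-in-100 chance of responding slowly, a query that waits on all of its leaves is nearly guaranteed to hit at least one slow response once the fan-out reaches the hundreds.)</p> <pre><code># chance that a query fanning out to n leaves waits on at least one
# slower-than-99th-percentile leaf response (assuming independent leaves)
for n in (1, 10, 100, 1000, 2000):
    p_at_least_one_slow = 1 - 0.99 ** n
    print(f'{n:5d} leaves: {p_at_least_one_slow:7.2%} of queries see a slow leaf')
</code></pre> <p>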
A single query might touch a couple thousand machines to get its results. If we look at the latency distribution for RPCs, we'd expect that with that many RPCs, any particular query will see a 99%-ile worst case (tail) latency; <a href="http://research.google.com/pubs/pub40801.html">and much worse than mere 99%-ile, actually</a>.</p> <p>That latency translates directly into money. It's now well established that adding user latency reduces ad clicks, reduces the odds that a user will complete a transaction and buy something, reduces the odds that a user will come back later and become a repeat customer, etc. Over the past ten to fifteen years, the understanding that tail latency is an important factor in determining user latency, and that user latency translates directly to money, has trickled out from large companies like Google into the general consciousness. But debugging tools haven't kept up.</p> <p>Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average.</p> <p>For more on this, let's look at <a href="https://www.youtube.com/watch?v=QBu2Ae8-8LM">this wide ranging Dick Sites talk</a><sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup> which covers, among other things, <a href="http://www.pdl.cmu.edu/SDI/2015/slides/DatacenterComputers.pdf">the performance tracing framework that Dick and others have created at Google</a>. By capturing “every” event that happens, it lets us easily debug performance oddities that would otherwise be difficult to track down. We'll take a look at three different bugs to get an idea about the kinds of problems Google's tracing framework is useful for.</p> <p>First, we can look at another view of the search query we just saw above: given a top-level query that issues some number of RPCs, how long does it take to get responses?</p> <p><img src="images/perf-tracing/1_93.png" width="596" height="370"></p> <p>Time goes from left to right. Each row is one RPC, with the blue bar showing when the RPC was issued and when it finished. We can see that the first RPC is issued and returns before 93 other RPCs go out. When the last of those 93 RPCs is done, the search result is returned. We can see that two of the RPCs take substantially longer than the rest; the slowest RPC gates the result of the search query.</p> <p>To debug this problem, we want a couple things. Because the vast majority of RPCs in a slow query are normal, and only a couple are slow, we need something that does more than just show aggregates, like a sampling profiler would. We need something that will show us specifically what's going on in the slow RPCs. Furthermore, because weird performance events may be hard to reproduce, we want something that's cheap enough that we can run it all the time, allowing us to look at any particular case of bad performance in retrospect. In the talk, Dick Sites mentions having a budget of about 1% of CPU for the tracing framework they have.</p> <p>In addition, we want a tool that has time-granularity that's much shorter than the granularity of the thing we're debugging. Sampling profilers typically run at something like 1 kHz (1 ms between samples), which gives little insight into what happens in a one-time event, like an slow RPC that still executes in under 1ms. 
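</p> <p>(As a back-of-envelope sketch of the resolution problem, with illustrative numbers of my own: the expected number of samples that land inside a short, one-off event is just the sampling rate times the event duration, and at 1 kHz that expectation is below one sample for anything shorter than a millisecond.)</p> <pre><code># expected number of profiler samples that land inside a one-off event
event_us = 500                            # a 500 microsecond slow section (made up)
for rate_hz in (1_000, 100_000, 10_000_000):
    expected = rate_hz * event_us / 1_000_000
    print(f'{rate_hz:11,} Hz sampling: ~{expected:g} samples inside the event')
</code></pre> <p>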
There are tools that will display what looks like a trace from the output of a sampling profiler, but the resolution is so poor that these tools provide no insight into most performance problems. While it's possible to crank up the sampling rate on something like perf, you can't get as much resolution as we need for the problems we're going to look at.</p> <p>Getting back to the framework, to debug something like this, we might want to look at a much more zoomed in view. Here's an example with not much going on (just tcpdump and some packet processing with <a href="http://linux.die.net/man/2/recvmsg">recvmsg</a>), just to illustrate what we can see when we zoom in.</p> <p><img src="images/perf-tracing/tcpdump.png" width="756" height="369"></p> <p>The horizontal axis is time, and each row shows what a CPU is executing. The different colors indicate that different things are running. The really tall slices are kernel mode execution, the thin black line is the idle process, and the medium height slices are user mode execution. We can see that CPU0 is mostly handling incoming network traffic in a user mode process, with 18 switches into kernel mode. CPU1 is maybe half idle, with a lot of jumps into kernel mode, doing interrupt processing for tcpdump. CPU2 is almost totally idle, except for a brief chunk when a timer interrupt fires.</p> <p>What's happening is that every time a packet comes in, an interrupt is triggered to notify tcpdump about the packet. The packet is then delivered to the process that called <code>recvmsg</code> on CPU0. Note that running tcpdump isn't cheap, and it actually consumes 7% of a server if you turn it on when the server is running at full load. This only dumps network traffic, and it's already at 7x the budget we have for tracing everything! If we were to look at this in detail, we'd see that Linux's TCP/IP stack has a large instruction footprint, and workloads like tcpdump will consistently come in and wipe that out of the l1i and l2 caches.</p> <p>Anyway, now that we've seen a simple example of what it looks like when we zoom in on a trace, let's look at how we can debug the slow RPC we were looking at before.</p> <p><img src="images/perf-tracing/CPU_RPC_view.png" width="832" height="534"></p> <p>We have two views of a trace of one machine here. At the top, there's one row per CPU, and at the bottom there's one row per RPC. Looking at the top set, we can see that there are some bits where individual CPUs are idle, but that the CPUs are mostly quite busy. Looking at the bottom set, we can see parts of 40 different searches, most of which take around 50us, with the exception of a few that take much longer, like the one pinned between the red arrows.</p> <p><img src="images/perf-tracing/LOCK_THREAD_view.png" width="816" height="552"></p> <p>We can also look at a trace of the same timeframe by which locks are behind held and which threads are executing. The arcs between the threads and the locks show when a particular thread is blocked, waiting on a particular lock. If we look at this, we can see that the time spent waiting for locks is sometimes much longer than the time spent actually executing anything. The thread pinned between the arrows is the same thread that's executing that slow RPC. It's a little hard to see what's going on here, so let's focus on that single slow RPC.</p> <p><img src="images/perf-tracing/CPU_RPC_zoom.png" width="856" height="538"></p> <p>We can see that this RPC spends very little time executing and a lot of time waiting. 
We can also see that we'd have a pretty hard time trying to find the cause of the waiting with traditional performance measurement tools. <a href="http://stackoverflow.com/questions/24021967/finding-performance-issue-that-may-be-due-to-thread-locking-possibly">According to stackoverflow, you should use a sampling profiler</a>! But tools like OProfile are useless since they'll only tell us what's going on when our RPC is actively executing. What we really care about is what our thread is blocked on and why.</p> <p>Instead of following the advice from stackoverflow, let's look at the second view of this again.</p> <p><img src="images/perf-tracing/LOCK_THREAD_iso.png" width="802" height="536"></p> <p>We can see that, not only is this RPC spending most of its time waiting for locks, it's actually spending most of its time waiting for the same lock, with only a short chunk of execution time between the waiting. With this, we can look at the cause of the long wait for a lock. Additionally, if we zoom in on the period between waiting for the two locks, we can see something curious.</p> <p><img src="images/perf-tracing/super_zoom.png" width="705" height="438"></p> <p>It takes 50us for the thread to start executing after it gets scheduled. Note that the wait time is substantially longer than the execution time. The waiting is because <a href="http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/">an affinity policy was set which will cause the scheduler to try to schedule the thread back to the same core</a> so that any data that's in the core's cache will still be there, giving you the best possible cache locality, which means that the thread will have to wait until the previously scheduled thread finishes. That makes intuitive sense, but if you consider, for example, a <a href="http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake_NewMemLat.txt">2.2GHz Skylake</a>, the latencies to the l2 and l3 caches are 6.4ns and 21.2ns, respectively. Is it worth changing the affinity policy to speed this kind of thing up? You can't tell from this single trace, but with the tracing framework used to generate this data, you could do the math to figure out if you should change the policy.</p> <p>In the talk, Dick notes that, given the actual working set size, it would be worth waiting up to 10us to schedule on another CPU sharing the same l2 cache, and 100us to schedule on another CPU sharing the same l3 cache<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">2</a></sup>.</p> <p>Something else you can observe from this trace is that, if you care about a workload that resembles Google search, basically every standard benchmark out there is bad, and the standard technique of running <a href="//danluu.com/intel-cat/#spec">N copies of spec</a> is terrible. That's not a straw man. People still do that in academic papers today, and some chip companies use SPEC to benchmark their mobile devices!</p> <p>Anyway, that was one performance issue where we were able to see what was going on because of the ability to see a number of different things at the same time (CPU scheduling, thread scheduling, and locks). Let's look at a simpler single-threaded example on a single machine where a tracing framework is still beneficial:</p> <p><img src="images/perf-tracing/gmail.png" width="906" height="515"></p> <p>This is a trace from gmail, circa 2004. Each row shows the processing that it takes to handle one email.
Well, except for the last 5 rows; the last email shown takes so long to process that displaying all of the processing takes 5 rows of space. If we look at each of the normal emails, they all look approximately the same in terms of what colors (i.e., what functions) are called and how much time they take. The last one is different. It starts the same as all the others, but then all this other junk appears that only happens in the slow email.</p> <p>The email itself isn't the problem -- all of that extra junk is the processing that's done to reindex the words from the emails that had just come in, which was batched up across multiple emails. This picture caused the Gmail devs to move that batch work to another thread, reducing tail latency from 1800ms to 100ms. This is another performance bug that it would be very difficult to track down with standard profiling tools. I've often wondered why email almost always appears quickly when I send to gmail from gmail, and it sometimes takes minutes when I send work email from outlook to outlook. My guess is that a major cause is that it's much harder for the outlook devs to track down tail latency bugs like this than it is for the gmail devs to do the same thing.</p> <p>Let's look at one last performance bug before moving on to discussing what kind of visibility we need to track these down. This is a bit of a spoiler, but with this bug, it's going to be critical to see what the entire machine is doing at any given time.</p> <p><a name="histogram"></a></p> <p><img src="images/perf-tracing/disk_tail.png" width="765" height="501"></p> <p>This is a histogram of disk latencies on storage machines for a 64kB read, in ms. There are two sets of peaks in this graph. The ones that make sense, on the left in blue, and the ones that don't, on the right in red.</p> <p>Going from left to right on the peaks that make sense, first there's the peak at 0ms for things that are cached in RAM. Next, there's a peak at 3ms. That's way too fast for the 7200rpm disks we have to transfer 64kB; the time to get a random point under the head is already <code>(1/(7200/60)) / 2 s = 4ms</code>. That must be the time it takes to transfer something from the disk's cache over PCIe. The next peak, at near 25ms, is the time it takes to seek to a point and then read 64kB off the disk.</p> <p>Those numbers don't look so bad, but the 99%-ile latency is a whopping 696ms, and there are peaks at 250ms, 500ms, 750ms, 1000ms, etc. And these are all unreproducible -- if you go back and read a slow block again, or even replay the same sequence of reads, the slow reads are (usually) fast. That's weird! What could possibly cause delays that long? In the talk, Dick Sites says “each of you think of a guess, and you'll find you're all wrong”.</p> <p><img src="images/perf-tracing/13disks.png" width="852" height="150"></p> <p>That's a trace of thirteen disks in a machine. The blue blocks are reads, and the red blocks are writes. The black lines show the time from the initiation of a transaction by the CPU until the transaction is completed. There are some black lines without blocks because some of the transactions hit in a cache and don't require actual disk activity. 
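</p> <p>(The rotational-latency arithmetic quoted above is worth writing out; here is a trivial sketch of it, using only the nominal 7200rpm figure.)</p> <pre><code># the "3 ms is too fast for a real disk read" arithmetic from above
rpm = 7200
ms_per_revolution = 60_000 / rpm           # 8.3 ms for one full revolution
avg_rotational_ms = ms_per_revolution / 2  # ~4.2 ms for a random sector to reach the head
print(f'{ms_per_revolution:.1f} ms per revolution, '
      f'~{avg_rotational_ms:.1f} ms average rotational delay alone')
</code></pre> <p>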
If we wait for a period where we can see tail latency and zoom in a bit, we'll see this:</p> <p><img src="images/perf-tracing/13disks_tail.png" width="902" height="319"></p> <p>We can see that there's a period where things are normal, and then some kind of phase transition into a period where there are 250ms gaps (4) between periods of disk activity (5) on the machine <em>for all disks</em>. This goes on for nine minutes. And then there's a phase transition and disk latencies go back to normal. That it's machine wide and not disk specific is a huge clue.</p> <p>Using that information, Dick pinged various folks about what could possibly cause periodic delays that are a multiple of 250ms on an entire machine, and found out that the cause was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren't too many requests in a quarter second interval and the kernel stops throttling the threads.</p> <p>After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years<sup class="footnote-ref" id="fnref:Y"><a rel="footnote" href="#fn:Y">3</a></sup>. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn't something you can see in a profile.</p> <p>One question you might have is, is this because of some flaw in existing profilers, or can profilers provide enough information that you don't need to use tracing tools to track down rare, long-tail, performance bugs? I've been talking to Xi Yang about this, who had an ISCA 2015 <a href="http://research.microsoft.com/pubs/244803/preprint.pdf">paper</a> and <a href="https://github.com/yangxi/papers/blob/master/SHIM-ISCA-slides.pptx">talk</a> describing some of his work. He and his collaborators have done a lot more since publishing the paper, but the paper still contains great information on how far a profiling tool can be pushed. As Xi explains in his talk, one of the fundamental limits of a sampling profiler is how often you can sample.</p> <p><img src="images/perf-tracing/perf_sample_1k_100k_10m.png" width="616" height="424"></p> <p>This is a graph of the number of the number of executed instructions per clock (IPC) over time in <a href="https://en.wikipedia.org/wiki/Lucene">Lucene</a>, which is the core of Elasticsearch.</p> <p>At 1kHz, which is the default sampling interval for perf, you basically can't see that anything changes over time at all. At 100kHz, which is as fast as perf runs, you can tell something is going on, but not what. The 10MHz graph is labeled SHIM because that's the name of the tool presented in the paper. 
<p>If we look at the IPC in different methods, we can see that we're losing a lot of information at the slower sampling rates:</p> <p><img src="images/perf-tracing/perf_top_10_methods.png" width="813" height="372"></p> <p>These are the top 10 hottest methods in Lucene, ranked by execution time; these 10 methods account for 74% of the total execution time. With perf, it's hard to tell which methods have low IPC, i.e., which methods are spending time stalled. But with SHIM, we can clearly see that there's one method that spends a lot of time waiting, #4.</p> <p>In retrospect, there's nothing surprising about these graphs. We know from the Nyquist theorem that, <a href="https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem">to observe a signal with some frequency, X, we have to sample with a rate at least 2X</a>. There are a lot of performance-relevant events that happen at frequencies higher than 1kHz (e.g., <a href="https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states">CPU p-state changes</a>), so we should expect that we're unable to directly observe a lot of things that affect performance with perf or other traditional sampling profilers. If we care about microbenchmarks, we can get around this by sampling the same thing over and over again, but for rare or one-off events, it may be hard or impossible to do that.</p> <p>This raises a few questions:</p> <ol> <li>Why does perf sample so infrequently?</li> <li>How does SHIM get around the limitations of perf?</li> <li>Why are sampling profilers dominant?</li> </ol> <h4 id="1-why-does-perf-sample-so-infrequently">1. Why does perf sample so infrequently?</h4> <p>This comment from <a href="https://github.com/torvalds/linux/blob/7b648018f628eee73450b71dc68ebb3c3865465e/kernel/events/core.c#L274">events/core.c in the Linux kernel</a> explains the limit:</p> <blockquote> <p>perf samples are done in some very critical code paths (<a href="https://en.wikipedia.org/wiki/Non-maskable_interrupt">NMIs</a>). If they get too much CPU time, the system can lock up and not get any real work done.</p> </blockquote> <p>As we saw from the tcpdump trace in the Dick Sites talk, interrupts take a significant amount of time to get processed, which limits the rate at which you can sample with an interrupt-based sampling mechanism.</p> <h4 id="2-how-does-shim-get-around-the-limitations-of-perf">2. How does SHIM get around the limitations of perf?</h4> <p>Instead of having an interrupt come in periodically, like perf, SHIM instruments the runtime so that it periodically runs a code snippet that can squirrel away relevant information. In particular, the authors instrumented the Jikes RVM, which injects yield points into every method prologue, method epilogue, and loop back edge. At a high level, injecting a code snippet into every function prologue and epilogue sounds similar to what Dick Sites describes in his talk.</p>
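<p>To make that concrete, here's an illustrative sketch of the general idea; this isn't SHIM's actual implementation (SHIM does considerably more, and the names here are made up), but it shows how measurement code can run inline at yield points instead of in an interrupt handler:</p> <pre><code>/* Sketch only: the runtime/compiler inserts a call like this at method
 * prologues, epilogues, and loop back edges. Assumes x86-64 and gcc/clang. */
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;x86intrin.h&gt;                   /* __rdtsc() */

#define SAMPLE_CAP (1u &lt;&lt; 16)

struct sample { uint64_t tsc; uint32_t method_id; };

static __thread struct sample *samples;  /* per-thread buffer */
static __thread uint32_t n_samples;

void profiler_thread_init(void)
{
    samples = malloc(SAMPLE_CAP * sizeof *samples);
    n_samples = 0;
}

static inline void yieldpoint_probe(uint32_t method_id)
{
    /* A real system would also read hardware performance counters and
     * rate-limit how often it records; this just grabs a timestamp and
     * the id of the method we're in. */
    if (samples &amp;&amp; n_samples &lt; SAMPLE_CAP) {
        samples[n_samples].tsc = __rdtsc();
        samples[n_samples].method_id = method_id;
        n_samples++;
    }
}
</code></pre>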
<p>The details are different, and I recommend both watching the Dick Sites talk and reading the Yang et al. paper if you're interested in performance measurement, but the fundamental similarity is that both of them decided that it's <a href="//danluu.com/new-cpu-features/#context-switches-syscalls">too expensive to have another thread break in and sample periodically</a>, so they both ended up injecting some kind of tracing code into the normal execution stream.</p> <p>It's worth noting that sampling, at any frequency, is going to miss waiting on (for example) software locks. Dick Sites's recommendation for this is to timestamp based on wall clock (not CPU clock), and then try to find the underlying causes of unusually long waits.</p> <h4 id="3-why-are-sampling-profilers-dominant">3. Why are sampling profilers dominant?</h4> <p>We've seen that Google's tracing framework allows us to debug performance problems that we'd never be able to catch with traditional sampling profilers, while also collecting the data that sampling profilers collect. From the outside, SHIM looks like a high-frequency sampling profiler, but it achieves that by acting like a tracing tool. Even perf is getting support for low-overhead tracing. Intel added <a href="https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing">hardware support for certain types of tracing in Broadwell and Skylake</a>, along with kernel support in 4.1 (with user mode support for perf coming in 4.3). If you're wondering how much overhead these tools have, Andi Kleen claims that the Intel tracing support in Linux has about a 5% overhead, and Dick Sites mentions in the talk that they have a budget of about 1% overhead.</p> <p>It's clear that state-of-the-art profilers are going to look a lot like tracing tools in the future, but if we look at the state of things today, the easiest options are all classical profilers. You can fire up a profiler like perf and it will tell you approximately how much time various methods are taking. With other basic tooling, you can tell what's consuming memory. Between those two numbers, you can solve the majority of performance issues. Building out something like Google's performance tracing framework is non-trivial, and cobbling together existing publicly available tools to trace performance problems is a rough experience. You can see one example of this when <a href="https://blog.cloudflare.com/the-story-of-one-latency-spike/">Marek Majkowski debugged a tail latency issue using SystemTap</a>.</p> <p>In <a href="http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html">Brendan Gregg's page on Linux tracers</a>, he says “[perf_events] can do many things, but if I had to recommend you learn just one [tool], it would be CPU profiling”. Tracing tools are cumbersome enough that his top recommendation on his page about tracing tools is to learn a profiling tool!</p> <h3 id="now-what">Now what?</h3> <p>If you want to use a tracing tool like the one we looked at today, your options are:</p> <ol> <li>Get a job at <a href="//danluu.com/startup-tradeoffs/">Google</a></li> <li>Build it yourself</li> <li>Cobble together what you need out of existing tools</li> </ol> <h4 id="1-get-a-job-at-google-danluu-com-startup-tradeoffs">1. Get a job at <a href="//danluu.com/startup-tradeoffs/">Google</a></h4> <p>I hear <a href="http://steve-yegge.blogspot.com/2008/03/get-that-job-at-google.html">Steve Yegge has good advice on how to do this</a>. If you go this route, try to attend orientation in Mountain View. They have the best orientation.</p> <h4 id="2-build-it-yourself">2. 
Build it yourself</h4> <p>If you look at <a href="http://research.microsoft.com/pubs/244803/preprint.pdf">the SHIM paper</a>, there's a lot of cleverness built in to get really fine-grained information while minimizing overhead. I think their approach is really neat, but considering the current state of things, you can get a pretty substantial improvement without much cleverness. Fundamentally, all you really need is some way to inject your tracing code at the appropriate points, some number of bits for a timestamp, plus a handful of bits to store the event.</p> <p>Say you want to trace transitions between user mode and kernel mode. The transitions between waiting and running will tell you what the thread was waiting on (e.g., disk, timer, IPI, etc.). There are maybe 200k transitions per second per core on a busy node. 200k events with a 1% overhead is 50ns per event per core. A cache miss is well over 100 cycles, so our budget is less than one cache miss per event, meaning that each record must fit within a fraction of a cache line. If we have 20 bits of timestamp (<code>RDTSC &gt;&gt; 8 bits</code>, giving ~100ns resolution and 100ms range) and 12 bits of event, that's 4 bytes, or 16 events per cache line. Each core has to have its own buffer to avoid cache contention. To map <code>RDTSC</code> times back to wall clock times, calling <code>gettimeofday</code> along with <code>RDTSC</code> at least every 100ms is sufficient.</p> <p>Now, say the machine is serving 2000 QPS. That's 20 99%-ile tail events per second and 2 99.9% tail events per second. Since those events are, by definition, unusually long, Dick Sites recommends a window of 30s to 120s to catch those events. If we have 4 bytes per event * 200k events per second * 40 cores, that's about 32MB/s of data. Writing to disk while we're logging is hopeless, so you'll want to store the entire log in memory while tracing, which will be in the range of 1GB to 4GB. That's probably fine for a typical machine in a datacenter, which will have between 128GB and 256GB of RAM. (There's a rough sketch of what such an event record might look like in C below.)</p> <p>My not-so-secret secret hope for this post is that someone will take this idea and implement it. That's already happened with at least one blog post idea I've thrown out there, and this seems at least as valuable.</p> <h4 id="3-cobble-together-what-you-need-out-of-existing-tools">3. Cobble together what you need out of existing tools</h4> <p>If you don't have a magical framework that solves all your problems, the tool you want is going to depend on the problem you're trying to solve.</p> <p>For figuring out why things are waiting, Brendan Gregg's write-up on <a href="http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html">off-CPU flame graphs is a pretty good start</a> if you don't have access to internal Google tools. For that matter, <a href="http://www.brendangregg.com/linuxperf.html">his entire site</a> is great if you're doing any kind of Linux performance analysis. There's info on Dtrace, ftrace, SystemTap, etc. Most tools you might use are covered, although <a href="https://github.com/jcsaezal/pmctrack">PMCTrack</a> is missing.</p> <p>The problem with all of these is that they're all much higher overhead than the things we've looked at today, so they can't be run in the background to catch and effectively replay any bug that comes along if you operate at scale. Yes, that includes dtrace, which I'm calling out in particular because any time you have one of these discussions, a dtrace troll will come along to say that dtrace has supported that for years. It's like the common lisp of trace tools, in terms of community trolling.</p>
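<p>To make the &quot;build it yourself&quot; option above a bit more concrete, here's a minimal sketch of what the 4-byte event record and per-core buffer might look like. The names are hypothetical and this assumes x86-64 with gcc or clang and a TSC in the low single-digit GHz range (so that <code>RDTSC &gt;&gt; 8</code> gives roughly 100ns resolution); it also ignores everything else a real implementation would need:</p> <pre><code>/* Sketch of the 4-byte trace record described above -- a toy, not a
 * production tracer. */
#include &lt;stdint.h&gt;
#include &lt;x86intrin.h&gt;                 /* __rdtsc() */

#define EVENTS_PER_CORE (1u &lt;&lt; 24)     /* 16M records * 4 bytes = 64MB per core */

struct core_trace {
    uint32_t buf[EVENTS_PER_CORE];     /* one of these per core, so cores don't
                                          fight over the same cache lines */
    uint32_t next;
} __attribute__((aligned(64)));

/* 20 bits of (RDTSC &gt;&gt; 8) gives ~100ns resolution and ~100ms of range on a
 * ~2.5GHz TSC; the remaining 12 bits hold an event code. */
static inline void trace_event(struct core_trace *t, uint32_t event_code)
{
    uint32_t ts = (uint32_t)(__rdtsc() &gt;&gt; 8) &amp; 0xFFFFF;   /* 20-bit timestamp */
    t-&gt;buf[t-&gt;next++ &amp; (EVENTS_PER_CORE - 1)] = (ts &lt;&lt; 12) | (event_code &amp; 0xFFF);
}
</code></pre> <p>With 40 cores, that's about 2.5GB of buffer space, which is in the range estimated above; a real version would also record a <code>gettimeofday</code>/<code>RDTSC</code> pair at least every 100ms, as described above, so the 20-bit timestamps can be mapped back to wall-clock time.</p>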
<p>Anyway, if you're on Windows, <a href="https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/">Bruce Dawson's site</a> seems to be the closest analogue to Brendan Gregg's site. If that doesn't have enough detail, <a href="https://www.amazon.com/gp/product/0735684189/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0735684189&amp;linkId=3119782cd4e0ad47568c854ed17b219f">there's always the Windows Internals books</a>.</p> <p>This is a bit far afield, but for problems where you want an easy way to get CPU performance counters, <a href="https://github.com/RRZE-HPC/likwid">likwid</a> is nice. It has a much nicer interface than <code>perf stat</code>, lets you easily get stats for just selected functions, etc.</p> <p><small> Thanks to Nathan Kurz, Xi Yang, Leah Hanson, John Gossman, Dick Sites, Hari Angepat, and Dan Puttick for comments/corrections/discussion.</p> <p>P.S. Xi Yang, one of the authors of SHIM, is finishing up his PhD soon and is going to be looking for work. If you want to hire a performance wizard, <a href="https://yangxi.github.io/">he has a CV and resume here</a>. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D">The talk is amazing and I recommend watching the talk instead of reading this post. I'm writing this up because I know if someone told me I should watch a talk instead of reading the summary, I wouldn't do it. Ok, fine. If you're like me, maybe you'd consider <a href="https://scholar.google.com/scholar?q=richard+l+sites">reading a couple of his papers</a> instead of reading this post. I once heard someone say that it's impossible to disagree with Dick's reasoning. You can disagree with his premises, but if you accept his premises and follow his argument, you have to agree with his conclusions. His presentation is impeccable and his logic is implacable. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:O">This oversimplifies things a bit since, if some level of cache is bandwidth limited, spending bandwidth to move data between cores could slow down other operations more than this operation is sped up by not having to wait. But even that's oversimplified since it doesn't take into account the <a href="//danluu.com/datacenter-power/">extra power it takes</a> to move data from a higher level cache as opposed to accessing the local cache. But that's also oversimplified, as is everything in this post. Reality is really complicated, and the more detail we want the less effective sampling profilers are. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:Y"><p>This sounds like a long time, but if you ask around you'll hear other versions of this story at every company that creates systems complex beyond human understanding. I know of one chip project at Sun that was delayed for multiple years because they couldn't track down some persistent bugs. At Microsoft, they famously spent two years tracking down a scrolling smoothness bug on Vista. The bug was hard enough to reproduce that they set up screens in the hallways so that they could casually see when the bug struck their test boxes. 
One clue was that the bug only struck high-end boxes with video cards, not low-end boxes with integrated graphics, but that clue wasn't sufficient to find the bug.</p> <p>After quite a while, they called the Xbox team in to use their profiling expertise to set up a system that could capture the bug, and once they had the profiler set up it immediately became apparent what the cause was. This was back in the AGP days, when upstream bandwidth was something like 1/10th downstream bandwidth. When memory would fill up, textures would get ejected, and while doing so, the driver would lock the bus and prevent any other traffic from going through. That took long enough that the video card became unresponsive, resulting in janky scrolling.</p> <p>It's really common to hear stories of bugs that can take an unbounded amount of time to debug if the proper tools aren't available.</p> <a class="footnote-return" href="#fnref:Y"><sup>[return]</sup></a></li> </ol> </div> We saw some really bad Intel CPU bugs in 2015 and we should expect to see more in the future cpu-bugs/ Sun, 10 Jan 2016 00:00:00 +0000 cpu-bugs/ <p>2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like <a href="//danluu.com/new-cpu-features/#update">cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs</a>, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they've been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Things haven't looked so great on the engineering/bugs side of things, though. We've seen a number of fairly serious CPU bugs and it looks like we should expect more in the future. I don't keep track of Intel bugs unless they're so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the <a href="http://xenbits.xen.org/xsa/advisory-156.html">bug found by Ben Serebrin and Jan Beulich</a>, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.</p> <p>Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I'm actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. 
If, buried deep in our infrastructure, we have a bit of code running at <a href="https://en.wikipedia.org/wiki/Deferred_Procedure_Call">DPC</a> that's vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">2</a></sup>.</p> <p>Shortly after the Serebrin/Beulich bug was found, a group of people found that running <a href="http://www.mersenne.org/download/">prime95</a>, a commonly used tool for benchmarking and burn-in, causes <a href="http://mersenneforum.org/showthread.php?t=20714">their entire system to lock up</a>. <a href="https://communities.intel.com/thread/96157?start=15&amp;tstart=0">Intel's response to this was</a>:</p> <blockquote> <p>Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.</p> </blockquote> <p>which reveals almost nothing about what's actually going on. If you look at their errata list, you'll find that this is typical, except that they normally won't even name the application that was used to trigger the bug. For example, <a href="http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf">one of the current errata lists</a> has entries like</p> <ul> <li>Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior</li> <li>AVX Gather Instruction That Should Result in #DF May Cause Unexpected System Behavior</li> <li>Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity</li> <li>Page Fault May Report Incorrect Fault Information</li> </ul> <p>As we've seen, “unexpected system behavior” can mean that we're completely screwed. Machine checks aren't great either -- they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary-sounding bugs.</p> <p>And keep in mind that the Intel errata list has the following disclaimer:</p> <blockquote> <p>Errata remain in the specification update throughout the product's lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.</p> </blockquote> <p>Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won't be able to find out what errata your older stepping has unless you're important enough to Intel.</p> <p>Anyway, back to 2015. We've seen at least two serious bugs in Intel CPUs in the last quarter<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup>, and it's almost certain there are more bugs lurking. 
Back when I worked at a company that produced Intel-compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who'd previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we'll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it's much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that. In part, that's because &quot;unpredictable system behavior&quot; has moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it's mostly because CPUs <a href="//danluu.com/new-cpu-features/">have gotten more complex, making them more difficult to test</a> and <a href="//danluu.com/cpu-backdoors/">audit</a> effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that's supposed to help us with security, but the virtualization is so complicated<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">4</a></sup> that the hardware virtualization implementation is likely to expose &quot;unpredictable system behavior&quot; bugs that wouldn't otherwise have existed. This isn't to say it's hopeless -- it's possible, in principle, to design CPUs such that a hang bug on one core doesn't crash the entire system. It's just that it's a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one's done the work because it hasn't previously seemed important.</p> <p>You'll often hear software folks say that these things don't matter because they can (sometimes) be patched. But, many devices will never get patched, which means that hardware security bugs will leave some devices vulnerable for their entire lifetime. And even if you don't care about consumers, serious bugs are very bad for CPU vendors. At a company I worked for, we once had a bug escape validation and get found after we shipped. One OEM wouldn't talk to us for something like five years after that, and other OEMs that continued working with us had to re-qualify their parts with our microcode patch and they made sure to let us know how expensive that was. Intel has enough weight that OEMs can't just walk away from them after a bug, but they don't have unlimited political capital and every serious bug uses up political capital, even if it can be patched.</p> <p>This isn't to say that we should try to get to zero bugs. There's always going to be a tradeoff between development speed and bug rate and the optimal point probably isn't zero bugs. But we're now regularly seeing severe bugs with security implications, which changes the tradeoff a lot. 
With something like the <a href="https://en.wikipedia.org/wiki/Pentium_FDIV_bug">FDIV bug</a> you can argue that it's statistically unlikely that any particular user who doesn't run numerical analysis code will be impacted, but security bugs are different. Attackers don't run random code, so you can't just say that it's unlikely that some condition will occur.</p> <h3 id="update">Update</h3> <p>After writing this, a person claiming to be an ex-Intel employee said &quot;even with your privileged access, you have no idea&quot; and a pseudo-anonymous commenter on reddit <a href="https://www.reddit.com/r/programming/comments/41k3yx/we_saw_some_really_bad_intel_cpu_bugs_in_2015_and/cz2zwag">made this comment</a>:</p> <blockquote> <p>As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.</p> <p>Why?</p> <p>Let me set the scene: It's late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: &quot;We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times... we can't live forever in the shadow of the early 90's FDIV bug, we need to move on. Our competition is moving much faster than we are&quot; - I'm paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn't explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn't seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.</p> </blockquote> <p>I haven't been able to confirm this story from another source I personally know, although another anonymous commenter said &quot;I left INTC in mid 2013. From validation. This ... is accurate compared with my experience.&quot; Another anonymous person, someone I know, didn't hear that speech, but found that at around that time, &quot;velocity&quot; became a buzzword and management spent a lot of time talking about how Intel needs more &quot;velocity&quot; to compete with ARM, which appears to confirm the sentiment, if not the actual speech.</p> <p>I've also heard from formal methods people that, around the timeframe mentioned in the first comment, there was an exodus of formal verification folks. One story I've heard is that people left because they were worried about being made redundant. I'm told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I've heard is that things got really strange due to Intel's focus on the mobile battle with ARM, and people wanted to leave before things got even worse. 
But it's hard to say if this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.</p> <p>I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they'd see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That's not better per-core performance -- that's better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.</p> <p>There's also <a href="http://www.tomshardware.com/forum/id-2830772/skylake-build-randomly-freezing-crashing.html">a mysterious hang during idle/low-activity bug that Intel doesn't seem to have figured out yet</a>.</p> <p>And then there's <a href="https://bugzilla.kernel.org/show_bug.cgi?id=103351">this Broadwell bug that hangs Linux if you don't disable low-power states</a>.</p> <p>And of course Intel isn't the only company with bugs -- this <a href="https://lists.debian.org/debian-security/2016/03/msg00084.html">AMD bug found by Robert Swiecki</a> not only allows a VM to crash its host, it also allows a VM to take over the host.</p> <p>I doubt I've even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.</p> <h3 id="more-updates">More updates</h3> <p>A number of folks have noticed <a href="https://forum.synology.com/enu/viewtopic.php?f=7&amp;t=119727&amp;start=60">unusual failure rates in storage devices</a> and switches. This <a href="https://www.servethehome.com/intel-atom-c2000-series-bug-quiet/">appears to be related to an Intel Atom bug</a>. I find this interesting because the Atom is a relatively simple chip, and therefore a relatively simple chip to verify. When the first-gen Atom was released, folks at Intel seemed proud of how few internal spins of the chip were needed to ship a working production chip, something made possible by the simplicity of the chip. Modern Atoms are more complicated, but not <em>that</em> much more complicated.</p> <p>Intel Skylake and Kaby Lake have <a href="https://lists.debian.org/debian-devel/2017/06/msg00308.html">a hyperthreading bug that's so serious that Debian recommends that users disable hyperthreading to avoid the bug</a>, which can &quot;cause spurious errors, such as application and system misbehavior, data corruption, and data loss&quot;.</p> <p>On the AMD side, there might be <a href="https://community.amd.com/message/2796982">a bug that's as serious as any recent Intel CPU bug</a>. If you read that linked thread, you'll see an AMD representative asking people to disable SMT and OPCache Control and to change LLC settings to possibly mitigate or narrow down a serious crashing bug. 
On <a href="https://forums.gentoo.org/viewtopic-t-1061546-postdays-0-postorder-asc-start-150.html">another thread</a>, you can find someone reporting an #MC exception with &quot;u-op cache crc mismatch&quot;.</p> <p>Although AMD's response in the forum was that these were isolated issues, <a href="http://www.phoronix.com/scan.php?page=article&amp;item=ryzen-segv-continues">phoronix was able to reproduce crashes by running a stress test that consists of compiling a number of open source programs</a>. They report they were able to get 53 segfaults with one hour of attempted compilation.</p> <p>Some FreeBSD folks have also noticed seemingly unrelated crashes and have been able to get a reproduction by running code at a high address and then firing an interrupt. <a href="https://svnweb.freebsd.org/base?view=revision&amp;revision=321899">This can result in a hang or a crash</a>. The reason this appears to be unrelated to the first reported Ryzen issues is that this is easily reproducible with SMT disabled.</p> <p>Matt Dillon found an AMD bug triggered by DragonflyBSD, and <a href="http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20">committed a tiny patch to fix it</a>:</p> <blockquote> <p>There is a bug in Ryzen related to the kernel iretq'ing into a high user %rip address near the end of the user address space (top of user stack). This is a temporary workaround for the issue.</p> <p>The original %rip for sigtramp was 0x00007fffffffffe0. Moving it down to fa0 wasn't sufficient. Moving it down to f00 moved the bug from nearly instant to taking a few hours to reproduce. Moving it down to be0 it took a day to reproduce. Moving it down to 0x00007ffffffffba0 (this commit) survived the overnight test.</p> </blockquote> <h3 id="meltdown-spectre-https-meltdownattack-com-update"><a href="https://meltdownattack.com/">Meltdown / spectre</a> update</h3> <p>This is an interesting class of attack that takes advantage of speculative execution plus side channel attacks to leak privileged information into user processes. It seems that at least some of these attacks can be done from javascript in the browser.</p> <p>Regarding the comments in the first couple updates on Intel's attitude towards validation recently, <a href="https://news.ycombinator.com/item?id=16059635">another person claiming to be ex-Intel backs up the statements above</a>:</p> <blockquote> <p>As a former Intel employee this aligns closely with my experience. I didn't work in validation (actually joined as part of Altera) but velocity is an absolute buzzword and the senior management's approach to complex challenges is sheer panic. Slips in schedules are not tolerated at all - so problems in validation are an existential threat, your project can easily just be canned. Also, because of the size of the company the ways in which quality and completeness are 'acheived' is hugely bureaucratic and rarely reflect true engineering fundamentals.</p> </blockquote> <h3 id="2024-update">2024 update</h3> <p>We're approaching a decade since I wrote this post and the serious CPU bugs keep coming. For example, this recent one was found by <a href="https://www.radgametools.com/oodleintel.htm">RAD tools</a>:</p> <blockquote> <p>Intel Processor Instability Causing Oodle Decompression Failures</p> <p>We believe that this is a hardware problem which affects primarily Intel 13900K and 14900K processors, less likely 13700, 14700 and other related processors as well. 
Only a small fraction of those processors will exhibit this behavior. The problem seems to be caused by a combination of BIOS settings and the high clock rates and power usage of these processors, leading to system instability and unpredictable behavior under heavy load ... Any programs which heavily use the processor on many threads may cause crashes or unpredictable behavior. There have been crashes seen in RealBench, CineBench, Prime95, Handbrake, Visual Studio, and more. This problem can also show up as a GPU error message, such as spurious &quot;out of video memory&quot; errors, even though it is caused by the CPU.</p> </blockquote> <p>One can argue that this is a configuration bug, but from the standpoint of a typical user, all they observe is that their CPU is causing crashes. And, realistically, Intel knows that their CPUs are shipping into systems with these settings. The mitigation for this involves changing settings like &quot;SVID behavior&quot; → &quot;Intel fail safe&quot;, &quot;Long duration power limit&quot; → reduce to 125W if set higher (&quot;Processor Base Power&quot; on ARK), and &quot;Short duration power limit&quot; → reduce to 253W if set higher (for 13900/14900 CPUs; other CPUs have other limits; &quot;Maximum Turbo Power&quot; on ARK), etc.</p> <p>If they wanted their CPUs to not crash due to this issue, they could have and should have enforced these settings as well as some others. Instead, they left this up to the BIOS settings, and here we are.</p> <p>Historically, Intel was much more serious about verification, validation, and testing than AMD and we saw this in their output. At one point, when a lot of enthusiast sites were excited about AMD (in the K7 days), Google stopped using AMD and basically banned purchases of AMD CPUs because they were so buggy and had caused so many hard-to-debug problems. But, over time, the relative level of verification/validation/test effort Intel allocates has gone down and Intel seems to have nearly caught, or maybe caught, AMD in their rate of really serious bugs. Considering Intel's current market position, with very heavy pressure from AMD, ARM, and Nvidia, it seems unlikely that Intel will turn this around in the foreseeable future. Nvidia, historically, has been significantly buggier than AMD or Intel, so Intel still has quite a bit of room to run to become the most buggy major chip manufacturer. Considering that Nvidia is one of the biggest threats to Intel and how Intel responded to threats from other, then-buggier, manufacturers, it seems like we should expect an even higher rate of bad bugs in the coming decade.</p> <p>On the specific bug, there's tremendous pressure to operate more like a &quot;move fast and break things&quot; software company than a traditional, conservative, CPU manufacturer for multiple reasons. When you manufacture a CPU, how fast it will run ends up being somewhat random and there's no reliable way to tell how fast it will run other than testing it, so CPU companies run a set of tests on the CPU to see how fast it will go. This test time is actually fairly expensive, so there's a lot of work done to try to find the smallest set of tests possible that will correctly determine how fast the CPU can operate. 
One easy way to cut costs here is to just run fewer tests even if the smaller set of tests doesn't fully guarantee that the CPU can operate at the speed it's sold at.</p> <p>Another factor influencing this is that CPUs that are sold as nominally faster can sell for more, so there's also pressure to push the CPUs as close to their limits as possible. One way we can see that the margin here has, in general, decreased is by looking at how overclockable CPUs are. People are often happy with their overclocked CPU if they run a few tests, like prime95, stresstest, etc., and their part doesn't crash, but this isn't nearly enough to determine if the CPU can really run everything a user could throw at it. If you really try to seriously test a CPU (working at an Intel competitor, we would do this regularly), you'll find that Intel and other CPU companies have pushed the limit of how fast they claim their CPUs are relative to how fast they actually are, which sometimes results in CPUs being sold that have been pushed beyond their capabilities.</p> <p>On overclocking, as Fabian Giesen of RAD notes,</p> <blockquote> <p>This stuff is not sanctioned and will count as overclocking if you try to RMA it but it's sold as a major feature of the platform and review sites test with it on.</p> </blockquote> <p>Daniel Gibson replied with</p> <blockquote> <p>hmm on my mainboard (ASUS ROG Strix B550-A Gaming -clearly gaming hardware, but middle price range) I had to explicitly enable the XMP/EXPO profile for the DDR4-RAM to run at full speed - which is DDR4-3200, officially supported by the CPU (Ryzen 5950X). Otherwise it ran at DDR4-2400 speed, I think? Or was it 2133? I forgot, at least significantly lower</p> </blockquote> <p>To which Fabian noted</p> <blockquote> <p>Correct. Fun fact: turning on EXPO technically voids your warranty ... it's great; both the CPU and the RAM list it as supported but it's officially not.</p> <p>One might call it a racket, if one were inclined to such incisive language.</p> </blockquote> <p>Intel didn't use to officially unofficially support this kind of thing. And, more generally, CPU manufacturers were historically very hesitant to ship parts that had a non-negligible risk of crashes and data corruption when used as intended if they could avoid it, but more and more of these bugs keep happening. Some end up becoming quite public, like this, due to someone publishing a report about them like the RAD report above. And some get quietly reported to the CPU manufacturer by a huge company, often with some kind of NDA agreement, where the big company gets replacement CPUs and Intel or another manufacturer quietly ships firmware fixes to the issue. 
And it surely must be the case that some of these aren't really caught at all, unless you count the occasional data corruption or crash as being caught.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><small> Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, JonLuca De Caro, Jeff Fowler, and a number of anonymous tipsters for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">As with <a href="google-wage-fixing/">the Apple, Google, Adobe, etc., wage-fixing agreement</a>, legal systems are sending the clear message that businesses should engage in illegal and unethical behavior since they'll end up getting fined a small fraction of what they gain. This is the opposite of the Becker-ian policy that's applied to individuals, where sentences have gotten jacked up on the theory that, since many criminals aren't caught, the criminals that are caught should have severe punishments applied as a deterrence mechanism. The theory is that the criminals will rationally calculate the expected sentence from a crime, and weigh that against the expected value of a crime. If, for example, the odds of being caught are 1% and we increase the expected sentence from 6 months to 50 years, criminals will calculate that the expected sentence has changed from 2 days to 6 months, thereby reducing the effective value of the crime and causing a reduction in crime. We now have decades of evidence that the theory that long sentences will deter crime is either empirically false or that the effect is very small; turns out that people who impulse commit crimes don't deeply study sentencing guidelines before they commit crimes. Ironically, for white-collar corporate crimes where Becker's theory might more plausibly hold, Becker's theory isn't applied. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:C">Something I find curious is how non-linear the level of effort of the attacks is. Google, Microsoft, and Amazon face regular, persistent, attacks, and if they couldn't trivially mitigate <a href="http://status.linode.com/incidents/mmdbljlglnfd">the kind of unsophisticated attack that's been severely affecting Linode availability for weeks</a>, they wouldn't be able to stay in business. If you talk to people at various bay area unicorns, you'll find that a lot of them have accidentally DoS'd themselves when they hit an external API too hard during testing. In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that? 
<a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:A">The fault into microcode infinite loop also affects AMD processors, but basically no one runs a cloud on AMD chips. I'm pointing out Intel examples because Intel bugs have higher impact, not because Intel is buggier. Intel has a much better track record on bugs than AMD. IBM is the only major microprocessor company I know of that's been more serious about hardware verification than Intel, but if you have an IBM system running AIX, I could tell you some stories that will make your hair stand on end. Moreover, it's not clear how effective their verification groups can be since they've been losing experienced folks without being able to replace them for over a decade, but that's a topic for another post. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:V">See <a href="https://github.com/vishmohan/vmlaunch">this code</a> for a simple example of how to use Intel's API for this. The example is simplified, so much so that it's not really useful except as a learning aid, and it still turns out to be around 1000 lines of low-level code. <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> Normalization of deviance wat/ Tue, 29 Dec 2015 00:00:00 +0000 wat/ <p>Have you ever mentioned something that seems totally normal to you only to be greeted by surprise? Happens to me all the time when I describe something everyone at work thinks is normal. For some reason, my conversation partner's face morphs from pleasant smile to rictus of horror. Here are a few representative examples.</p> <p>There's the company that is perhaps the nicest place I've ever worked, combining the best parts of Valve and Netflix. The people are amazing and you're given near total freedom to do whatever you want. But as a side effect of the culture, they lose perhaps half of new hires in the first year, some voluntarily and some involuntarily. Totally normal, right? Here are a few more anecdotes that were considered totally normal by people in places I've worked. And often not just normal, but laudable.</p> <p>There's the company that's incredibly secretive about infrastructure. For example, there's the team that was afraid that, if they reported bugs to their hardware vendor, the bugs would get fixed and their competitors would be able to use the fixes. Solution: request the firmware and fix bugs themselves! More recently, I know a group of folks outside the company who tried to reproduce the algorithm in the paper the company published earlier this year. The group found that they couldn't reproduce the result, and that the algorithm in the paper resulted in an unusual level of instability; when asked about this, one of the authors responded “well, we have some tweaks that didn't make it into the paper” and declined to share the tweaks, i.e., the company purposely published an unreproducible result to avoid giving away the details, as is normal. This company enforces secrecy by having a strict policy of firing leakers. This is introduced at orientation with examples of people who got fired for leaking (e.g., the guy who leaked that a concert was going to happen inside a particular office), and by announcing firings for leaks at the company all hands. 
The result of those policies is that I know multiple people who are afraid to forward emails about things like updated info on health insurance to a spouse for fear of forwarding the wrong email and getting fired; instead, they use another computer to retype the email and pass it along, or take photos of the email on their phone.</p> <p>There's the office where I asked one day about the fact that I almost never saw two particular people in the same room together. I was told that they had a feud going back a decade, and that things had actually improved — for years, they literally couldn't be in the same room because one of the two would get too angry and do something regrettable, but things had now cooled to the point where the two could, occasionally, be found in the same wing of the office or even the same room. These weren't just random people, either. They were the two managers of the only two teams in the office.</p> <p>There's the company whose culture is so odd that, when I sat down to write a post about it, I found that I'd <a href="//danluu.com/startup-tradeoffs/#fn:N">not only written more than for any other single post, but more than all other posts combined</a> (which is well over 100k words now, the length of a moderately sized book). This is the same company where someone recently explained to me how great it is that, instead of using data to make decisions, we use political connections, and that the idea of making decisions based on data is a myth anyway; no one does that. This is also the company where all four of the things they told me to get me to join were false, and the job ended up being the one thing I specifically said I didn't want to do. When I joined this company, my team didn't use version control for months and it was a real fight to get everyone to use version control. Although I won that fight, I lost the fight to get people to run a build, let alone run tests, before checking in, so the build is broken multiple times per day. When I mentioned that I thought this was a problem for our productivity, I was told that it's fine because it affects everyone equally. Since the only thing that mattered was my stack-ranked productivity, I shouldn't care that it impacts the entire team; the fact that it's normal for everyone means that there's no cause for concern.</p> <p>There's the company that created multiple massive initiatives to recruit more women into engineering roles, <a href="//danluu.com/tech-discrimination/">where women still get rejected in recruiter screens for not being technical enough</a> after being asked questions like &quot;was your experience with algorithms or just coding?&quot;. I thought that my referral with a very strong recommendation would have prevented that, but it did not.</p> <p>There's the company where I worked on a four person effort with a multi-hundred million dollar budget and a billion dollar a year impact, where requests for things that cost hundreds of dollars routinely took months or were denied.</p> <p>You might wonder if I've just worked at places that are unusually screwed up. Sure, the companies are generally considered to be ok places to work and two of them are considered to be among the best places to work, but maybe I've just ended up at places that are overrated. 
But I have the same experience when I hear stories about how other companies work, even places with stellar engineering reputations, except that it's me that's shocked and my conversation partner who thinks their story is normal.</p> <p>There are the companies that use <a href="https://github.com/box/flaky">@flaky</a>, a group that includes the vast majority of Python-using SF Bay area unicorns. If you don't know what this is, this is a library that lets you add a Python decorator to those annoying flaky tests that sometimes pass and sometimes fail. When I asked multiple co-workers and former co-workers from three different companies what they thought this did, they all guessed that it re-runs the test multiple times and reports a failure if any of the runs fail. Close, but not quite. It's technically possible to use @flaky for that, but in practice it's used to re-run the test multiple times and report a pass if any of the runs pass. The company that created @flaky is effectively a storage infrastructure company, and the library is widely used at its biggest competitor.</p> <p>There's the company with a reputation for having great engineering practices that had 2 9s of reliability last time I checked, for reasons that are entirely predictable from their engineering practices. This is the second thing in a row that can't be deanonymized because multiple companies fit the description. Here, I'm not talking about companies trying to be the next reddit or twitter where it's, apparently, totally fine to have 1 9. I'm talking about companies that sell platforms that other companies rely on, where an outage will cause dependent companies to pause operations for the duration of the outage. Multiple companies that build infrastructure have practices that lead to 2 9s of reliability.</p> <p>As far as I can tell, what happens at a lot of these companies is that they started by concentrating almost totally on product growth. That's completely and totally reasonable, because companies are worth approximately zero when they're founded; they don't bother with things that protect them from losses, like good ops practices or actually having security, because there's nothing to lose (well, except for user data when the inevitable security breach happens, and if you talk to security folks at unicorns you'll know that these happen).</p> <p>The result is a culture where people are hyper-focused on growth and ignore risk. That culture tends to stick even after the company has grown to be worth well over a billion dollars, and the companies have something to lose. Anyone who comes into one of these companies from Google, Amazon, or another place with solid ops practices is shocked. Often, they try to fix things, and then leave when they can't make a dent.</p> <p>Google probably has the best ops and security practices of any tech company today. It's easy to say that you should take these things as seriously as Google does, but it's instructive to see how they got there. If you look at the codebase, you'll see that various services have names ending in z, as do a curiously large number of variables. I'm told that's because, once upon a time, someone wanted to add monitoring. It wouldn't really be secure to have <code>google.com/somename</code> expose monitoring data, so they added a z. <code>google.com/somenamez</code>. For security. At the company that is now the best in the world at security. 
They're now so good at security that multiple people I've talked to (all of whom joined after this happened) vehemently deny that this ever happened, even though the reasons they give don't really make sense (e.g., to avoid name collisions) and I have this from sources who were there at the time this happened.</p> <p>Google didn't go from adding z to the end of names to having the world's best security because someone gave a rousing speech or wrote a convincing essay. They did it after getting embarrassed a few times, which gave people who wanted to do things “right” the leverage to fix fundamental process issues. It's the same story at almost every company I know of that has good practices. Microsoft was a joke in the security world for years, until multiple disastrously bad exploits forced them to get serious about security. This makes it sound simple, but if you talk to people who were there at the time, the change was brutal. Despite a mandate from the top, there was vicious political pushback from people whose position was that the company got to where it was in 2003 without wasting time on practices like security. Why change what's worked?</p> <p>You can see this kind of thing in every industry. A classic example that tech folks often bring up is hand-washing by doctors and nurses. It's well known that germs exist, and that washing hands properly very strongly reduces the odds of transmitting germs and thereby significantly reduces hospital mortality rates. Despite that, trained doctors and nurses still often don't do it. Interventions are required. Signs reminding people to wash their hands save lives. But when people stand at hand-washing stations to require others walking by to wash their hands, even more lives are saved. People can ignore signs, but they can't ignore being forced to wash their hands.</p> <p>This mirrors a number of attempts at tech companies to introduce better practices. If you tell people they should do it, that helps a bit. If you enforce better practices via code review, that helps a lot.</p> <p>The data are clear that humans are really bad at taking the time to do things that are well understood to incontrovertibly reduce the risk of rare but catastrophic events. We will rationalize that taking shortcuts is the right, reasonable thing to do. There's a term for this: the normalization of deviance. It's well studied in a number of other contexts including healthcare, aviation, mechanical engineering, aerospace engineering, and civil engineering, but we don't see it discussed in the context of software. In fact, I've never seen the term used in the context of software.</p> <p>Is it possible to learn from others' mistakes instead of making every mistake ourselves? The state of the industry makes this sound unlikely, but let's give it a shot. <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2821100/">John Banja has a nice summary paper on the normalization of deviance in healthcare</a>, with lessons we can attempt to apply to software development. One thing to note is that, because Banja is concerned with patient outcomes, there's a close analogy to devops failure modes, but normalization of deviance also occurs in cultural contexts that are less directly analogous.</p> <p>The first section of the paper details a number of disasters, both in healthcare and elsewhere. 
Here's one typical example:</p> <blockquote> <p>A catastrophic negligence case that the author participated in as an expert witness involved an anesthesiologist's turning off a ventilator at the request of a surgeon who wanted to take an x-ray of the patient's abdomen (Banja, 2005, pp. 87-101). The ventilator was to be off for only a few seconds, but the anesthesiologist forgot to turn it back on, or thought he turned it back on but had not. The patient was without oxygen for a long enough time to cause her to experience global anoxia, which plunged her into a vegetative state. She never recovered, was disconnected from artificial ventilation 9 days later, and then died 2 days after that. It was later discovered that the anesthesia alarms and monitoring equipment in the operating room had been deliberately programmed to a “suspend indefinite” mode such that the anesthesiologist was not alerted to the ventilator problem. Tragically, the very instrumentality that was in place to prevent such a horror was disabled, possibly because the operating room staff found the constant beeping irritating and annoying.</p> </blockquote> <p>Turning off or ignoring notifications because there are too many of them and they're too annoying? An erroneous manual operation? This could be straight out of the post-mortem of more than a few companies I can think of, except that the result was a tragic death instead of the loss of millions of dollars. If you <a href="//danluu.com/postmortem-lessons/">read a lot of tech post-mortems</a>, every example in Banja's paper will feel familiar even though the details are different.</p> <p>The section concludes,</p> <blockquote> <p>What these disasters typically reveal is that the factors accounting for them usually had “<a href="http://www.jstor.org/stable/223506?seq=1#page_scan_tab_contents">long incubation periods, typified by rule violations, discrepant events that accumulated unnoticed, and cultural beliefs about hazards that together prevented interventions that might have staved off harmful outcomes</a>”. Furthermore, it is especially striking how multiple rule violations and lapses can coalesce so as to enable a disaster's occurrence.</p> </blockquote> <p>Once again, this could be from an article about technical failures. That makes the next section, on why these failures happen, seem worth checking out. The reasons given are:</p> <h4 id="the-rules-are-stupid-and-inefficient">The rules are stupid and inefficient</h4> <p>The example in the paper is about delivering medication to newborns. To prevent “drug diversion,” nurses were required to enter their password onto the computer to access the medication drawer, get the medication, and administer the correct amount. In order to ensure that the first nurse wasn't stealing drugs, if any drug remained, another nurse was supposed to observe the process, and then enter their password onto the computer to indicate they witnessed the drug being properly disposed of.</p> <p>That sounds familiar. How many technical postmortems start off with “someone skipped some steps because they're inefficient”, e.g., “the programmer force pushed a bad config or bad code because they were sure nothing could go wrong and skipped staging/testing”? The infamous <a href="https://azure.microsoft.com/en-us/blog/final-root-cause-analysis-and-improvement-areas-nov-18-azure-storage-service-interruption/">November 2014 Azure outage</a> happened for just that reason. 
At around the same time, a dev at one of Azure's competitors overrode the rule that you shouldn't push a config that fails tests because they knew that the config couldn't possibly be bad. When that caused the canary deploy to start failing, they overrode the rule that you can't deploy from canary into staging with a failure because they knew their config couldn't possibly be bad and so the failure must be from something else. That postmortem revealed that the config was technically correct, but exposed a bug in the underlying software; it was pure luck that the latent bug the config revealed wasn't as severe as the Azure bug.</p> <p>Humans are bad at reasoning about how failures cascade, so we implement bright line rules about when it's safe to deploy. But the same thing that makes it hard for us to reason about when it's safe to deploy makes the rules seem stupid and inefficient.</p> <h4 id="knowledge-is-imperfect-and-uneven">Knowledge is imperfect and uneven</h4> <p>People don't automatically know what should be normal, and when new people are onboarded, they can just as easily learn deviant processes that have become normalized as reasonable processes.</p> <p><a href="http://jvns.ca">Julia Evans</a> described to me how this happens:</p> <p><em>new person joins</em><br> <strong>new person</strong>: WTF WTF WTF WTF WTF<br> <strong>old hands</strong>: yeah we know we're concerned about it<br> <strong>new person</strong>: WTF WTF wTF wtf wtf w...<br> <em>new person gets used to it</em><br> <em>new person #2 joins</em><br> <strong>new person #2</strong>: WTF WTF WTF WTF<br> <strong>new person</strong>: yeah we know. we're concerned about it.<br></p> <p>The thing that's really insidious here is that people will really buy into the WTF idea, and they can spread it elsewhere for the duration of their career. Once, after doing some work <a href="//danluu.com/julialang/">on an open source project that's regularly broken and being told that it's normal to have a broken build, and that they were doing better than average</a>, I ran the numbers, found that project was basically worst in class, and wrote something about the idea that <a href="//danluu.com/broken-builds/">it's possible to have a build that nearly always passes with relatively low effort</a>. The most common comment I got in response was, &quot;Wow that guy must work with superstar programmers. But let's get real. We all break the build at least a few times a week&quot;, as if running tests (or for that matter, even attempting to compile) before checking code in requires superhuman abilities. But once people get convinced that some deviation is normal, they often get really invested in the idea.</p> <h4 id="i-m-breaking-the-rule-for-the-good-of-my-patient">I'm breaking the rule for the good of my patient</h4> <p>The example in the paper is of someone who breaks the rule that you should wear gloves when finding a vein. Their reasoning is that wearing gloves makes it harder to find a vein, which may result in their having to stick a baby with a needle multiple times. It's hard to argue against that. No one wants to cause a baby extra pain!</p> <p>The second worst outage I can think of occurred when someone noticed that a database service was experiencing slowness. They pushed a fix to the service, and in order to prevent the service degradation from spreading, they ignored the rule that you should do a proper, slow, staged deploy. Instead, they pushed the fix to all machines. It's hard to argue against that. 
No one wants their customers to have degraded service! Unfortunately, the fix exposed a bug that caused a global outage.</p> <h4 id="the-rules-don-t-apply-to-me-you-can-trust-me">The rules don't apply to me/You can trust me</h4> <blockquote> <p>most human beings perceive themselves as good and decent people, such that they can understand many of their rule violations as entirely rational and ethically acceptable responses to problematic situations. They understand themselves to be doing nothing wrong, and will be outraged and often fiercely defend themselves when confronted with evidence to the contrary.</p> </blockquote> <p>As companies grow up, they eventually have to impose security that prevents every employee from being able to access basically everything. And at most companies, when that happens, some people get really upset. “Don't you trust me? If you trust me, how come you're revoking my access to X, Y, and Z?”</p> <p>Facebook famously <a href="http://www.wsj.com/articles/SB10001424052702304898704577478483529665936">let all employees access everyone's profile</a> for a long time, and you can even find HN comments indicating that some recruiters would explicitly mention that as a perk of working for Facebook. And I can think of more than one well-regarded unicorn where everyone still has access to basically everything, even after their first or second bad security breach. It's hard to get the political capital to restrict people's access to what they believe they need, or are entitled, to know. A lot of trendy startups have core values like “trust” and “transparency” which make it difficult to argue against universal access.</p> <h4 id="workers-are-afraid-to-speak-up">Workers are afraid to speak up</h4> <p>There are people I simply don't give feedback to because I can't tell if they'd take it well or not, and once you say something, it's impossible to un-say it. In the paper, the author gives an example of a doctor with poor handwriting who gets mean when people ask him to clarify what he's written. As a result, people guess instead of asking.</p> <p>In most company cultures, people feel weird about giving feedback. Everyone has stories about a project that lingered on for months or years after it should have been terminated because no one was willing to offer explicit feedback. This is a problem even when cultures discourage meanness and encourage feedback: cultures of niceness seem to have as many issues around speaking up as cultures of meanness, if not more. In some places, people are afraid to speak up because they'll get attacked by someone mean. In others, they're afraid because they'll be branded as mean. It's a hard problem.</p> <h4 id="leadership-withholding-or-diluting-findings-on-problems">Leadership withholding or diluting findings on problems</h4> <p>In the paper, this is characterized by flaws and weaknesses being diluted as information flows up the chain of command. One example is how a supervisor might take sub-optimal actions to avoid looking bad to superiors.</p> <p>I was shocked the first time I saw this happen. I must have been half a year or a year out of school. I saw that we were doing something obviously non-optimal, and brought it up with the senior person in the group. He told me that he didn't disagree, but that if we did it my way and there was a failure, it would be really embarrassing. 
He acknowledged that my way reduced the chance of failure without making the technical consequences of failure worse, but it was more important that we not be embarrassed. Now that I've been working for a decade, I have a better understanding of how and why people play this game, but I still find it absurd.</p> <h4 id="solutions">Solutions</h4> <p>Let's say you notice that your company has a problem that I've heard people at most companies complain about: people get promoted for heroism and putting out fires, not for preventing fires; and people get promoted for shipping features, not for doing critical maintenance work and bug fixing. How do you change that?</p> <p>The simplest option is to just do the right thing yourself and ignore what's going on around you. That has some positive impact, but the scope of your impact is necessarily limited. Next, you can convince your team to do the right thing: I've done that a few times for practices I feel are really important and are sticky, so that I won't have to continue to expend effort on convincing people once things get moving.</p> <p>But if the incentives are aligned against you, it will require an ongoing and probably unsustainable effort to keep people doing the right thing. In that case, the problem becomes convincing someone to change the incentives, and then making sure the change works as designed. How to convince people is worth discussing, but long and messy enough that it's beyond the scope of this post. As for making the change work, I've seen many “obvious” mistakes repeated, both in places I've worked and those whose internal politics I know a lot about.</p> <p>Small companies have it easy. When I worked at a 100 person company, the hierarchy was individual contributor (IC) -&gt; team lead (TL) -&gt; CEO. That was it. The CEO had a very light touch, but if he wanted something to happen, it happened. Critically, he had a good idea of what everyone was up to and could basically adjust rewards in real-time. If you did something great for the company, there's a good chance you'd get a raise. Not in nine months when the next performance review cycle came up, but basically immediately. Not all small companies do that effectively, but with the right leadership, they can. That's impossible for large companies.</p> <p>At large company A (LCA), they had the problem we're discussing and a mandate came down to reward people better for doing critical but low-visibility grunt work. There were too many employees for the mandator to directly make all decisions about compensation and promotion, but the mandator could review survey data, spot check decisions, and provide feedback until things were normalized. My subjective perception is that the company never managed to achieve parity between boring maintenance work and shiny new projects, but got close enough that people who wanted to make sure things worked correctly didn't have to significantly damage their careers to do it.</p> <p>At large company B (LCB), ICs agreed that it's problematic to reward creating new features more richly than doing critical grunt work. When I talked to managers, they often agreed, too. But nevertheless, the people who get promoted are disproportionately those who ship shiny new things. I saw management attempt a number of cultural and process changes at LCB. Mostly, those took the form of pronouncements from people with fancy titles. 
For really important things, they might produce a video, and enforce compliance by making people take a multiple choice quiz after watching the video. The net effect I observed among other ICs was that people talked about how disconnected management was from the day-to-day life of ICs. But, for the same reasons that normalization of deviance occurs, that information seems to have no way to reach upper management.</p> <p>It's sort of funny that this ends up being a problem about incentives. As an industry, we spend a lot of time thinking about how to incentivize consumers to do what we want. But then we set up incentive systems that are generally agreed upon as incentivizing us to do the wrong things, and we do so via a combination of a game of telephone and cargo cult diffusion. Back when Microsoft was ascendant, we copied their interview process and asked brain-teaser interview questions. Now that Google is ascendant, we copy their interview process and ask algorithms questions. If you look around at trendy companies that are younger than Google, most of them basically copy their ranking/leveling system, with some minor tweaks. The good news is that, unlike many companies people previously copied, Google has put a lot of thought into most of their processes and made data driven decisions. The bad news is that Google is unique in a number of ways, which means that their reasoning often doesn't generalize, and that <a href="//danluu.com/why-ecc/">people often cargo cult practices long after they've become deprecated at Google</a>.</p> <p>This kind of diffusion happens for technical decisions, too. Stripe <a href="https://www.hakkalabs.co/articles/trouble-with-event-processing-try-the-method-that-stripe-has-mastered-with-mongodb">built a reliable message queue on top of Mongo</a>, so we build <a href="https://aphyr.com/posts/284-jepsen-mongodb">reliable</a> message <a href="http://stackoverflow.com/questions/9274777/mongodb-as-a-queue-service">queues</a> on top of <a href="https://aphyr.com/posts/322-jepsen-mongodb-stale-reads">Mongo</a><sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">1</a></sup>. It's cargo cults all the way down<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup>.</p> <p>The paper has specific sub-sections on how to prevent normalization of deviance, which I recommend reading in full.</p> <ul> <li>Pay attention to weak signals</li> <li>Resist the urge to be unreasonably optimistic</li> <li>Teach employees how to conduct emotionally uncomfortable conversations</li> <li>System operators need to feel safe in speaking up</li> <li>Realize that oversight and monitoring are never-ending</li> </ul> <p>Let's look at how the first one of these, “pay attention to weak signals”, interacts with a single example, the “WTF WTF WTF” a new person gives off when they join the company.</p> <p>If a VP decides something is screwed up, people usually listen. It's a strong signal. And when people don't listen, the VP knows what levers to pull to make things happen. But when someone new comes in, they don't know what levers they can pull to make things happen or who they should talk to, almost by definition. They give out weak signals that are easily ignored. By the time they learn enough about the system to give out strong signals, they've acclimated.</p> <p>“Pay attention to weak signals” sure sounds like good advice, but how do we do it? Strong signals are few and far between, making them easy to pay attention to. Weak signals are abundant. 
How do we filter out the ones that aren't important? And how do we get an entire team or org to actually do it? These kinds of questions can't be answered in a generic way; this takes real thought. We mostly put this thought elsewhere. Startups spend a lot of time thinking about growth, and while they'll all tell you that they care a lot about engineering culture, revealed preference shows that they don't. With a few exceptions, big companies aren't much different. At LCB, I looked through the competitive analysis slide decks and they're amazing. They look at every last detail on hundreds of products to make sure that everything is as nice for users as possible, from onboarding to interop with competing products. If there's any single screen where things are more complex or confusing than any competitor's, people get upset and try to fix it. It's quite impressive. And then when LCB onboards employees in my org, a third of them are missing at least one of an alias/account, an office, or a computer, a condition which can persist for weeks or months. The competitive analysis slide decks talk about how important onboarding is because you only get one chance to make a first impression, and then employees are onboarded with the impression that the company couldn't care less about them and that it's normal for quotidian processes to be pervasively broken. LCB can't even get the basics of employee onboarding right, let alone really complex things like acculturation. This is understandable — external metrics like user growth or attrition are measurable, and targets like how to tell if you're acculturating people so that they don't ignore weak signals are softer and harder to determine, but that doesn't mean they're any less important. People write a lot about how things like using fancier languages or techniques like TDD or agile will make your teams more productive, but having a strong engineering culture is a much larger force multiplier.</p> <p><small> Thanks to Sophie Smithburg and Marc Brooker for introducing me to the term Normalization of Deviance, and Kelly Eskridge, Leah Hanson, Sophie Rapoport, Sophie Smithburg, Julia Evans, Dmitri Kalintsev, Ralph Corderoy, Jamie Brandon, Egor Neliuba, and Victor Felder for comments/corrections/discussion. </small></p> <p><link rel="prefetch" href="//danluu.com/startup-tradeoffs/"> <link rel="prefetch" href="//danluu.com/why-ecc/"> <link rel="prefetch" href="//danluu.com/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:J"><p>People seem to think I'm joking here. I can understand why, but try Googling <code>mongodb message queue</code>. You'll find statements like “replica sets in MongoDB work extremely well to allow automatic failover and redundancy”. Basically every company I know of that's done this and has anything resembling scale finds this to be non-optimal, to say the least, but you can't actually find blog posts or talks that discuss that. All you see are the posts and talks from when they first tried it and are in the honeymoon period. This is common with many technologies. You'll mostly find glowing recommendations in public even when, in private, people will tell you about all the problems. 
Today, if you do the search mentioned above, you'll get a ton of posts talking about how amazing it is to build a message queue on top of Mongo, this footnote, and maybe a couple of blog posts by Kyle Kingsbury depending on your exact search terms.</p> <p>If there were an acute failure, you might see a postmortem, but while we'll do postmortems for &quot;the site was down for 30 seconds&quot;, we rarely do postmortems for &quot;this takes 10x as much ops effort as the alternative and it's a death by a thousand papercuts&quot;, &quot;we architected this thing poorly and now it's very difficult to make changes that ought to be trivial&quot;, or &quot;a competitor of ours was able to accomplish the same thing with an order of magnitude less effort&quot;. I'll sometimes do informal postmortems by asking everyone involved oblique questions about what happened, but more for my own benefit than anything else, because I'm not sure people really want to hear the whole truth. This is especially sensitive if the effort has generated a round of promotions, which seems to be more common the more screwed up the project. The larger the project, the more visibility and promotions, even if the project could have been done with much less effort.</p> <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> <li id="fn:P">I've spent a lot of time asking about why things are the way they are, both in areas where things are working well, and in areas where things are going badly. Where things are going badly, everyone has ideas. But where things are going well, as in the small company with the light-touch CEO mentioned above, almost no one has any idea why things work. It's magic. If you ask, people will literally tell you that it seems really similar to some other place they've worked, except that things are magically good instead of being terrible for reasons they don't understand. But it's not magic. It's hard work that very few people understand. Something I've seen multiple times is that, when a VP leaves, a company will become a substantially worse place to work, and it will slowly dawn on people that the VP was doing an amazing job at supporting not only their direct reports, but also making sure that everyone under them was having a good time. It's hard to see until it changes, but if you don't see anything obviously wrong, either you're not paying attention or someone or many someones have put a lot of work into making sure things run smoothly. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> Big companies v. startups startup-tradeoffs/ Thu, 17 Dec 2015 00:00:00 +0000 startup-tradeoffs/ <p>There's a meme that's been going around for a while now: you should join a startup because the money is better and the work is more technically interesting. <a href="http://paulgraham.com/wealth.html">Paul Graham says</a> that the best way to make money is to &quot;start or join a startup&quot;, which has been &quot;a reliable way to get rich for hundreds of years&quot;, and that you can &quot;compress a career's worth of earnings into a few years&quot;. Michael Arrington says that you'll <a href="https://www.jwz.org/blog/2011/11/watch-a-vc-use-my-name-to-sell-a-con/">become a part of history</a>. Joel Spolsky says that by joining a big company, you'll end up <a href="http://www.joelonsoftware.com/items/2008/05/01.html">playing foosball and begging people to look at your code</a>. 
Sam Altman says that if you join Microsoft, <a href="http://blog.samaltman.com/advice-for-ambitious-19-year-olds">you won't build interesting things and may not work with smart people</a>. They all claim that you'll learn more and have better options if you go work at a startup. Some of these links are a decade old now, but the same ideas are still circulating and those specific essays are still cited today.</p> <p></p> <p>Let's look at these points one by one.</p> <ol> <li>You'll earn much more money at a startup</li> <li>You won't do interesting work at a big company</li> <li>You'll learn more at a startup and have better options afterwards</li> </ol> <h3 id="1-earnings">1. Earnings</h3> <p>The numbers will vary depending on circumstances, but we can do a back of the envelope calculation and adjust for circumstances afterwards. Median income in the U.S. is <a href="https://en.wikipedia.org/wiki/Personal_income_in_the_United_States">about $30k/yr</a>. The somewhat bogus zeroth order lifetime earnings approximation I'll use is $30k * 40 = $1.2M. A new grad at Google/FB/Amazon with a lowball offer will have a total comp (salary + bonus + equity) of $130k/yr. <a href="http://www.glassdoor.com/Salary/Google-Senior-Software-Engineer-Salaries-E9079_D_KO7,31.htm">According to Glassdoor's current numbers</a>, someone who makes it to T5/senior at Google should have a total comp of around $250k/yr. These are fairly conservative numbers<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup>.</p> <p>Someone who's not particularly successful, but not particularly unsuccessful will probably make senior in five years<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">2</a></sup>. For our conservative baseline, let's assume that we'll never make it past senior, into the pay grades where compensation really skyrockets. We'd expect earnings (total comp including stock, but not benefits) to look something like:</p> <table> <thead> <tr> <th>Year</th> <th>Total Comp</th> <th>Cumulative</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>130k</td> <td>130k</td> </tr> <tr> <td>1</td> <td>160k</td> <td>290k</td> </tr> <tr> <td>2</td> <td>190k</td> <td>480k</td> </tr> <tr> <td>3</td> <td>220k</td> <td>700k</td> </tr> <tr> <td>4</td> <td>250k</td> <td>950k</td> </tr> <tr> <td>5</td> <td>250k</td> <td>1.2M</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> </tr> <tr> <td>9</td> <td>250k</td> <td>2.2M</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> </tr> <tr> <td>39</td> <td>250k</td> <td>9.7M</td> </tr> </tbody> </table> <p>Looks like it takes six years to gross a U.S. career's worth of income. If you want to adjust for the increased tax burden from earning a lot in a few years, add an extra year. Maybe add one to two more years if you decide to live in the bay or in NYC. If you decide not to retire, lifetime earnings for a 40 year career come in at almost $10M.</p> <p>One common, but false, objection to this is that your earnings will get eaten up by the cost of living in the bay area. Not only is this wrong, it's actually the opposite of correct. You can work at these companies from outside the bay area; most of these companies will pay you maybe 10% less if you work in a satellite office in a location where the cost of living is around the U.S. median, rather than at the headquarters in SV or Seattle (at least if you work in the US -- pay outside of the US is often much lower for reasons that don't really make sense to me). 
Market rate at smaller companies in these areas tends to be very low. When I interviewed in places like Portland and Madison, there was a 3x-5x difference between what most small companies were offering and what I could get at a big company in the same city. In places like Austin, where the market is a bit thicker, it was a 2x-3x difference. The difference in pay at 90%-ile companies is greater, not smaller, outside of the SF bay area.</p> <p>Another objection is that <a href="http://www.bls.gov/ooh/computer-and-information-technology/computer-programmers.htm">most programmers at most companies don't make this kind of money</a>. If, three or four years ago, you'd told me that there's a career track where it's totally normal to make $250k/yr after a few years, doing work that was fundamentally pretty similar to the work I was doing then, I'm not sure I would have believed it. No one I knew made that kind of money, except maybe the CEO of the company I was working at. Well him, and folks who went into medicine or finance.</p> <p>The only difference between then and now is that I took a job at a big company. When I took that job, the common story I heard at orientation was basically “I never thought I'd be able to get a job at Google, but a recruiter emailed me and I figured I might as well respond”. For some reason, women were especially likely to have that belief. Anyway, I've told that anecdote to multiple people who didn't think they could get a job at some trendy large company, who then ended up applying and getting in. And what you'll realize if you end up at a place like Google is that most of them are just normal programmers like you and me. If anything, I'd say that Google is, on average, less selective than the startup I worked at. When you only have to hire 100 people total, and half of them are folks you worked with as a technical fellow at one big company and then as an SVP at another one, you can afford to hire very slowly and be extremely selective. Big companies will hire more than 100 people per week, which means they can only be so selective.</p> <p>Despite the hype about how hard it is to get a job at Google/FB/wherever, your odds aren't that bad, and <a href="http://www.bloomberg.com/news/articles/2015-12-17/big-ipo-tiny-payout-for-many-startup-workers?trk=pulse-det-art_view_ext">they're certainly better than your odds striking it rich at a startup</a>, for which <a href="http://www.kalzumeus.com/2011/10/28/dont-call-yourself-a-programmer/">Patrick McKenzie has a handy cheatsheet</a>:</p> <blockquote> <p>Roll d100. (Not the right kind of geek? Sorry. rand(100) then.) <br> <strong>0~70</strong>: Your equity grant is worth nothing. <br> <strong>71~94</strong>: Your equity grant is worth a lump sum of money which makes you about as much money as you gave up working for the startup, instead of working for a megacorp at a higher salary with better benefits.<br> <strong>95~99</strong>: Your equity grant is a life changing amount of money. You won't feel rich — you're not the richest person you know, because many of the people you spent the last several years with are now richer than you by definition — but your family will never again give you grief for not having gone into $FAVORED_FIELD like a proper $YOUR_INGROUP.<br> <strong>100</strong>: You worked at the next Google, and are rich beyond the dreams of avarice. 
Congratulations.<br> Perceptive readers will note that 100 does not actually show up on a d100 or rand(100).</p> </blockquote> <p>For a more serious take that gives approximately the same results, <a href="https://80000hours.org/2014/05/how-much-do-y-combinator-founders-earn/">80000 hours finds that the average value of a YC founder after 5-9 years is $18M</a>. That sounds great! But there are a few things to keep in mind here. First, YC companies are unusually successful compared to the average startup. Second, in their analysis, 80000 hours notes that 80% of the money belongs to 0.5% of companies. Another 22% are worth enough that founder equity beats working for a big company, but that leaves 77.5% where that's not true.</p> <p>If you're an employee and not a founder, the numbers look a lot worse. If you're a very early employee you'd be quite lucky to get 1/10th as much equity as a founder. If we guess that 30% of YC startups fail before hiring their first employee, that puts the mean equity offering at $1.8M / .7 = $2.6M. That's low enough that for 5-9 years of work, you really need to be in the 0.5% for the payoff to be substantially better than working at a big company unless the startup is paying a very generous salary.</p> <p>There's a sense in which these numbers are too optimistic. Even if the company is successful and has a solid exit, there are plenty of things that can make your equity grant worthless. It's hard to get statistics on this, but anecdotally, this seems to be the common case in acquisitions.</p> <p>Moreover, the pitch that you'll only need to work for four years is usually untrue. To keep your <a href="//danluu.com/startup-options/">lottery ticket</a> until it pays out (or fizzles out), you'll probably have to stay longer. The most common form of equity at early stage startups is <a href="https://en.wikipedia.org/wiki/Incentive_stock_option">ISOs</a> that, by definition, expire at most 90 days after you leave. If you get in early, and leave after four years, you'll have to exercise your options if you want a chance at the lottery ticket paying off. If the company hasn't yet landed a large valuation, you might be able to get away with paying O(median US annual income) to exercise your options. If the company looks like a rocket ship and VCs are piling in, you'll have a massive tax bill, too, all for a lottery ticket.</p> <p>For example, say you joined company X early on and got options for 1% of the company when it was valued at $1M, so the cost of exercising all of your options is only $10k. Maybe you got lucky and four years later, the company is valued at $1B and your options have only been diluted to .5%. Great! For only $10k you can exercise your options and then sell the equity you get for $5M. Except that the company hasn't IPO'd yet, so if you exercise your options, you're stuck with a tax bill from making $5M, and by the time the company actually has an IPO, your stock could be worth anywhere from $0 to $LOTS. In some cases, you can sell your non-liquid equity for some fraction of its “value”, but my understanding is that it's getting more common for companies to add clauses that limit your ability to sell your equity before the company has an IPO. 
And even when your contract doesn't have a clause that prohibits you from selling your options on a secondary market, <a href="https://news.ycombinator.com/item?id=10705646">companies sometimes use backchannel communications to keep you from being able to sell your options</a>.</p> <p>Of course not every company is like this -- I hear that Dropbox has generously offered to buy out people's options at their current valuation for multiple years running and they now hand out RSUs instead of options, and Pinterest now gives people seven years to exercise their options after they leave -- but stories like that are uncommon enough that they're notable. The result is that people are incentivized to stay at most startups, even if they don't like the work anymore. From chatting with my friends at well regarded highly-valued startups, it sounds like many of them have a substantial fraction of zombie employees who are just mailing it in and waiting for a liquidity event. A common criticism of large companies is that they've got a lot of lifers who are mailing it in, but most large companies will let you leave any time after the first year and walk away with a pro-rated fraction of your equity package<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>. It's startups where people are incentivized to stick around even if they don't care about the job.</p> <p>At a big company, we have a career's worth of income in six years with high probability once you get your foot in the door. This isn't quite as good as the claim that you'll be able to do that in three or four years at a startup, but the risk at a big company is very low once you land the job. In startup land, we have a lottery ticket that appears to have something like a 0.5% chance of paying off for very early employees. Startups might have had a substantially better expected value when Paul wrote about this in 2004, but big company compensation has increased much faster than compensation at the median startup. We're currently in the best job market the world has ever seen for programmers. That's likely to change at some point. The relative returns on going the startup route will probably look a lot better once things change, but for now, saving up some cash while big companies hand it out like candy doesn't seem like a bad idea.</p> <p>One additional thing to note is that it's possible to get the upside of working at a startup by working at a big company and investing in startups. As of this update (mid-2020), it's common for companies to raise seed rounds at valuations of ~$10M and take checks as small as $5k. This means, for $100k, you can get as much of the company as you'd get if you joined as a very early employee, perhaps even employee #1 if you're not already very senior or recognized in the industry. But the stock you get by investing has better terms than employee equity not even considering vesting, and since your investment doesn't need to vest and you get it immediately and you typically have to stay for four years for your employee equity to vest, you actually only need to invest $25k/yr to get the equity benefit of being a very early employee. Not only can you get better risk adjusted returns (by diversifying), you'll also have much more income if you work at a big company and invest $25k/yr than if you work at a startup.</p> <h3 id="2-interesting-work">2. Interesting work</h3> <p>We've established that big companies will pay you decently. But there's more to life than making money. 
After all, you spend 40 hours a week working (or more). How interesting is the work at big companies? Joel claimed that large companies don't solve interesting problems and that Google is <a href="http://www.joelonsoftware.com/items/2008/05/01.html">paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex</a>, Sam Altman said something similar (but much more measured) about Microsoft, every third Michael O. Church comment is about how Google tricks a huge number of overqualified programmers into taking jobs that no one wants. Basically every advice thread on HN or reddit aimed at new grads will have multiple people chime in on how the experience you get at startups is better than the experience you'll get slaving away at a big company.</p> <p>The claim that big companies have boring work is too broad and absolute to even possibly be true. It depends on what kind of work you want to do. When I look at conferences where I find a high percentage of the papers compelling, the stuff I find to be the most interesting is pretty evenly split between big companies and academia, with the (very) occasional paper by a startup. For example, looking at ISCA this year, there's a 2:1 ratio of papers from academia to industry (and all of the industry papers are from big companies). But looking at the actual papers, a significant fraction of the academic papers are reproducing unpublished work that was done at big companies, sometimes multiple years ago. If I only look at the new work that I'm personally interested in, it's about a 1:1 ratio. There are some cases where a startup is working in the same area and not publishing, but that's quite rare and large companies do much more research that they don't publish. I'm just using papers as a proxy for having the kind of work I like. There are also plenty of areas where publishing isn't the norm, but large companies do the bulk of the cutting edge work.</p> <p>Of course YMMV here depending on what you want to do. I'm not really familiar with the landscape of front-end work, but it seems to me that big companies don't do the vast majority of the cutting edge non-academic work, the way they do with large scale systems. IIRC, there's an HN comment where Jonathan Tang describes how he created his own front-end work: he had the idea, told his manager about it, and got approval to make it happen. It's possible to do that kind of thing at a large company, but people often seem to have an easier time pursuing that kind of idea at a small company. And if your interest is in product, small companies seem like the better bet (though, once again, I'm pretty far removed from that area, so my knowledge is secondhand).</p> <p>But if you're interested in large systems, at both of my last two jobs, I've seen speculative research projects with 9 figure pilot budgets approved. In a pitch for one of the products, the pitch wasn't even that the project would make the company money. It was that a specific research area was important to the company, and that this infrastructure project would enable the company to move faster in that research area. Since the company is a $X billion dollar a year company, the project only needed to move the needle by a small percentage to be worth it. And so a research project whose goal was to speed up the progress of another research project was approved. 
Internally, this kind of thing is usually determined by politics, which some people will say makes it not worth it. But if you have a stomach for big company politics, startups simply don't have the resources to fund research problems that aren't core to their business. And many problems that would be hard problems at startups are curiosities at large companies.</p> <p>The flip side of this is that there are experiments that startups have a very easy time doing that established companies can't do. When I was at <a href="http://www.sigecom.org/">EC</a> a number of years ago, back when Facebook was still relatively young, the Google ad auction folks remarked to the FB folks that FB was doing the sort of experiments they'd do if they were small enough to do them, but they couldn't just change the structure of their ad auctions now that there was so much money flowing through their auctions. As with everything else we're discussing, there's a trade-off here and the real question is how to weight the various parts of the trade-off, not which side is better in all ways.</p> <p>The Michael O. Church claim is somewhat weaker: big companies have cool stuff to work on, but you won't be allowed to work on it until you've paid your dues working on boring problems. A milder phrasing of this is that getting to do interesting work is a matter of getting lucky and landing on an initial project you're interested in, but the key thing here is that most companies can give you a pretty good estimate about how lucky you're going to be. Google is notorious for its blind allocation process, and I know multiple people who ended up at MS because they had the choice between a great project at MS and blind allocation at Google, but even Google has changed this to some extent and it's not uncommon to be given multiple team options with an offer. In that sense, big companies aren't much different from startups. It's true that there are some startups that will basically only have jobs that are interesting to you (e.g., an early-stage distributed database startup if you're interested in building a distributed database). But at any startup that's bigger and less specialized, there's going to be work you're interested in and work you're not interested in, and it's going to be up to you to figure out if your offer lets you work on stuff you're interested in.</p> <p>Something to note is that if, per (1), you have the leverage to negotiate a good compensation package, you also have the leverage to negotiate for work that you want to do. We're in what is probably the best job market for programmers ever. That might change tomorrow, but until it changes, you have a lot of power to get work that you want.</p> <h3 id="3-learning-experience">3. Learning / Experience</h3> <p>What about the claim that experience at startups is more valuable? We don't have the data to do a rigorous quantitative comparison, but qualitatively, everything's on fire at startups, and you get a lot of breadth putting out fires, but you don't have the time to explore problems as deeply.</p> <p>I spent the first seven years of my career at a startup and I loved it. It was total chaos, which gave me the ability to work on a wide variety of different things and take on more responsibility than I would have gotten at a bigger company. 
I did everything from adding fault tolerance to an in-house distributed system to owning a quarter of a project that added ARM instructions to an x86 chip, creating both the fastest ARM chip at the time and the only chip capable of switching between ARM and x86 on the fly<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup>. That was a great learning experience.</p> <p>But I've had great learning experiences at big companies, too. At Google, my “starter” project was to join a previously one-person project, read the half finished design doc, provide feedback, and then start implementing. The impetus for the project was that people were worried that image recognition problems would require Google to double the number of machines it owns if a somewhat unlikely but not impossible scenario happened. That wasn't too much different from my startup experience, except for that bit about actually having a design doc, and that cutting infra costs could save billions a year instead of millions a year.</p> <p>Was that project a better or worse learning experience than the equivalent project at a startup? At a startup, the project probably would have continued to be a two-person show, and I would have learned all the things you learn when you bang out a project with not enough time and resources and do half the thing yourself. Instead, I ended up owning a fraction of the project and merely provided feedback on the rest, and it was purely a matter of luck (timing) that I had significant say on fleshing out the architecture. I definitely didn't get the same level of understanding I would have if I'd implemented half of it myself. On the other hand, the larger team meant that we actually had time to do things like design reviews and code reviews.</p> <p>If you care about impact, it's also easier to have a large absolute impact at a large company, due to the scale that big companies operate at. If I implemented what I'm doing now for a company the size of the startup I used to work for, it would have had an impact of maybe $10k/month. That's nothing to sneeze at, but it wouldn't have covered my salary. But the same thing at a big company is worth well over 1000x that. There are simply more opportunities to have high impact at large companies because they operate at a larger scale. The corollary to this is that startups are small enough that it's easier to have an impact on the company itself, even when the impact on the world is smaller in absolute terms. Nothing I do is make or break for a large company, but when I worked at a startup, it felt like what we did could change the odds of the company surviving.</p> <p>As far as having better options after having worked for a big company or having worked for a startup, if you want to work at startups, you'll probably have better options with experience at startups. If you want to work on the sorts of problems that are dominated by large companies, you're better off with more experience in those areas, at large companies. There's no right answer here.</p> <h3 id="conclusion">Conclusion</h3> <p>The compensation trade-off has changed a lot over time. When Paul Graham was writing in 2004, he used $80k/yr as a reasonable baseline for what “a good hacker” might make. Adjusting for inflation, <a href="http://data.bls.gov/cgi-bin/cpicalc.pl?cost1=80000&amp;year1=2004&amp;year2=2015">that's about $100k/yr now</a>. 
But the total comp for “a good hacker” is $250k+/yr, not even counting perks like free food and having really solid insurance. The trade-off has heavily tilted in favor of large companies.</p> <p>The interesting work trade-off has also changed a lot over time, but the change has been… bimodal. The existence of AWS and Azure means that ideas that would have taken millions of dollars in servers and operational expertise can be done with almost no fixed cost and low marginal costs. The scope of things you can do at an early-stage startup that were previously the domain of well funded companies is large and still growing. But at the same time, if you look at the work Google and MS are publishing at top systems conferences, startups are farther from being able to reproduce the scale-dependent work than ever before (and a lot of the most interesting work doesn't get published). Depending on what sort of work you're interested in, things might look relatively better or relatively worse at big companies.</p> <p>In any case, the reality is that the difference between types of companies is smaller than the differences between companies of the same type. That's true whether we're talking about startups vs. big companies or mobile gaming vs. biotech. This is recursive. The differences between different managers and teams at a company can easily be larger than the differences between companies. If someone tells you that you should work for a certain type of company, that advice is guaranteed to be wrong much of the time, whether that's a VC advocating that you should work for a startup or a Turing award winner telling you that you should work in a research lab.</p> <p>As for me, well, I don't know you and it doesn't matter to me whether you end up at a big company, a startup, or something in between. Whatever you decide, I hope you get to know your manager well enough to know that they have your back, your team well enough to know that you like working with them, and your project well enough to know that you find it interesting. Big companies have a class of dysfunction that's unusual at startups<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">5</a></sup> and startups have <a href="http://totalgarb.tumblr.com/tagged/startupbullshit">their own kinds of dysfunction</a>. You should figure out what the relevant tradeoffs are for you and what kind of dysfunction you want to sign up for.</p> <h3 id="related-advice-elsewhere">Related advice elsewhere</h3> <p><a href="//danluu.com/startup-options/">Myself on options vs. cash</a>.</p> <p><a href="https://web.archive.org/web/20141122200442/https://medium.com/@jocelyngoldfein/the-innovation-dead-end-ee25f8aa090f">Jocelyn Goldfein on big companies vs. small companies</a>.</p> <p><a href="http://www.kalzumeus.com/2011/10/28/dont-call-yourself-a-programmer/">Patrick McKenzie on providing business value vs. technical value</a>, with <a href="http://yosefk.com/blog/do-call-yourself-a-programmer-and-other-career-advice.html">a response from Yossi Kreinin</a>.</p> <p><a href="http://yosefk.com/blog/do-you-really-want-to-be-making-this-much-money-when-youre-50.html">Yossi Kreinin on passion vs. money</a>, and with a <a href="http://yosefk.com/blog/stock-options-a-balanced-approach.html">rebuttal to this post on regret minimization</a>.</p> <p><em>Update: The responses on this post have been quite divided. Folks at big companies usually agree, except that the numbers seem low to them, especially for new grads. 
This is true even for people who live in places which have a cost of living similar to the U.S. median. On the other hand, a lot of people vehemently maintain that the numbers in this post are basically impossible. A lot of people are really invested in the idea that they're making about as much as possible. If you've decided that making less money is the right trade-off for you, that's fine and I don't have any problem with that. But if you really think that you can't make that much money and you don't believe me, I recommend talking to one of the hundreds of thousands of engineers at one of the many large companies that pays well.</em></p> <p><em>Update 2: This post was originally written in 2015, when the $250k number would be conservative but not unreasonable for someone who's &quot;senior&quot; at Google or FB. If we look at the situation today in 2017, people <a href="https://twitter.com/danluu/status/942452212445405185">one entire band below that are regularly bringing in $250k</a> and a better estimate might be $300k or $350k. I'm probably more bearish on future dev compensation than most people, but things are looking pretty good for now and an event that wipes out big company dev compensation seems likely to do even worse things to the options packages for almost all existing startups.</em></p> <p><em>Update 3: Added note on being able to invest in startups in 2020. I didn't realize that this was possible without having a lot of wealth until around 2019.</em></p> <p><small> Thanks to Kelly Eskridge, Leah Hanson, Julia Evans, Alex Clemmer, Ben Kuhn, Malcolm Matalka, Nick Bergson-Shilcock, Joe Wilder, Nat Welch, Darius Bacon, Lindsey Kuper, Prabhakar Ragde, Pierre-Yves Baccou, David Turner, Oskar Thoren, Katerina Barone-Adesi, Scott Feeney, Ralph Corderoy, Ezekiel Benjamin Smithburg, @agentwaj, and Kyle Littler for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:C"><p>In particular, the Glassdoor numbers seem low for an average. I suspect that's because their average is weighed down by older numbers, while compensation has skyrocketed over the past seven years. The average numbers on Glassdoor don't even match the average numbers I heard from other people in my Midwestern satellite office in a large town two years ago, and the market has gone up sharply since then. More recently, on the upper end, I know someone fresh out of school who has a total comp of almost $250k/yr ($350k equity over four years, a $50k signing bonus, plus a generous salary). As is normal, he got a number of offers with varying compensation levels, and then Facebook came in and bid him up. The companies that are serious about competing for people matched the offers, and that was that. This included bids in Seattle and Austin that matched the bids in SV. If you're negotiating an offer, the thing that's critical isn't to be some kind of super genius. It's enough to be pretty good, know what the market is paying, and have multiple offers. This person was worth every penny, which is why he got his offers, but I know several people who are just as good who make half as much just because they only got a single offer and had no leverage.</p> <p>Anyway, the point of this footnote is just that the total comp for experienced engineers can go way above the numbers mentioned in the post. In the analysis that follows, keep in mind that I'm using conservative numbers and that an aggressive estimate for experienced engineers would be much higher. 
Just for example, at Google, senior is level 5 out of 11 on a scale that effectively starts at 3. At Microsoft, it's 63 on a weirdo scale that starts at 59 and goes to 70-something and then jumps up to 80 (or something like that, I always forget the details because the scale is so silly). Senior isn't a particularly high band, and people at senior often have total comp substantially greater than $250k/yr. Note that these numbers also don't include the above-market stock growth at trendy large companies in the past few years. If you've actually taken this deal, your RSUs have likely appreciated substantially.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:5"><p>This depends on the company. It's true at places like Facebook and Google, which make a serious effort to retain people. It's nearly completely untrue at places like IBM, National Instruments (NI), and Epic Systems, which don't even try. And it's mostly untrue at places like Microsoft, which tries, but in the most backwards way possible.</p> <p>Microsoft (and other mid-tier companies) will give you an ok offer and match good offers from other companies. That by itself is already problematic since it incentivizes people who are interviewing at Microsoft to also interview elsewhere. But the worse issue is that they do the same when retaining employees. If you stay at Microsoft for a long time and aren't one of the few people on the fast track to &quot;partner&quot;, your pay is going to end up severely below market, sometimes by as much as a factor of two. When you realize that, and you interview elsewhere, Microsoft will match external offers, but after getting underpaid for years, by hundreds of thousands or millions of dollars (depending on how long you've been there), the promise of making market rate for a single year and then being underpaid for the foreseeable future doesn't seem very compelling. The incentive structure appears as if it were designed to cause people who are between average and outstanding to leave. I've seen this happen with multiple people and I know multiple others who are planning to leave for this exact reason. Their managers are always surprised when this happens, but they shouldn't be; it's eminently predictable.</p> <p>The IBM strategy actually makes a lot more sense to me than the Microsoft strategy. You can save a lot of money by paying people poorly. That makes sense. But why bother paying a lot to get people in the door and then incentivizing them to leave? While it's true that the very top people I work with are well compensated and seem happy about it, there aren't enough of those people that you can rely on them for everything.</p> <a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li> <li id="fn:M">Some are better about this than others. Older companies, like MS, sometimes have yearly vesting, but a lot of younger companies, like Google, have much smoother vesting schedules once you get past the first year. And then there's Amazon, which backloads its offers, knowing that they have a high attrition rate and won't have to pay out much. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:S">Sadly, we ended up not releasing this for business reasons that came up later. 
<a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:N"><p>My very first interaction with an employee at big company X orientation was having that employee tell me that I couldn't get into orientation because I wasn't on the list. I had to ask how I could get on the list, and I was told that I'd need an email from my manager to get on the list. This was at around 7:30am because orientation starts at 7:30 and then runs for half a day for reasons no one seems to know (I've asked a lot of people, all the way up to VPs in HR). When I asked if I could just come back later in the day, I was told that if I couldn't get in within an hour I'd have to come back next week. I also asked if the fact that I was listed in some system as having a specific manager was evidence that I was supposed to be at orientation and was told that I had to be on the list. So I emailed my manager, but of course he didn't respond because who checks their email at 7:30am? Luckily, my manager had previously given me his number and told me to call if I ever needed anything, and being able to get into orientation and not have to show up at 7:30am again next week seemed like anything, so I gave him a call. Naturally, he asked to talk to the orientation gatekeeper; when I relayed that to the orientation guy, he told me that he couldn't talk on the phone -- you see, he can only accept emails and can't talk on the phone, not even just to clarify something. Five minutes into orientation, I was already flabbergasted. But, really, I should have considered myself lucky -- the other person who “wasn't on the list” didn't have his manager's phone number, and as far as I know, he had to come back the next week at 7:30am to get into orientation. I asked the orientation person how often this happens, and he told me “very rarely, only once or twice per week”.</p> <p>That experience was repeated approximately every half hour for the duration of orientation. I didn't get dropped from any other orientation stations, but when I asked, I found that every station had errors that dropped people regularly. My favorite was the station where someone was standing at the input queue, handing out a piece of paper. The piece of paper informed you that the machine at the station was going to give you an error with some instructions about what to do. Instead of following those instructions, you had to follow the instructions on the piece of paper when the error occurred.</p> <p>These kinds of experiences occupied basically my entire first week. Now that I'm past onboarding and onto the regular day-to-day, I have a surreal Kafka-esque experience a few times a week. And I've mostly figured out how to navigate the system (usually, knowing the right person and asking them to intervene solves the problem). What I find to be really funny isn't the actual experience, but that most people I talk to who've been here a while think that <a href="//danluu.com/wat/">it literally cannot be any other way</a> and that things could not possibly be improved. Curiously, people who have been here as long but are very senior tend to agree that the company has its share of big company dysfunction. I wish I had enough data on that to tell which way the causation runs (are people who are aware of the dysfunction more likely to last long enough to become very senior, or does being very senior give you a perspective that lets you see more dysfunction). 
Something that's even curiouser is that the company invests a fair amount of effort to give people the impression that things are as good as they could possibly be. At orientation, we got a version of history that made it sound as if the company had pioneered everything from the GUI to the web, with multiple claims that we have the best X in the world, even when X is pretty clearly mediocre. It's not clear to me what the company gets out of making sure that most employees don't understand what the downsides are in our own products and processes.</p> <p>Whatever the reason, the attitude that things couldn't possibly be improved isn't just limited to administrative issues. A friend of mine needed to find a function to do something that's a trivial one-liner on Linux, but that's considerably more involved on our OS. His first attempt was to use boost, but it turns out that the documentation for doing this on the OS we use is complicated enough that boost got this wrong and has had a bug in it for years. A couple of days and 72 lines of code later, he managed to figure out how to create a function to accomplish his goal. Since he wasn't sure if he was missing something, he forwarded the code review to two very senior engineers (one level below Distinguished Engineer). They weren't sure and forwarded it on to the CTO, who said that he didn't see a simpler way to accomplish the same thing in our OS with the APIs as they currently are.</p> <p>Later, my friend had a heated discussion with someone on the OS team, who maintained that the documentation on how to do this was very clear, and that it couldn't be clearer, nor could the API be any easier. This is despite this being so hard to do that boost has been wrong for seven years, and that two very senior engineers didn't feel confident enough to review the code and passed it up to a CTO.</p> <p>I'm going to stop here not because I'm out of incidents like this, but because a retelling of a half year of big company stories is longer than my blog. Not just longer than this post or any individual post, but longer than everything else on my blog combined, which is a bit over 100k words. Typical estimates for words per page vary between 250 and 1000, putting my rate of surreal experiences at somewhere between 100 and 400 pages every six months. I'm not sure this rate is inherently different from the rate you'd get <a href="http://totalgarb.tumblr.com/tagged/startupbullshit">at startups</a>, but there's a different flavor to the stories and you should have an idea of the flavor by this point.</p> <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> </ol> </div> Files are hard file-consistency/ Sat, 12 Dec 2015 00:00:00 +0000 file-consistency/ <p>I haven't used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, Eudora, and Outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can't, but none of the file corruption issues I've had have been from total disk failure. 
Why has my experience with desktop applications been so bad?</p> <p></p> <p>Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there's a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let's consider that first.</p> <h3 id="crash-consistency">Crash Consistency</h3> <p>Pillai et al. had a <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">paper</a> and <a href="https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_pillai.pdf">presentation</a> at OSDI '14 on exactly how hard it is to save data without corruption or data loss.</p> <p>Let's look at a simple example of what it takes to save data in a way that's robust against a crash. Say we have a file that contains the text <code>a foo</code> and we want to update the file to contain <code>a bar</code>. The pwrite function looks like it's designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try</p> <pre><code>pwrite([file], "bar", 3, 2) // write 3 bytes at offset 2 </code></pre> <p>What happens? If nothing goes wrong, the file will contain <code>a bar</code>, but if there's a crash during the write, we could get <code>a boo</code>, <code>a far</code>, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.</p> <p>If we want atomicity (so we either end up with <code>a foo</code> or <code>a bar</code> but nothing in between), one standard technique is to make a copy of the data we're about to change in an <a href="http://www.cburch.com/cs/340/reading/log/index.html">undo log</a> file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like</p> <pre><code>creat(/dir/log); write(/dir/log, "2,3,foo", 7); pwrite(/dir/orig, "bar", 3, 2); unlink(/dir/log); </code></pre> <p>This should allow recovery from a crash without data corruption via the undo log, at least if we're using <code>ext3</code> and we made sure to mount our drive with <code>data=journal</code>. But we're out of luck if, like most people, we're using the default<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup> -- with the default <code>data=ordered</code>, the <code>write</code> and <code>pwrite</code> syscalls can be reordered, causing the write to <code>orig</code> to happen before the write to the log, which defeats the purpose of having a log. We can fix that.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, foo"); fsync(/dir/log); // don't allow write to be reordered past pwrite pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); </code></pre> <p>That should force things to occur in the correct order, at least if we're using ext3 with <code>data=journal</code> or <code>data=ordered</code>. If we're using <code>data=writeback</code>, a crash during the <code>write</code> or <code>fsync</code> to the log can leave <code>log</code> in a state where the filesize has been adjusted for the write, but the data hasn't been written, which means that the log will contain random garbage. This is because with <code>data=writeback</code>, metadata is <a href="https://en.wikipedia.org/wiki/Journaling_file_system">journaled</a>, but data operations aren't, which means that data operations (like writing data to a file) aren't ordered with respect to metadata operations (like adjusting the size of a file for a write).</p> <p>We can fix that by adding a checksum to the log file when creating it. If the contents of <code>log</code> don't contain a valid checksum, then we'll know that we ran into the situation described above.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, [checksum], foo"); // add checksum to log file fsync(/dir/log); pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); </code></pre> <p>That's safe, at least on current configurations of ext3. But it's legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, [checksum], foo"); fsync(/dir/log); fsync(/dir); // fsync parent directory of log file pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); </code></pre> <p>That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, [checksum], foo"); fsync(/dir/log); fsync(/dir); pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); fsync(/dir); </code></pre> <p>That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that <code>fsync</code> actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires <code>fcntl(F_FULLFSYNC)</code> to flush to disk, and some versions of ext3 only flush to disk if the <a href="https://en.wikipedia.org/wiki/Inode">inode</a> changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.</p> <p>Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 -- to look better in benchmarks. Handling that is beyond the scope of this post, but the <a href="http://www.researchgate.net/profile/Vijay_Chidambaram/publication/220958003_Coerced_Cache_Eviction_and_discreet_mode_journaling_Dealing_with_misbehaving_disks/links/54d0f0190cf29ca811040c8a.pdf">Rajimwale et al. DSN '11 paper</a> and related work cover that issue.</p> 
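<p>To make the sequence above concrete, here's a rough sketch of that final protocol as actual syscalls. This isn't code from the paper or a vetted library -- it's a minimal illustration that assumes Linux, uses hypothetical names (<code>safe_overwrite</code>, <code>fsync_path</code>), substitutes FNV-1a for a real CRC, and leaves out the recovery path and most error handling:</p> <pre><code>/* Sketch of the undo-log protocol above: save the old bytes (plus a checksum)
 * in a log, fsync the log and its directory, overwrite the original file,
 * fsync it, then remove the log and fsync the directory again. */
#include &lt;fcntl.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

static uint32_t checksum(const char *p, size_t n)  /* FNV-1a; use a real CRC in practice */
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i &lt; n; i++) h = (h ^ (uint8_t)p[i]) * 16777619u;
    return h;
}

static int fsync_path(const char *path)            /* fsync a file or directory by path */
{
    int fd = open(path, O_RDONLY);
    if (fd &lt; 0) return -1;
    int r = fsync(fd);
    close(fd);
    return r;
}

/* Overwrite len bytes of orig at offset with buf, journaling the old bytes first. */
int safe_overwrite(const char *dir, const char *orig, const char *log,
                   const char *buf, size_t len, off_t offset)
{
    char old[4096];
    if (len &gt; sizeof(old)) return -1;
    int ofd = open(orig, O_RDWR);
    if (ofd &lt; 0) return -1;

    /* 1. Save the bytes we're about to clobber, plus a checksum, in the log. */
    if (pread(ofd, old, len, offset) != (ssize_t)len) return -1;
    int lfd = open(log, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (lfd &lt; 0) return -1;
    dprintf(lfd, "%lld,%zu,%u,", (long long)offset, len, (unsigned)checksum(old, len));
    write(lfd, old, len);
    if (fsync(lfd) != 0) return -1;      /* log contents are durable...            */
    close(lfd);
    if (fsync_path(dir) != 0) return -1; /* ...and so is the log's directory entry */

    /* 2. Now it's safe to modify the real file. */
    if (pwrite(ofd, buf, len, offset) != (ssize_t)len) return -1;
    if (fsync(ofd) != 0) return -1;
    close(ofd);

    /* 3. Discard the log and make the unlink durable. */
    unlink(log);
    return fsync_path(dir);
}
</code></pre> <p>Even this sketch isn't the whole story: on a restart, recovery code still has to read the log back, validate the checksum, and undo a partial overwrite, and, as just noted, the whole scheme rests on fsync actually flushing to disk.</p> 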
<h3 id="filesystem-semantics">Filesystem semantics</h3> <p>When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don't hold for specific filesystems. The authors are careful to note that they can only determine when properties don't hold -- if they don't find a violation of a property, that's not a guarantee that the property holds.</p> <p><img src="images/file-consistency/fs_properties.png" alt="Different filesystems have very different properties"></p> <p>Xs indicate that a property is violated. The atomicity properties are basically what you'd expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. 
The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they're using, and that running these filesystems on some disks won't give you single sector atomicity. The ordering properties are also pretty much what you'd expect from their names, e.g., an X in the “Overwrite -&gt; Any op” row means that an overwrite can be reordered with some operation.</p> <p>After they created a tool to test filesystem properties, they then created a tool to check if any applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.</p> <p><img src="images/file-consistency/program_bugs.png" alt="Everything is broken"></p> <p>The authors find issues with most of the applications tested, including things you'd really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of sqlite have a very deep understanding of these issues, but even that wasn't enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn't treated the same way, even though it's actually harder in a number of ways.</p> <p>Something to note here is that while btrfs's semantics aren't inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren't used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We'll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they're coding against what's legal in a POSIX filesystem.</p> <p>Hardware memory ordering semantics are usually <a href="//danluu.com/new-cpu-features/#memory-concurrency">well documented</a> in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here's <a href="http://man7.org/linux/man-pages/man5/ext4.5.html">the ext manpage</a> on its three data modes:</p> <blockquote> <p>journal: All data is committed into the journal prior to being written into the main filesystem.</p> <p>ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.</p> <p>writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. <strong>This is rumoured to be</strong> the highest-throughput option. 
It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.</p> </blockquote> <p>The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an <code>fsync</code> between the <code>write(/dir/log, "2, 3, foo")</code> and <code>pwrite(/dir/orig, "bar", 3, 2)</code> to prevent reordering, I don't think the necessity of the <code>fsync</code> is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn't rely on rumor.</p> <p>This isn't to say that filesystem semantics aren't documented anywhere. Between <a href="http://lwn.net/">lwn</a> and LKML, it's possible to get a good picture of how things work. But digging through all of that is hard enough that it's still quite common <a href="http://austingroupbugs.net/view.php?id=672">for there to be long, uncertain discussions on how things work</a>. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.</p> <p>When digging through archives, I've often seen a post from 2005 cited to back up the claim that OS X <code>fsync</code> is the same as Linux <code>fsync</code>, and that OS X <code>fcntl(F_FULLFSYNC)</code> is even safer than anything available on Linux. Even at the time, I don't think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so, Linux 2.6 with ext3 will do a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn't been specially configured with barriers off).</p> <p>Another issue is that you often also see exchanges <a href="http://lkml.iu.edu/hypermail/linux/kernel/0908.3/01481.html">like this one</a>:</p> <p><strong>Dev 1</strong>: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.<br> <strong>Dev 2</strong>: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.<br> <strong>Dev 1</strong>: Where is that documented?<br> <strong>Dev 2</strong>: linux-kernel mailing list archives.<br> <strong>Dev 3</strong>: Probably from some 6-8 years ago, in e-mail postings that I made.<br></p> <p>Where's this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). I don't mean to pick on filesystem devs. The fs devs whose posts I've read are quite polite compared to LKML's reputation; they generously spend a lot of their time responding to basic questions, and I'm impressed by how patient the expert fs devs are with askers. But it's hard for outsiders to trawl through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!</p> <p>In their OSDI 2014 talk, the authors of the paper we're discussing noted that when they reported bugs they'd found, developers would often respond “POSIX doesn't let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you've followed Kyle Kingsbury's Jepsen work, this may sound familiar, except devs respond with “filesystems don't do that” instead of “networks don't do that”. 
I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I'd be a bit surprised if I don't have at least one bug in this post.</p> <h3 id="filesystem-correctness">Filesystem correctness</h3> <p>We've already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what's involved. So far, we've assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via <a href="https://en.wikipedia.org/wiki/S.M.A.R.T.">SMART</a> or some other kind of monitoring. I'd always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.</p> <p>The <a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al. SOSP 05 paper</a> examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like <code>chdir</code>, <code>chroot</code>, <code>stat</code>, <code>open</code>, <code>write</code>, etc. to see what would happen.</p> <p>Between ext3, reiserfs, and NTFS, reiserfs is the best at handling errors and it seems to be the only filesystem where errors were treated as first class citizens during design. It's mostly consistent about propagating errors to the user on reads, and calling <code>panic</code> on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn't correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn't prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.</p> <p>Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you'd want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn't happen here&quot; in places where errors weren't handled.</p> <p>NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.</p> <p>The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.</p> <p>It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they'd be interested in knowing what things look like now, but that it's hard to do that kind of work in academia because grad students don't want to repeat work that's been done before, which is pretty reasonable given the incentives they face. 
Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.</p> <p>The <a href="http://usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. FAST '08 paper</a> is another one it would be great to see replicated today. That paper follows up on the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper -- code like the following</p> <pre><code>if (error) { printk("I have no idea how to handle this error\n"); } </code></pre> <p>is considered <em>not</em> throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn't depend on the error code returned from a function that returns an error code.</p> 
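<p>To make that criterion concrete, here's a tiny, hypothetical example (not from the paper). A checker like theirs would flag the first function, where nothing depends on the return value of <code>fsync</code>, but not the second, which branches on the error even though, much like the <code>printk</code> example above, it doesn't really recover either:</p> <pre><code>#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/* Flagged: fsync returns -1 on failure, but control flow here ignores it. */
void save_ignoring_error(int fd)
{
    fsync(fd);
}

/* Not flagged: the branch depends on the error code, even though this still
 * doesn't do anything useful about the failure. */
int save_checking_error(int fd)
{
    int err = fsync(fd);
    if (err != 0)
        fprintf(stderr, "fsync failed; I have no idea how to handle this\n");
    return err;
}

int main(void)
{
    save_ignoring_error(STDOUT_FILENO);
    return save_checking_error(STDOUT_FILENO) == 0 ? 0 : 1;
}
</code></pre> 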
<p>With that tool, they find that most filesystems drop a lot of error codes:</p> <table border="1"> <tr> <th></th> <th colspan="2">By % Broken</th> <th colspan="2">By Viol/Kloc</th> </tr> <tr> <th>Rank</th> <th>FS</th> <th>Frac.</th> <th>FS</th> <th>Viol/Kloc</th> </tr> <tr> <td>1</td> <td>IBM JFS</td> <td>24.4</td> <td>ext3</td> <td>7.2</td> </tr> <tr> <td>2</td> <td>ext3</td> <td>22.1</td> <td>IBM JFS</td> <td>5.6</td> </tr> <tr> <td>3</td> <td>JFFS v2</td> <td>15.7</td> <td>NFS Client</td> <td>3.6</td> </tr> <tr> <td>4</td> <td>NFS Client</td> <td>12.9</td> <td>VFS</td> <td>2.9</td> </tr> <tr> <td>5</td> <td>CIFS</td> <td>12.7</td> <td>JFFS v2</td> <td>2.2</td> </tr> <tr> <td>6</td> <td>MemMgmt</td> <td>11.4</td> <td>CIFS</td> <td>2.1</td> </tr> <tr> <td>7</td> <td>ReiserFS</td> <td>10.5</td> <td>MemMgmt</td> <td>2.0</td> </tr> <tr> <td>8</td> <td>VFS</td> <td>8.4</td> <td>ReiserFS</td> <td>1.8</td> </tr> <tr> <td>9</td> <td>NTFS</td> <td>8.1</td> <td>XFS</td> <td>1.4</td> </tr> <tr> <td>10</td> <td>XFS</td> <td>6.9</td> <td>NFS Server</td> <td>1.2</td> </tr> </table> <p>Comments they found next to ignored errors include: &quot;Should we pass any errors back?&quot;, &quot;Error, skip block and hope for the best.&quot;, &quot;There's no way of reporting error returned from ext3_mark_inode_dirty() to user space. So ignore it.&quot;, &quot;Note: todo: log error handler.&quot;, &quot;We can't do anything about an error here.&quot;, &quot;Just ignore errors at this point. There is nothing we can do except to try to keep going.&quot;, &quot;Retval ignored?&quot;, and &quot;Todo: handle failure.&quot;</p> <p>One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn't have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.</p> <h3 id="error-recovery">Error recovery</h3> <p>Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: can recovery tools robustly fix errors, and how often do errors occur? The <a href="http://usenix.org/legacy/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. OSDI 08 paper</a> looks at the first question and finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order . . . the file system can even be unmountable after”.</p> <p>At this point, we know that it's quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, that the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. 
How often do errors happen?</p> <h3 id="error-frequency">Error frequency</h3> <p>The <a href="http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/11.1.pdf">Bairavasundaram et al. SIGMETRICS '07 paper</a> found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors -- 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. <a href="https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/main.html">A follow-up study</a> looked at corruption and found that silent data corruption that was only detected by checksumming happened on .5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.</p> <p>It's also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.</p> <p>That's another study it would be nice to see replicated. <a href="https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/">Most studies on disk focus on the failure rate of the entire disk</a>, but if what you're worried about is data corruption, errors in non-failed disks are more worrying than disk failure, which is easy to detect and mitigate.</p> <h3 id="conclusion">Conclusion</h3> <p>Files are hard. <a href="//danluu.com/butler-lampson-1999/#parallelism">Butler Lampson has remarked</a> that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there's now decades of evidence that they were wrong. We've accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that's equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.</p> <p>Sqlite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn't exist yet, when no existing library is sufficient, you need to checksum your data since you will get silent errors and corruption. The only questions are whether you detect the errors, and whether your record format destroys only a single record when corruption happens or loses the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.</p> 
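<p>For a rough idea of what per-record checksumming can look like, here's a hypothetical record format where each record carries its own length and checksum, so a corrupted record can be detected and skipped without throwing away the rest of the file. The names and the FNV-1a stand-in for a real CRC are mine, not from any of the papers, and the fsync/undo-log machinery from earlier in the post is still needed on top of this:</p> <pre><code>/* Hypothetical framing: [u32 length][u32 checksum][payload] per record. */
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

static uint32_t fnv1a(const void *data, size_t n)   /* stand-in for a real CRC32 */
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i &lt; n; i++) h = (h ^ p[i]) * 16777619u;
    return h;
}

int write_record(FILE *f, const void *payload, uint32_t len)
{
    uint32_t hdr[2] = { len, fnv1a(payload, len) };
    if (fwrite(hdr, sizeof(hdr), 1, f) != 1) return -1;
    return fwrite(payload, len, 1, f) == 1 ? 0 : -1;
}

/* Returns the payload length, 0 at end of file, or -1 for a record that's
 * corrupt or too large -- the caller can skip it or fall back to a backup
 * instead of discarding the entire file. */
long read_record(FILE *f, void *buf, uint32_t max)
{
    uint32_t hdr[2];
    if (fread(hdr, sizeof(hdr), 1, f) != 1) return 0;
    if (hdr[0] == 0 || hdr[0] &gt; max) return -1;
    if (fread(buf, hdr[0], 1, f) != 1) return -1;
    return fnv1a(buf, hdr[0]) == hdr[1] ? (long)hdr[0] : -1;
}
</code></pre> <p>Whether you skip a bad record, restore it from a replica, or stop and alert is an application decision; the point is just that the blast radius of a flipped bit is one record rather than the whole database.</p> 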
<p>These studies also hammer home the point that <a href="//danluu.com/everything-is-broken/">conventional testing isn't sufficient</a>. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don't need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you'll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn't even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it's possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.</p> <p>Almost every software project I've seen has a lot of low-hanging testing fruit. Really basic <a href="//danluu.com/testing/">random testing</a>, <a href="//danluu.com/pl-troll/">static analysis</a>, and <a href="https://aphyr.com/tags/jepsen">fault injection</a> can pay for themselves in terms of dev time pretty much the first time you use them.</p> <h3 id="appendix">Appendix</h3> <p>I've probably covered less than 20% of the material in the papers I've referred to here. Here's a bit about some other neat things you can find in those papers, and others.</p> <p><a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">Pillai et al., OSDI '14</a>: this paper goes into much more detail about what's required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate what false assumptions are embedded in each trace.</p> <p><a href="http://static.usenix.org/legacy/events/fast12/tech/full_papers/Chidambaram.pdf">Chidambaram et al., FAST '12</a>: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.</p> <p><a href="https://www.researchgate.net/publication/220958003_Coerced_Cache_Eviction_and_Discreet_Mode_Journaling_Dealing_with_Misbehaving_Disks">Rajimwale et al. DSN '11</a>: you probably shouldn't use disks that ignore flush directives, but in case you do, here's a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance for this is quite bad.</p> <p><a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al. SOSP '05</a>: This has a lot more detail on filesystem responses to errors than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high reliability systems, it isn't particularly more reliable than the alternatives. 
Related material is covered further in <a href="http://research.cs.wisc.edu/adsl/Publications/pointer-dsn08.pdf">DSN '08</a>, <a href="http://research.cs.wisc.edu/adsl/Publications/trust-storagess06.pdf">StorageSS '06</a>, <a href="http://research.cs.wisc.edu/adsl/Publications/vmdep-dsn06.pdf">DSN '06</a>, <a href="http://research.cs.wisc.edu/adsl/Publications/parity-fast08.pdf">FAST '08</a>, and <a href="http://research.cs.wisc.edu/adsl/Publications/envyfs-usenix09.pdf">USENIX '09</a>, among others.</p> <p><a href="http://usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. FAST '08</a> : Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity regardless of how messy it made things.</p> <p><a href="http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/11.1.pdf">Bairavasundaram et al. SIGMETRICS '07</a>: There's a lot of information on disk error locality and disk error probability over time that isn't covered in this post. <a href="http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf">A followup paper in FAST08 has more details</a>.</p> <p><a href="http://usenix.org/legacy/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. OSDI '08</a>: This paper has a lot more detail about when fsck doesn't work. In a presentation, one of the authors mentioned that fsck is the only program that's ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like &quot;you dummy, you must have run fsck on a mounted disk&quot;, and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.</p> <p>Memory errors are beyond the scope of this post, but <a href="//danluu.com/why-ecc/">memory corruption</a> can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It's also possible to corrupt in memory pointers, which often results in something very bad happening. See the <a href="http://research.cs.wisc.edu/adsl/Publications/zfs-corruption-fast10.pdf">Zhang et al. FAST '10 paper</a> for more on how ZFS is affected by that. There's a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren't checksummed, and that memory errors can cause data corruption in real scenarios.</p> <p>The sqlite devs are serious about both <a href="https://www.sqlite.org/howtocorrupt.html">documentation</a> and <a href="https://www.sqlite.org/testing.html">testing</a>. If I wanted to write a reliable desktop application, I'd start by reading the sqlite docs and then talking to some of the core devs. 
If I wanted to write a reliable distributed application I'd start by getting a job at Google and then reading the design docs and <a href="//danluu.com/postmortem-lessons/">postmortems</a> for GFS, Colossus, Spanner, etc. J/k, but not really.</p> <p>We haven't looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as <a href="https://sibylfs.github.io/">SibylFS</a>.</p> <p>This list isn't intended to be exhaustive. It's just a list of things I've read that I think are interesting.</p> <p><strong>Update</strong>: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that's probably the most common comment I've gotten on this post. If you think this solves the problem, I'm going to ask you to pause for five seconds and consider the problems this might have.</p> <p>The main problems this has are:</p> <ul> <li>rename isn't atomic on crash. POSIX says that rename is atomic, but this only applies to normal operation, not to crashes.</li> <li>even if the technique worked, the performance is very poor</li> <li>how do you handle hardlinks?</li> <li>metadata can be lost; this can sometimes be preserved, under some filesystems, with ioctls, but now you have filesystem-specific code just for the non-crash case</li> <li>etc.</li> </ul> <p>The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they're explicitly warned that people tend to underestimate this problem!</p> <p><strong><a href="//danluu.com/filesystem-errors/">This post reproduces some of the results from these papers on modern filesystems as of 2017</a>.</strong></p> <p><strong><a href="//danluu.com/deconstruct-files/">This talk (transcript) contains a number of newer results and discusses hardware issues in more detail</a>.</strong></p> <p><small> Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, Evan Jones, and Jason Petersen for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D">Turns out <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-ext3.html">some commercially supported distros</a> only support <code>data=ordered</code>. Oh, and when I said <code>data=ordered</code> was the default, that's only the case before 2.6.30. After 2.6.30, there's a config option, <code>CONFIG_EXT3_DEFAULTS_TO_ORDERED</code>. If that's not set, the default becomes <code>data=writeback</code>. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:A"><p>Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.</p> <p>That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. 
Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.</p> <p>There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Why use ECC? why-ecc/ Fri, 27 Nov 2015 00:00:00 +0000 why-ecc/ <p>Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes <a rel="nofollow" href="http://blog.codinghorror.com/to-ecc-or-not-to-ecc/">a case against using ECC memory</a>. My read is that his major points are:</p> <ol> <li>Google didn't use ECC when they built their servers in 1999</li> <li>Most RAM errors are hard errors and not soft errors</li> <li>RAM errors are rare because hardware has improved</li> <li>If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is &quot;awfully enterprisey&quot;</li> </ol> <p></p> <p>Let's take a look at these arguments one by one:</p> <h2 id="1-google-didn-t-use-ecc-in-1999">1. Google didn't use ECC in 1999</h2> <p>Not too long after Google put these non-ECC machines into production, they realized this was a serious error and not worth the cost savings. If you think cargo culting what Google does is a good idea because it's Google, here are some things you might do:</p> <h4 id="a-put-your-servers-into-shipping-containers">A. Put your servers into shipping containers</h4> <p>Articles are still written today about what a great idea this is, even though this was an experiment at Google that was deemed unsuccessful. Turns out, even Google's experiments don't always succeed. In fact, their propensity for “moonshots” in the early days meant that they had more failed experiments than most companies. Copying their failed experiments isn't a particularly good strategy.</p> <h4 id="b-cause-fires-in-your-own-datacenters">B. Cause fires in your own datacenters</h4> <p>Part of the post talks about how awesome these servers are:</p> <blockquote> <p>Some people might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive commodity hardware would shape today's internet. I felt right at home when I saw this server; it's exactly what I would have done in the same circumstances</p> </blockquote> <p>The last part of that is true. But the first part has a grain of truth, too. When Google started designing their own boards, one generation had a regrowth<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup> issue that caused a non-zero number of fires.</p> <p>BTW, if you click through to Jeff's post and look at the photo that the quote refers to, you'll see that the boards have a lot of flex in them. That caused problems and was fixed in the next generation. You can also observe that the cabling is quite messy, which also caused problems, and was also fixed in the next generation. There were other problems as well. <abbr title="When someone looks in the answer key and says, 'I would've come up with that', that's often plausible when their answer is perfect. 
But when they say that after seeing a specific imperfect answer, it's a bit less plausible that they'd reproduce the exact same mistakes">Jeff's argument here appears to be that, if he were there at the time, he would've seen the exact same opportunities that early Google engineers did, and since Google did this, it must've been the right thing even if it doesn't look like it. But a number of things that make it look like not the right thing actually made it not the right thing.</abbr></p> <h4 id="c-make-servers-that-injure-your-employees">C. Make servers that injure your employees</h4> <p>One generation of Google servers had infamously sharp edges, giving them the reputation of being made of “razor blades and hate”.</p> <h4 id="d-create-weather-in-your-datacenters">D. Create weather in your datacenters</h4> <p>From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.</p> <p>Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you're going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not <a href="//danluu.com/butler-lampson-1999/">something that was done in 1999</a>.</p> <p>When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries. The actual failure mode here is instructive. I often hear that it's ok to ignore ECC on these machines because it's ok to have errors in individual results. But even when you can tolerate occasional errors, ignoring errors means that you're exposing yourself to total corruption, unless you've done a very careful analysis to make sure that a single error can only contaminate a single result. In research that's been done on filesystems, it's been repeatedly shown that despite making valiant attempts at creating systems that are robust against a single error, it's extremely hard to do so and basically every heavily tested filesystem can have a massive failure from a single error (<a href="//danluu.com/file-consistency/">see the output of Andrea and Remzi's research group at Wisconsin if you're curious about this</a>). I'm not knocking filesystem developers here. They're better at that kind of analysis than 99.9% of programmers. It's just that this problem has been repeatedly shown to be hard enough that humans cannot effectively reason about it, and automated tooling for this kind of analysis is still far from a push-button process. In their book on <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024">warehouse scale computing</a>, Google discusses error correction and detection, and ECC is cited as their slam dunk case for when it's obvious that you should use hardware error correction<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup>.</p> <p>Google has great infrastructure. From what I've heard of the infra at other large tech companies, Google's sounds like the best in the world. But that doesn't mean that you should copy everything they do. 
Even if you look at their good ideas, it doesn't make sense for most companies to copy them. They <a href="//danluu.com/intel-cat/">created a replacement for Linux's work stealing scheduler that uses both hardware run-time information and static traces to allow them to take advantage of new hardware in Intel's server processors that lets you dynamically partition caches between cores</a>. If used across their entire fleet, that could easily save Google more money in a week than stackexchange has spent on machines in their entire history. Does that mean you should copy Google? No, not unless you've already captured all the lower hanging fruit, which includes things like making sure that your core infrastructure is written in highly optimized C++, not Java or (god forbid) Ruby. And the thing is, for the vast majority of companies, writing in a language that imposes a 20x performance penalty is a totally reasonable decision.</p> <h2 id="2-most-ram-errors-are-hard-errors">2. Most RAM errors are hard errors</h2> <p>The case against ECC quotes <a href="http://selse.org//images/selse_2012/Papers/selse2012_submission_4.pdf">this section of a study on DRAM errors</a> (the bolding is Jeff's):</p> <blockquote> <p>Our study has several main findings. First, we find that approximately <strong>70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient faults.</strong> Second, we find that large multi-bit faults, such as faults that affects an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we find that almost 5% of DRAM failures affect board-level circuitry such as data (DQ) or strobe (DQS) wires. Finally, we find that chipkill functionality reduced the system failure rate from DRAM faults by 36x.</p> </blockquote> <p>This seems to betray a lack of understanding of the implications of this study, as this quote doesn't sound like an argument against ECC; it sounds like an argument for &quot;chipkill&quot;, a particular class of ECC. Putting that aside, Jeff's post points out that hard errors are twice as common as soft errors, and then mentions that they run memtest on their machines when they get them. First, a 2:1 ratio isn't so large that you can just ignore soft errors. Second the post implies that Jeff believes that hard errors are basically immutable and can't surface after some time, which is incorrect. You can think of electronics as wearing out just the same way mechanical devices wear out. The mechanisms are different, but the effects are similar. In fact, if you compare reliability analysis of chips vs. other kinds of reliability analysis, you'll find they often use the same families of distributions to model failures. And, if hard errors were immutable, they would generally get caught in testing by the manufacturer, who can catch errors much more easily than consumers can because they have hooks into circuits that let them test memory much more efficiently than you can do in your server or home computer. Third, Jeff's line of reasoning implies that ECC can't help with detection or correction of hard errors, which is not only incorrect but directly contradicted by the quote.</p> <p>So, how often are you going to run memtest on your machines to try to catch these hard errors, and how much data corruption are you willing to live with? One of the key uses of ECC is not to correct errors, but to signal errors so that hardware can be replaced before silent corruption occurs. 
No one's going to consent to shutting down everything on a machine every day to run memtest (that would be more expensive than just buying ECC memory), and even if you could convince people to do that, it won't catch as many errors as ECC will.</p> <p>When I worked at a company that owned about 1000 machines, we noticed that we were getting strange consistency check failures, and after maybe half a year we realized that the failures were more likely to happen on some machines than others. The failures were quite rare, maybe a couple times a week on average, so it took a substantial amount of time to accumulate the data, and more time for someone to realize what was going on. Without knowing the cause, analyzing the logs to figure out that the errors were caused by single bit flips (with high probability) was also non-trivial. We were lucky that, as a side effect of the process we used, the checksums were calculated in a separate process, on a different machine, at a different time, so that an error couldn't corrupt the result and propagate that corruption into the checksum. If you merely try to protect yourself with in-memory checksums, there's a good chance you'll perform a checksum operation on the already corrupted data and compute a valid checksum of bad data unless you're doing some really fancy stuff with calculations that carry their own checksums (and if you're that serious about error correction, you're probably using ECC regardless). Anyway, after completing the analysis, we found that memtest couldn't detect any problems, but that replacing the RAM on the bad machines caused a one to two order of magnitude reduction in error rate. Most services don't have the kind of checksumming we had; those services will simply silently write corrupt data to persistent storage and never notice problems until a customer complains.</p> <h2 id="3-due-to-advances-in-hardware-manufacturing-errors-are-very-rare">3. Due to advances in hardware manufacturing, errors are very rare</h2> <p>Jeff says</p> <blockquote> <p>I do seriously question whether ECC is as operationally critical as we have been led to believe [for servers], and I think the data shows modern, non-ECC RAM is already extremely reliable ... Modern commodity computer parts from reputable vendors are amazingly reliable. And their trends show from 2012 onward essential PC parts have gotten more reliable, not less. (I can also vouch for the improvement in SSD reliability as we have had zero server SSD failures in 3 years across our 12 servers with 24+ drives ...</p> </blockquote> <p>and quotes a study.</p> <p>The data in the post isn't sufficient to support this assertion. Note that since RAM usage has been increasing and continues to increase at a fast exponential rate, RAM failures would have to decrease at a greater exponential rate to actually reduce the incidence of data corruption. Furthermore, as chips continue to shrink, features get smaller, making the kind of wearout issues discussed in “2” more common. 
For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next generation DRAM as things continue to shrink.</p> <p>The <a href="http://selse.org//images/selse_2012/Papers/selse2012_submission_4.pdf">2012 study that Atwood quoted</a> has this graph on corrected errors (a subset of all errors) on ten randomly selected failing nodes (6% of nodes had at least one failure):</p> <p><img src="images/why-ecc/one_month_ecc_errors.png"></p> <p>We're talking between 10 and 10k errors for a typical node that has a failure, and that's a cherry-picked study from a post that's arguing that you don't need ECC. Note that the nodes here only have 16GB of RAM, which is an order of magnitude less than modern servers often have, and that this was on an older process node that was less vulnerable to noise than we are now. For anyone who's used to dealing with reliability issues and just wants to know the FIT rate, the study finds a FIT rate of between 0.057 and 0.071 faults per Mbit (which, contra Atwood's assertion, is not a shockingly low number). If you take the most optimistic FIT rate, .057, and do the calculation for a server without much RAM (here, I'm using 128GB, since the servers I see nowadays typically have between 128GB and 1.5TB of RAM), you get an expected value of .057 * 1000 * 1000 * 8760 / 1000000000 = .5 faults per year per server (128GB is roughly a million Mbit, a FIT is one fault per billion device-hours, and there are 8760 hours in a year). Note that this is for faults, not errors. From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month. Another thing to note is that there are multiple nodes that don't have errors at the start of the study but develop errors later on. So, in fact, the cherry-picked study that Jeff links contradicts Jeff's claim about reliability.</p> <p>Sun/Oracle famously ran into this a number of decades ago. Transistors and DRAM capacitors were getting smaller, much as they are now, and memory usage and caches were growing, much as they are now. Between having smaller transistors that were less resilient to transient upset as well as more difficult to manufacture, and having more on-chip cache, the vast majority of server vendors decided to add ECC to their caches. Sun decided to save a few dollars and skip the ECC. The direct result was that a number of Sun customers reported sporadic data corruption. It took Sun multiple years to spin a new architecture with ECC cache, and Sun made customers sign an NDA to get replacement chips. Of course there's no way to cover up this sort of thing forever, and when it came up, Sun's reputation for producing reliable servers took a permanent hit, much like the time they tried to <a href="anon-benchmark/">cover up poor performance results by introducing a clause into their terms of service disallowing benchmarking</a>.</p> <p>Another thing to note here is that when you're paying for ECC, you're not just paying for ECC, you're paying for parts (CPUs, boards) that have been qual'd more thoroughly. You can easily see this with disk failure rates, and I've seen many people observe this in their own private datasets. In terms of public data, I believe Andrea and Remzi's group had a SIGMETRICS paper a few years back that showed that SATA drives were 4x more likely than SCSI drives to have disk read failures, and 10x more likely to have silent data corruption. This relationship held true even with drives from the same manufacturer. 
There's no particular reason to think that the SCSI interface should be more reliable than the SATA interface, but it's not about the interface. It's about buying a high-reliability server part vs. a consumer part. Maybe you don't care about disk reliability in particular because you checksum everything and can easily detect disk corruption, but there are some kinds of corruption that are harder to detect.</p> <p>[2024 update, almost a decade later]: looking at this retrospectively, we can see that Jeff's assertion that commodity parts are reliable (&quot;modern commodity computer parts from reputable vendors are amazingly reliable&quot;) is still not true. Looking at real-world user data from Firefox, <a href="https://fosstodon.org/@gabrielesvelto/112401643131904845">Gabriele Svelto estimated that approximately 10% to 20% of all Firefox crashes were due to memory corruption</a>. Various game companies that track this kind of thing also report that a significant fraction of user crashes appear to be due to data corruption, although I don't have an estimate from any of those companies handy. A more direct argument is that if you talk to folks at big companies that run a lot of ECC memory and look at the rate of ECC errors, there are quite a few errors detected by ECC memory despite ECC memory typically having a lower error rate than random non-ECC memory. This kind of argument is frequently made (the argument was detailed above a decade ago, and when I looked at this fairly recently while working at Twitter, there had been no revolution in memory technology that reduced the need for ECC relative to the rates discussed in papers a decade ago), but it often doesn't resonate with folks who say things like &quot;well, those bits probably didn't matter anyway&quot;, &quot;most memory ends up not getting read&quot;, etc. Looking at real-world crashes and noting that the amount of silent data corruption should be expected to be much higher than the rate of crashes seems to resonate with people who aren't excited by looking at raw FIT rates in datacenters.</p> <h2 id="4-if-ecc-were-actually-important-it-would-be-used-everywhere-and-not-just-servers">4. If ECC were actually important, it would be used everywhere and not just servers.</h2> <p><a href="cocktail-ideas/">One way to rephrase this is as a kind of cocktail party efficient markets hypothesis. This can't be important, because if it was, we would have it</a>. Of course this is incorrect and there are many things that would be beneficial to consumers that we don't have, such as <a href="car-safety/">cars that are designed to be safe instead of just getting the maximum score in crash tests</a>. Looking at this with respect to the server and consumer markets, this argument can be rephrased as “If this feature were actually important for servers, it would be used in non-servers”, which is incorrect. A primary driver of what's available in servers vs. non-servers is what can be added that buyers of servers will pay a lot for, to allow for price discrimination between server and non-server parts. This is actually one of the more obnoxious problems facing large cloud vendors — hardware vendors are able to jack up the price on parts that have server features because the features are much more valuable in server applications than in desktop applications. 
Most home users don't mind, giving hardware vendors a mechanism to extract more money out of people who buy servers while still providing cheap parts for consumers.</p> <p>Cloud vendors often have enough negotiating leverage to get parts at cost, but that only works where there's more than one viable vendor. Some of the few areas where there aren't any viable competitors include CPUs and GPUs. There have been a number of attempts by other CPU vendors to get into the server market, but each attempt so far has been fatally flawed in a way that made it obvious from an early stage that the attempt was doomed (and these are often 5 year projects, so that's a lot of time to spend on a doomed project). The Qualcomm effort has been getting a lot of hype, but when I talk to folks I know at Qualcomm they all tell me that the current chip is basically for practice, since Qualcomm needed to learn how to build a server chip from all the folks they poached from IBM, and that the next chip is the first chip that has any hope of being competitive. I have high hopes for Qualcomm as well as an ARM effort to build good server parts, but those efforts are still a ways away from bearing fruit.</p> <p>The near total unsuitability of current ARM (and POWER) options (not including hypothetical variants of Apple's impressive ARM chip) for most server workloads in terms of performance per TCO dollar is a bit of a tangent, so I'll leave that for another post, but the point is that Intel has the market power to make people pay extra for server features, and they do so. Additionally, some features are genuinely more important for servers than for mobile devices with a few GB of RAM and a power budget of a few watts that are expected to randomly crash and reboot periodically anyway.</p> <h2 id="conclusion">Conclusion</h2> <p>Should you buy ECC RAM? That depends. For servers, it's probably a good bet considering the cost, although it's hard to really do a cost/benefit analysis because it's really hard to figure out the cost of silent data corruption, or the cost of having some risk of burning half a year of developer time tracking down intermittent failures only to find that they were caused by using non-ECC memory.</p> <p>For normal desktop use, I'm pro-ECC, but if you don't have <a href="https://www.reddit.com/r/programming/comments/adoux/coding_horror_and_blogsstackoverflowcom/">regular backups</a> set up, doing backups probably has a better ROI than ECC. But once you have the absolute basics set up, there's a fairly strong case for ECC for consumer machines. For example, if you have backups without ECC, you can easily write corrupt data into your primary store and replicate that corrupt data into backup. But speaking more generally, big companies running datacenters are probably better set up to detect data corruption and more likely to have error correction at higher levels that allows them to recover from data corruption than consumers, so the case for consumers is arguably stronger than it is for servers, where the case is strong enough that it's generally considered a no-brainer. A major reason consumers don't generally use ECC isn't that it isn't worth it for them; it's that they just have no idea how to attribute crashes and data corruption when they happen. Once you start doing this, as Google and other large companies do, it's immediately obvious that ECC is worth the cost even when you have multiple levels of error correction operating at higher levels. 
This is analogous to <a href="deconstruct-files/">what we see with files</a>, where big tech companies write software for their datacenters that's much better at dealing with data corruption than the consumer software big tech companies write (and this is often true even within the same company). To the user, the cost of having their web app corrupt their data isn't all that different from when their desktop app corrupts their data; the difference is that when their web app corrupts data, it's clearer that it's the company's fault, which changes the incentives for companies.</p> <h3 id="appendix-security">Appendix: security</h3> <p>If you allow any sort of code execution, even sandboxed execution, there are attacks <a href="https://en.wikipedia.org/wiki/Row_hammer">like rowhammer</a> which can allow users to cause data corruption and there have been instances where this has allowed for privilege escalation. ECC doesn't completely mitigate the attack, but it makes it much harder.</p> <p><small> Thanks to Prabhakar Ragde, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe anti-thanks) to Leah for convincing me that I should write up this off-the-cuff verbal comment as a blog post. Apologies for any errors, the lack of references, and the stilted prose; this is basically a transcription of half of a conversation and I haven't explained terms, provided references, or checked facts in the level of detail that I normally do. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R"><p>One of the funnier examples of this that I can think of, at least to me, is the magical self-healing fuse. Although there are many implementations, you can think of a fuse on a chip as basically a resistor. If you run some current through it, you should get a connection. If you run a lot of current through it, you'll heat up the resistor and eventually destroy it. This is commonly used to fuse off features on chips, or to do things like set the clock rate, with the idea being that once a fuse is blown, there's no way to unblow the fuse.</p> <p>Once upon a time, there was a semiconductor manufacturer that rushed their manufacturing process a bit and cut the tolerances a bit too fine in one particular process generation. After a few months (or years), the connection between the two ends of the fuse could regrow and cause the fuse to unblow. If you're lucky, the fuse will be something like the high-order bit of the clock multiplier, which will basically brick the chip if changed. If you're not lucky, it will be something that results in silent data corruption.</p> <p>I heard about problems in that particular process generation from that manufacturer from multiple people at different companies, so this wasn't an isolated thing. When I say this is funny, I mean that it's funny when you hear this story at a bar. It's maybe less funny when you discover, after a year of testing, that some of your chips are failing because their fuse settings are nonsensical, and you have to respin your chip and delay the release for 3 months. 
BTW, this fuse regrowth thing is another example of a class of error that can be mitigated with ECC.</p> <p>This is not the issue that Google had; I only mention this because a lot of people I talk to are surprised by the ways in which hardware can fail.</p> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:P"><p>In case you don't want to dig through the whole book, most of the relevant passage is:</p> <p>In a system that can tolerate a number of failures at the software level, the minimum requirement made to the hardware layer is that its faults are always detected and reported to software in a timely enough manner as to allow the software infrastructure to contain it and take appropriate recovery actions. It is not necessarily required that hardware transparently corrects all faults. This does not mean that hardware for such systems should be designed without error correction capabilities. Whenever error correction functionality can be offered within a reasonable cost or complexity, it often pays to support it. It means that if hardware error correction would be exceedingly expensive, the system would have the option of using a less expensive version that provided detection capabilities only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much more difficult because it means that every software component would be burdened with the need to check its own correct execution. At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google's Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked—instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> What's worked in Computer Science: 1999 v. 2015 butler-lampson-1999/ Mon, 23 Nov 2015 00:00:00 +0000 butler-lampson-1999/ <p>In 1999, Butler Lampson gave a talk about the <a href="http://research.microsoft.com/pubs/68591/computersystemsresearch.pdf">past and future of “computer systems research”</a>. 
Here are his opinions from 1999 on &quot;what worked&quot;.</p> <table> <thead> <tr> <th>Yes</th> <th>Maybe</th> <th>No</th> </tr> </thead> <tbody> <tr> <td>Virtual memory</td> <td>Parallelism</td> <td>Capabilities</td> </tr> <tr> <td>Address spaces</td> <td>RISC</td> <td>Fancy type systems</td> </tr> <tr> <td>Packet nets</td> <td>Garbage collection</td> <td>Functional programming</td> </tr> <tr> <td>Objects / subtypes</td> <td>Reuse</td> <td>Formal methods</td> </tr> <tr> <td>RDB and SQL</td> <td></td> <td>Software engineering</td> </tr> <tr> <td>Transactions</td> <td></td> <td>RPC</td> </tr> <tr> <td>Bitmaps and GUIs</td> <td></td> <td>Distributed computing</td> </tr> <tr> <td>Web</td> <td></td> <td>Security</td> </tr> <tr> <td>Algorithms</td> <td></td> <td></td> </tr> </tbody> </table> <p></p> <p><br> Basically everything that was a Yes in 1999 is still important today. Looking at the Maybe category, we have:</p> <h3 id="parallelism">Parallelism</h3> <p>This is, unfortunately, still a Maybe. Between the <a href="https://en.wikipedia.org/wiki/Dennard_scaling">end of Dennard scaling</a> and the continued demand for compute, chips now expose plenty of parallelism to the programmer. Concurrency has gotten much easier to deal with, but really extracting anything close to the full performance available isn't much easier than it was in 1999.</p> <p>In 2009, <a href="https://channel9.msdn.com/Shows/Going+Deep/E2E-Erik-Meijer-and-Butler-Lampson-Abstraction-Security-Embodiment">Erik Meijer and Butler Lampson talked about this</a>, and Lampson's comment was that when they came up with threading, locks, and condition variables at PARC, they thought they were creating something that programmers could use to take advantage of parallelism, but that they now have decades of evidence that they were wrong. Lampson further remarks that to do parallel programming, what you need to do is put all your parallelism into a little box and then have a wizard go write the code in that box. Not much has changed since 2009.</p> <p>Also, note that I'm using the same criteria to judge all of these. Whenever you say something doesn't work, someone will drop in and say that, no wait, here's a PhD that demonstrates that someone has once done this thing, or here are nine programs that demonstrate that Idris is, in fact, widely used in large scale production systems. I take Lampson's view, which is that if the vast majority of programmers are literally incapable of using a certain class of technologies, that class of technologies has probably not succeeded.</p> <p>On the parallelism front, Intel <a href="//danluu.com/intel-cat/">recently added features that make it easier to take advantage of trivial parallelism by co-scheduling multiple applications on the same machine without interference</a>, but outside of a couple big companies, no one's really taking advantage of this yet. They also recently added hardware support for transactional memory, but it's still not clear how much it helps with usability when designing large scale systems.</p> <h3 id="risc-danluu-com-risc-definition"><a href="//danluu.com/risc-definition/">RISC</a></h3> <p>If this was a Maybe in 1999, it's certainly a No now. In the 80s and 90s a lot of folks, probably the majority of folks, believed RISC was going to take over the world and x86 was doomed. In 1991, Apple, IBM, and Motorola got together to create PowerPC (PPC) chips that were going to demolish Intel in the consumer market. 
They opened the Somerset facility for chip design, and collected a lot of their best folks for what was going to be a world-changing effort. At the upper end of the market, DEC's Alpha chips were getting twice the performance of Intel's, and their threat to the workstation market was serious enough that Microsoft ported Windows NT to the Alpha. DEC started a project to do dynamic translation from x86 to Alpha; at the time the project started, the projected performance of x86 basically running in emulation on Alpha was substantially better than native x86 on Intel chips.</p> <p>In 1995, Intel released the Pentium Pro. At the time, it had better workstation integer performance than anything else out there, including much more expensive chips targeted at workstations, and its floating point performance was within a factor of 2 of high-end chips. That immediately destroyed the viability of the mainstream Apple/IBM/Moto PPC chips, and in 1998 IBM pulled out of the Somerset venture<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">1</a></sup> and everyone gave up on really trying to produce desktop class PPC chips. Apple continued to sell PPC chips for a while, but they had to cook up bogus benchmarks to make the chips look even remotely competitive. By the time DEC finished their dynamic translation efforts, x86 in translation was barely faster than native x86 in floating point code, and substantially slower in integer code. While that was a very impressive technical feat, it wasn't enough to convince people to switch from x86 to Alpha, which killed DEC's attempts to move into the low-end workstation and high-end PC market.</p> <p>In 1999, high-end workstations were still mostly RISC machines, and supercomputers were a mix of custom chips, RISC chips, and x86 chips. Today, Intel dominates the workstation market with x86, and the supercomputer market has also moved towards x86. Other than POWER, RISC ISAs were mostly wiped out (like PA-RISC) or managed to survive by moving to the low-margin embedded market (like MIPS), which wasn't profitable enough for Intel to pursue with any vigor. You can see a kind of instruction set arbitrage that MIPS and ARM have been able to take advantage of because of this. Cavium and ARM will sell you a network card that offloads a lot of processing to the NIC, with a bunch of cheap MIPS or ARM processors, respectively, on board. The low-end processors aren't inherently better at processing packets than Intel CPUs; they're just priced low enough that Intel won't compete on price because they don't want to cannibalize their higher margin chips with sales of lower margin chips. MIPS and ARM have no such concerns because MIPS flunked out of the high-end processor market and ARM has yet to get there. If the best thing you can say about RISC chips is that they manage to exist in areas where the profit margins are too low for Intel to care, that's not exactly great evidence of a RISC victory. That Intel ceded the low end of the market might seem ironic considering Intel's origins, but they've always been aggressive about moving upmarket (they did the same thing when they transitioned from DRAM to SRAM to flash, ceding the barely profitable DRAM market to their competitors).</p> <p>If there's any threat to x86, it's ARM, and it's their business model that's a threat, not their ISA. 
And as for their ISA, ARM's biggest inroads into mobile and personal computing came with ARMv7 and earlier ISAs, which aren't really more RISC-like than x86<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>. In the areas in which they dominate, their &quot;modern&quot; RISC-y ISA, ARMv8, is hopeless and will continue to be hopeless for years, and they'll continue to dominate with their non-RISC ISAs.</p> <p>In retrospect, the reason RISC chips looked so good in the 80s was that you could fit a complete high-performance RISC microprocessor onto a single chip, which wasn't true of x86 chips at the time. But as we got more transistors, this mattered less.</p> <p>It's possible to nitpick RISC being a no by saying that modern processors translate x86 ops into RISC micro-ops internally, but if you listened to talk at the time, people thought that having an external RISC ISA would be so much lower overhead that RISC would win, which has clearly not happened. Moreover, modern chips also do micro-op fusion in order to fuse operations into decidedly un-RISC-y operations. A clean RISC ISA is a beautiful thing. I sometimes re-read Dick Sites's <a href="http://www.hpl.hp.com/hpjournal/dtj/vol4num4/vol4num4art1.pdf">explanation of the Alpha design</a> just to admire it, but it turns out beauty isn't really critical for the commercial success of an ISA.</p> <h3 id="garbage-collection">Garbage collection</h3> <p>This is a huge Yes now. Every language that's become successful since 1999 has GC and is designed for all normal users to use it to manage all memory. In five years, Rust or D might make that last sentence untrue, but even if that happens, GC will still be in the Yes category.</p> <h3 id="reuse">Reuse</h3> <p>Yes, I think, although I'm not 100% sure what Lampson was referring to here. Lampson said that reuse was a maybe because it sometimes works (for UNIX filters, OS, DB, browser) but was also flaky (for OLE/COM). There are now widely used substitutes for OLE; service oriented architectures also seem to fit his definition of reuse.</p> <p>Looking at the No category, we have:</p> <h3 id="capabilities">Capabilities</h3> <p>Yes. Widely used on mobile operating systems.</p> <h3 id="fancy-type-systems">Fancy type systems</h3> <p>It depends on what qualifies as a fancy type system, but if “fancy” means something at least as fancy as Scala or Haskell, this is a No. That's even true if you relax the standard to an ML-like type system. Boy, would I love to be able to do everyday programming in an ML (F# seems particularly nice to me), but we're pretty far from that.</p> <p>In 1999, C and C++ were mainstream, along with maybe Visual Basic and Pascal, with Java on the rise. And maybe Perl, but at the time most people thought of it as a scripting language, not something you'd use for &quot;real&quot; development. PHP, Python, Ruby, and JavaScript all existed, but were mostly used in small niches. Back then, Tcl was one of the most widely used scripting languages, and it wasn't exactly widely used. Now, PHP, Python, Ruby, and JavaScript are not only more mainstream than Tcl, but more mainstream than C and C++. C# is probably the only other language in the same league as those languages in terms of popularity, and Go looks like the only language that's growing fast enough to catch up in the foreseeable future. 
Since 1999, we have a bunch of dynamic languages, and a few languages with type systems that are specifically designed not to be fancy.</p> <p>Maybe I'll get to use F# for non-hobby projects in another 16 years, but things don't look promising.</p> <h3 id="functional-programming">Functional programming</h3> <p>I'd lean towards Maybe on this one, although this is arguably a No. Functional languages are still quite niche, but functional programming ideas are now mainstream, at least for the HN/reddit/twitter crowd.</p> <p>You might say that I'm being too generous to functional programming here because I have a soft spot for immutability. That's fair. In 1982, <a href="http://research.microsoft.com/en-us/um/people/simonpj/Papers/other-authors/morris-real-programming.pdf">James Morris wrote</a>:</p> <blockquote> <p>Functional languages are unnatural to use; but so are knives and forks, diplomatic protocols, double-entry bookkeeping, and a host of other things modern civilization has found useful. Any discipline is unnatural, in that it takes a while to master, and can break down in extreme situations. That is no reason to reject a particular discipline. The important question is whether functional programming is unnatural the way Haiku is unnatural or the way Karate is unnatural.</p> <p>Haiku is a rigid form of poetry in which each poem must have precisely three lines and seventeen syllables. As with poetry, writing a purely functional program often gives one a feeling of great aesthetic pleasure. It is often very enlightening to read or write such a program. These are undoubted benefits, but real programmers are more results-oriented and are not interested in laboring over a program that already works.</p> <p>They will not accept a language discipline unless it can be used to write programs to solve problems the first time -- just as Karate is occasionally used to deal with real problems as they present themselves. A person who has learned the discipline of Karate finds it directly applicable even in bar-room brawls where no one else knows Karate. Can the same be said of the functional programmer in today's computing environments? No.</p> </blockquote> <p>Many people would make the same case today. I don't agree, but that's a matter of opinion, not a matter of fact.</p> <h3 id="formal-methods">Formal methods</h3> <p>Maybe? Formal methods have had high impact in a few areas. Model checking is omnipresent in chip design. Microsoft's <a href="http://research.microsoft.com/pubs/70038/tr-2004-08.pdf">driver verification tool</a> has probably had more impact than all formal chip design tools combined, clang now has a fair amount of static analysis built in, and so on and so forth. But, formal methods are still quite niche, and the vast majority of developers don't apply formal methods.</p> <h3 id="software-engineering">Software engineering</h3> <p>No. In 1995, David Parnas gave a talk at ICSE (the premier software engineering conference) about the fact that even the ICSE papers that won their “most influential paper award” (including two of Parnas's papers) had <a href="//danluu.com/empirical-pl/">very little impact on industry</a>.</p> <p>Basically all of Parnas's criticisms are still true today. One of his suggestions, that there should be distinct conferences for researchers and for practitioners, has been taken up, but there's not much cross-pollination between academic conferences like ICSE and FSE and practitioner-focused conferences like StrangeLoop and PyCon.</p> <h3 id="rpc">RPC</h3> <p>Yes. 
In fact, RPCs are now so widely used that I've seen multiple &quot;RPCs considered harmful&quot; talks.</p> <h3 id="distributed-systems">Distributed systems</h3> <p>Yes. These are so ubiquitous that startups with zero distributed systems expertise regularly use distributed systems provided by Amazon or Microsoft, and it's totally fine. The systems aren't perfect and there are some infamous downtime incidents, but if you compare the bit error rate of random storage from 1999 to something like EBS or Azure Blob Storage, distributed systems don't look so bad.</p> <h3 id="security">Security</h3> <p>Maybe? As with formal methods, a handful of projects with very high real world impact get a lot of mileage out of security research. But security still isn't a first class concern for most programmers.</p> <h3 id="conclusion">Conclusion</h3> <p>What's worked in computer systems research?</p> <table> <thead> <tr> <th>Topic</th> <th>1999</th> <th>2015</th> </tr> </thead> <tbody> <tr> <td>Virtual memory</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Address spaces</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Packet nets</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Objects / subtypes</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>RDB and SQL</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Transactions</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Bitmaps and GUIs</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Web</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Algorithms</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Parallelism</td> <td>Maybe</td> <td>Maybe</td> </tr> <tr> <td>RISC</td> <td>Maybe</td> <td>No</td> </tr> <tr> <td>Garbage collection</td> <td>Maybe</td> <td>Yes</td> </tr> <tr> <td>Reuse</td> <td>Maybe</td> <td>Yes</td> </tr> <tr> <td>Capabilities</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Fancy type systems</td> <td>No</td> <td>No</td> </tr> <tr> <td>Functional programming</td> <td>No</td> <td>Maybe</td> </tr> <tr> <td>Formal methods</td> <td>No</td> <td>Maybe</td> </tr> <tr> <td>Software engineering</td> <td>No</td> <td>No</td> </tr> <tr> <td>RPC</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Distributed computing</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Security</td> <td>No</td> <td>Maybe</td> </tr> </tbody> </table> <p><br> Not only is every Yes from 1999 still Yes today, eight of the Maybes and Nos were upgraded, and only one was downgraded. And on top of that, there are a lot of topics like neural networks that weren't even worth adding to the list as a No but are an unambiguous Yes today.</p> <p>In 1999, I was taking the SATs and applying to colleges. Today, I'm not really all that far into my career, and the landscape has changed substantially; many previously impractical academic topics are now widely used in industry. I probably have twice again as much time until the end of my career and <a href="//danluu.com/infinite-disk/">things are changing faster now than they were in 1999</a>. After reviewing Lampson's 1999 talk, I'm much more optimistic about research areas that haven't yielded much real-world impact (yet), like capability based computing and fancy type systems. It seems basically impossible to predict what areas will become valuable over the next thirty years.</p> <h3 id="correction">Correction</h3> <p>This post originally had Capabilities as a No in 2015. 
In retrospect, I think that was a mistake and it should have been a Yes due to use on mobile.</p> <p><small> Thanks to Seth Holloway, Leah Hanson, Ian Whitlock, Lindsey Kuper, Chris Ball, Steven McCarthy, Joe Wilder, David Wragg, Sophia Wisdom, and Alex Clemmer for comments/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:I">I know a fair number of folks who were relocated to Somerset from the east coast by IBM, because they later ended up working at a company I worked at. It's interesting to me that software companies don't have the same kind of power over employees, and can't just insist that employees move to a new facility they're creating in some arbitrary location. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:A">I once worked for a company that implemented both x86 and ARM decoders (I'm guessing it was the first company to do so for desktop class chips), and we found that our ARM decoder was physically larger and more complex than our x86 decoder. From talking to other people who've also implemented both ARM and x86 frontends, this doesn't seem to be unusual for high performance implementations. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Infinite disk infinite-disk/ Sun, 01 Nov 2015 00:00:00 +0000 infinite-disk/ <p>Hardware performance “obviously” affects software performance and affects how software is optimized. For example, the fact that caches are multiple orders of magnitude faster than RAM means that <a href="http://suif.stanford.edu/papers/lam-asplos91.pdf">blocked array accesses</a> give better performance than repeatedly striding through an array.</p> <p>Something that's occasionally overlooked is that hardware performance also has profound implications for system design and architecture. <a href="https://gist.github.com/jboner/2841832">Let's look at this table of latencies that's been passed around since 2012</a>:</p> <p></p> <pre><code>Operation                               Latency (ns)       (ms)
L1 cache reference                               0.5 ns
Branch mispredict                                  5 ns
L2 cache reference                                 7 ns
Mutex lock/unlock                                 25 ns
Main memory reference                            100 ns
Compress 1K bytes with Zippy                   3,000 ns
Send 1K bytes over 1 Gbps network             10,000 ns    0.01 ms
Read 4K randomly from SSD                    150,000 ns    0.15 ms
Read 1 MB sequentially from memory           250,000 ns    0.25 ms
Round trip within same datacenter            500,000 ns    0.5  ms
Read 1 MB sequentially from SSD            1,000,000 ns    1    ms
Disk seek                                 10,000,000 ns   10    ms
Read 1 MB sequentially from disk          20,000,000 ns   20    ms
Send packet CA-&gt;Netherlands-&gt;CA    150,000,000 ns  150    ms
</code></pre> <p>Consider the latency of a disk seek (10ms) vs. the latency of a round-trip within the same datacenter (.5ms). The round-trip latency is so much lower than the seek time of a disk that we can disaggregate storage and distribute it anywhere in the datacenter without noticeable performance degradation, giving applications the appearance of having infinite disk space without any appreciable change in performance. This fact was behind the rise of distributed filesystems like GFS within the datacenter over the past two decades, and various network attached storage schemes long before.</p> <p>However, doing the same thing on a 2012-era commodity network with SSDs doesn't work. The time to read a page on an SSD is 150us, vs. a 500us round-trip time on the network. That's still a noticeable performance improvement over spinning metal disk, but it's over 4x slower than local SSD.</p> <p>But here we are in 2015. Things have changed. 
Disks have gotten substantially faster. Enterprise NVRAM drives can do a 4k random read in around 15us, an order of magnitude faster than 2012 SSDs. Networks have improved even more. It's now relatively common to employ a low-latency user-mode networking stack, which drives round-trip latencies for a 4k transfer down to 10s of microseconds. That's fast enough to disaggregate SSD and give applications access to infinite SSD. It's not quite fast enough to disaggregate high-end NVRAM, but <a href="https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet">RDMA</a> can handle that.</p> <p><img src="images/infinite-disk/rdma_vs_tcp.png" width="651" height="147"></p> <p>RDMA drives latencies down another order of magnitude, putting network latencies below NVRAM access latencies by enough that we can disaggregate NVRAM. Note that these numbers are for an unloaded network with no congestion -- these numbers will get substantially worse under load, but they're illustrative of what's possible. This isn't exactly new technology: HPC folks have been using RDMA over InfiniBand for years, but InfiniBand networks are expensive enough that they haven't seen a lot of uptake in datacenters. Something that's new in the past few years is the ability to run RDMA over Ethernet. This turns out to be non-trivial; both <a href="http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf">Microsoft</a> and <a href="http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p537.pdf">Google</a> have papers in this year's SIGCOMM on how to do this without running into the numerous problems that occur when trying to scale this beyond a couple nodes. But it's possible, and we're approaching the point where companies that aren't ridiculously large are going to be able to deploy this technology at scale<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>.</p> <p>However, while it's easy to say that we should use disaggregated disk because the ratio of network latency to disk latency has changed, it's not as easy as just taking any old system and throwing it on a fast network. If we take a 2005-era distributed filesystem or distributed database and throw it on top of a fast network, it won't really take advantage of the network. That 2005 system is going to have assumptions like the idea that it's fine for an operation to take 500ns, because how much can 500ns matter? But it matters a lot when your round-trip network latency is only a few times more than that and applications written in a higher-latency era are often full of &quot;careless&quot; operations that burn hundreds of nanoseconds at a time. Worse yet, designs that are optimal at higher latencies create overhead as latency decreases. For example, with 1ms latency, adding local caching is a huge win and 2005-era high-performance distributed applications will often rely heavily on local caching. But when latency drops below 1us, the caching that was a huge win in 2005 is often not just pointless, but actually counter-productive overhead.</p> <p>Latency hasn't just gone down in the datacenter. Today, I get about 2ms to 3ms latency to YouTube. YouTube, Netflix, and a lot of other services put a very large number of boxes close to consumers to provide high-bandwidth low-latency connections. A side effect of this is that any company that owns one of these services has the capability of providing consumers with infinite disk that's only slightly slower than normal disk. 
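</p> <p>To make the latency comparison concrete, here's a small sketch that just adds a network round trip to each local access time and reports the slowdown. The 2012 figures come from the table above; the 2015 round-trip figures (tens of microseconds for user-mode networking, single-digit microseconds for RDMA) and the 2-3ms consumer edge latency are the rough numbers quoted in the text, so treat them as illustrative rather than measured:</p> <pre><code># All latencies in microseconds; approximate figures from the discussion above.
cases = [
    ("2012 disk seek, datacenter RTT",          10000,  500),
    ("2012 SSD 4KB read, datacenter RTT",         150,  500),
    ("2015 SSD 4KB read, user-mode networking",   150,   20),   # assumed ~20us RTT
    ("2015 NVRAM 4KB read, RDMA",                  15,    5),   # assumed ~5us RTT
    ("consumer disk via a 2-3ms edge RTT",      10000, 2500),
]
for name, local_us, rtt_us in cases:
    remote_us = local_us + rtt_us
    print(f"{name}: local {local_us}us, remote {remote_us}us, {remote_us / local_us:.1f}x slower")
</code></pre> <p>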
There are a variety of reasons this hasn't happened yet, but it's basically inevitable that this will eventually happen. If you look at what major cloud providers are paying for storage, their <a href="https://en.wikipedia.org/wiki/Cost_of_goods_sold">COGS</a> of providing safely replicated storage is or will become lower than the retail cost to me of un-backed-up unreplicated local disk on my home machine.</p> <p>It might seem odd that cloud storage can be cheaper than local storage, but large cloud vendors have a lot of leverage. The price for the median component they buy that isn't an Intel CPU or an Nvidia GPU is staggeringly low compared to the retail price. Furthermore, most people don't access the vast majority of their files most of the time. If you look at the throughput of large HDs nowadays, it's not even really possible to do so quickly. A <a href="http://hdd.userbenchmark.com/Toshiba-DT01ACA300-3TB/Rating/2735">typical consumer 3TB HD</a> has an average throughput of 155MB/s, making the time to read the entire drive 3e12 / 155e6 seconds = 1.9e4 seconds = 5 hours and 22 minutes (sketched in code below). And people don't even access their disks at all most of the time! And when they do, their access patterns result in much lower throughput than you get when reading the entire disk linearly. This means that the vast majority of disaggregated storage can live in cheap cold storage. For a neat example of this, <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-balakrishnan.pdf">the Balakrishnan et al. Pelican OSDI 2014 paper</a> demonstrates that if you build out cold storage racks such that only 8% of the disk can be accessed at any given time, you can get a substantial cost savings. A tiny fraction of storage will have to live at the edge, for the same reason that a tiny fraction of YouTube videos are cached at the edge. In some sense, the economics are worse than for YouTube, since any particular chunk of data is much less likely to be shared, but at the rate that edge compute/storage is scaling up, that's unlikely to be a serious objection in a decade.</p> <p>The most common counterargument to disaggregated disk, both inside and outside of the datacenter, is bandwidth costs. But bandwidth costs have been declining exponentially for decades and continue to do so. Since 1995, we've seen datacenter NIC speeds go from 10Mb to 40Gb, with 50Gb and 100Gb just around the corner. This increase has been so rapid that, outside of huge companies, almost no one has re-architected their applications to properly take advantage of the available bandwidth. Most applications can't saturate a 10Gb NIC, let alone a 40Gb NIC. There's literally more bandwidth than people know what to do with. The situation outside the datacenter hasn't evolved quite as quickly, but even so, I'm paying $60/month for 100Mb, and if the trend of the last two decades continues, we should see another 50x increase in bandwidth per dollar over the next decade. It's not clear if the cost structure makes cloud-provided disaggregated disk for consumers viable today, but the current trends of implacably decreasing bandwidth cost mean that it's inevitable within the next five years.</p> <p>One thing to be careful about is that <a href="http://www.researchgate.net/profile/David_Rogers2/publication/265543548_Aggregation_and_disaggregation_techniques_and_methodology_in_optimization/links/54357d8e0cf2643ab986792e.pdf">just because we can disaggregate something, it doesn't mean that we should</a>. 
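</p> <p>Here's the drive-throughput arithmetic from above as a quick sketch. It computes a best case, since real access patterns get much less than full sequential throughput:</p> <pre><code># Time to read an entire 3TB consumer drive at its ~155MB/s average sequential rate.
capacity_bytes = 3e12
throughput_bytes_per_sec = 155e6
seconds = capacity_bytes / throughput_bytes_per_sec
print(f"{seconds:.0f} seconds, about {seconds / 3600:.1f} hours")  # roughly 19355 s, about 5.4 hours
</code></pre> <p>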
There was a <a href="http://web.eecs.umich.edu/~twenisch/papers/hpca12-disagg.pdf">fascinating paper by Lim et al. at HPCA 2012 on disaggregated RAM</a> where they built out disaggregated RAM by connecting RAM through the backplane. While we have the technology to do this, which has the dual advantages of allowing us to provision RAM at a lower per-unit cost and also getting better utilization out of provisioned RAM, this doesn't seem to provide a performance per dollar savings at an acceptable level of performance, at least so far<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>.</p> <p>The change in relative performance of different components causes fundamental changes in how applications should be designed. It's not sufficient to just profile our applications and eliminate the hot spots. To get good performance (or good performance per dollar), we sometimes have to step back, re-examine our assumptions, and rewrite our systems. There's a lot of talk about how hardware improvements are slowing down, which usually refers to improvements in CPU performance. That's true, but there are plenty of other areas that are undergoing rapid change, which means that applications that care about either performance or cost efficiency need to change. GPUs, hardware accelerators, storage, and networking are all evolving more rapidly than ever.</p> <h4 id="update">Update</h4> <p>Microsoft seems to disagree with me on this one. OneDrive has been moving in the opposite direction. They got rid of infinite disk, lowered quotas for non-infinite storage tiers, and changed their sync model in a way that makes this less natural. I spent maybe an hour writing this post. They probably have a team of Harvard MBAs who've spent 100x that much time discussing the move away from infinite disk. I wonder what I'm missing here. Average utilization was 5GB per user, which is practically free. A few users had a lot of data, but if someone uploads, say, 100TB, you can put most of that on tape. Access times on tape are glacial -- seconds for the arm to get the cartridge and put it in the right place, and tens of seconds to seek to the right place on the tape. But someone who uploads 100TB is basically using it as archival storage anyway, and you can mask most of that latency for the most common use cases (uploading libraries of movies or other media). If the first part of the file doesn't live on tape, and the user starts playing a movie that lives on tape, the movie can easily play for a couple minutes off of warmer storage while the tape access gets queued up. You might say that it's not worth it to spend the time it would take to build a system like that (perhaps two engineers working for six months), but you're already going to want a system that can mask the latency to disk-based cold storage for large files. Adding another tier on top of that isn't much additional work.</p> <h4 id="update-2">Update 2</h4> <p>It's happening. In April 2016, Dropbox announced that they're offering &quot;Dropbox Infinite&quot;, which lets you access your entire Dropbox regardless of the amount of local disk you have available. The inevitable trend happened, although I'm a bit surprised that it wasn't Google that did it first since they have better edge infrastructure and almost certainly pay less for storage. 
In retrospect, maybe that's not surprising, though -- Google, Microsoft, and Amazon all treat providing user-friendly storage as a second class citizen, while Dropbox is all-in on user friendliness.</p> <p><small> Thanks to Leah Hanson, bbrazil, Kamal Marhubi, mjn, Steve Reinhardt, Joe Wilder, and Jesse Luehrs for comments/corrections/additions that resulted in edits to this. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">If you notice that when you try to reproduce the Google result, you get instability, you're not alone. The paper leaves out the special sauce required to reproduce the result. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:S"><p>If your goal is to get better utilization, the poor man's solution today is to give applications access to unused RAM via RDMA on a best effort basis, in a way that's vaguely kinda sorta analogous to <a href="//danluu.com/intel-cat/">Google's Heracles work</a>. You might say, wait a second: you could make that same argument for disk, but in fact the cheapest way to build out disk is to build out very dense storage blades full of disks, not to just use RDMA to access the normal disks attached to standard server blades; why shouldn't that be true for RAM? For an example of what it looks like when disks, I/O, and RAM are underprovisioned compared to CPUs, <a href="http://www.wired.com/2012/10/data-center-servers/">see this article where a Mozilla employee claims that it's fine to have 6% CPU utilization because those machines are busy doing I/O</a>. Sure, it's fine, if you don't mind paying for CPUs you're not using instead of building out blades that have the correct ratio of CPUs to storage, but those idle CPUs aren't free.</p> <p>If the ratio of RAM to CPU we needed were analogous to the ratio of disk to CPU that we need, it might be cheaper to disaggregate RAM. But, while <a href="http://web.eecs.umich.edu/~twenisch/papers/isca09-disaggregate.pdf">the need for RAM is growing faster than the need for compute</a>, we're still not yet at the point where datacenters have a large number of cores sitting idle due to lack of RAM, the same way we would have cores sitting idle due to lack of disk if we used standard server blades for storage. A Xeon-EX can handle 1.5TB of RAM per socket. It's common to put two sockets in a 1/2U blade nowadays, and for the vast majority of workloads, it would be pretty unusual to try to cram more than 6TB of RAM into the 4 sockets you can comfortably fit into 1U.</p> <p>That being said, the issue of disaggregated RAM is still an open question, and <a href="https://www.usenix.org/conference/srecon15europe/program/presentation/hoffman">some folks</a> are a lot more confident about its near-term viability <a href="http://arxiv.org/abs/1503.01416">than others</a>.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> Why Intel added cache partitioning intel-cat/ Sun, 04 Oct 2015 00:00:00 +0000 intel-cat/ <p><a href="http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf">Typical server utilization is between 10% and 50%. Google has demonstrated 90% utilization <em>without impacting latency SLAs</em></a>. <a href="https://what-if.xkcd.com/63/">Xkcd estimated that Google owns 2 million machines</a>. If you estimate an amortized total cost of <a href="http://www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024">$4k per machine per year</a>, that's $8 billion per year. 
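</p> <p>As a quick sanity check on that estimate, here's the arithmetic in code, using the 2 million machine guess and the $4k/machine/year amortized cost cited above, plus the 10%-50% typical utilization figure quoted above. The idle-capacity numbers at the end are a rough extension of the argument and ignore the fact that some costs scale with load rather than with machine count:</p> <pre><code>machines = 2_000_000           # xkcd's estimate for Google
cost_per_machine_year = 4_000  # amortized dollars per machine per year
fleet_cost = machines * cost_per_machine_year
print(f"fleet cost: ${fleet_cost / 1e9:.0f}B per year")  # $8B per year

# If 90% utilization is achievable, a fleet running at 10-50% is doing work
# that could, to first order, be done by a much smaller fleet.
for utilization in (0.10, 0.50):
    idle_fraction = 1 - utilization / 0.90
    print(f"at {utilization:.0%} utilization, roughly ${fleet_cost * idle_fraction / 1e9:.1f}B/year of machine cost is spare capacity relative to a 90% target")
</code></pre> <p>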
With numbers like that, even small improvements have a large impact, and this isn't a small improvement.</p> <p></p> <p>How is it possible to get 2x to 9x better utilization on the same hardware? The low end of those typical utilization numbers comes from having a service with variable demand and fixed machine allocations. Say you have 100 machines dedicated to Jenkins. Those machines might be very busy when devs are active, but they might also have 2% utilization at 3am. Dynamic allocation (switching the machines to other work when they're not needed) can get a typical latency-sensitive service up to somewhere in the 30%-70% range. To do better than that across a wide variety of latency-sensitive workloads with tight SLAs, we need some way to schedule low priority work on the same machines, without affecting the latency of the high priority work.</p> <p>It's not obvious that this is possible. If both high and low priority workloads need to monopolize some shared resources like the last-level cache (LLC), memory bandwidth, disk bandwidth, or network bandwidth, then we're out of luck. With the exception of some specialized services, it's rare to max out disk or network. But what about caches and memory? It turns out that <a href="http://www.industry-academia.org/download/ASPLOS12_Clearing_the_Clouds.pdf">Ferdman et al.</a> looked at this back in 2012 and found that typical server workloads don't benefit from having more than 4MB - 6MB of LLC, despite modern server chips having much larger caches.</p> <p><img src="images/intel-cat/server_llc.png" width="321" height="208"></p> <p>For this graph, scale-out workloads are things like distributed key-value stores, MapReduce-like computations, web search, web serving, etc. SPECint(mcf) is a traditional workstation benchmark. “server” is old school server benchmarks like SPECweb and TPC. We can see that going from 4MB to 11MB of LLC has a small effect on typical datacenter workloads, but a significant effect on this traditional workstation benchmark.</p> <p>Datacenter workloads operate on such large data sets that it's often impossible to fit the dataset in RAM on a single machine, let alone in cache, making a larger LLC not particularly useful. This result was confirmed by <a href="http://www.eecs.harvard.edu/~skanev/papers/isca15wsc.pdf">Kanev et al.'s ISCA 2015 paper where they looked at workloads at Google</a>. They also showed that memory bandwidth utilization is, on average, quite low.</p> <p><img src="images/intel-cat/low_bandwidth_utilization.png" width="342" height="175"></p> <p>You might think that the low bandwidth utilization is because the workloads are compute bound and don't have many memory accesses. However, when the authors looked at what the cores were doing, they found that a lot of time was spent stalled, waiting for cache/memory.</p> <p><img src="images/intel-cat/cache_stalls.png" width="328" height="169"></p> <p>Each row is a Google workload. When running these typical workloads, cores spend somewhere between 46% and 61% of their time blocked on cache/memory. It's curious that we have low cache hit rates, a lot of time stalled on cache/memory, and low bandwidth utilization. This is suggestive of workloads spending a lot of time waiting on memory accesses that have some kind of dependencies that prevent them from being executed independently.</p> <p>LLCs for high-end server chips are between 12MB and 30MB, even though we only need 4MB to get 90% of the performance, and the 90%-ile utilization of bandwidth is 31%. 
This seems like a waste of resources. We have a lot of resources sitting idle, or not being used effectively. The good news is that, since we get such low utilization out of the shared resources on our chips, we should be able to schedule multiple tasks on one machine without degrading performance.</p> <p>Great! What happens when we schedule multiple tasks on one machine? The <a href="http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf">Lo et al. Heracles paper at ISCA this year explores this in great detail</a>. The goal of Heracles is to get better utilization on machines by co-locating multiple tasks on the same machine.</p> <p><img src="images/intel-cat/heracles_interference.png" width="665" height="343"></p> <p>The figure above shows three latency sensitive (LC) workloads with strict SLAs. websearch is the query serving service in Google search, ml_cluster is real-time text clustering, and memkeyval is a key-value store analogous to memcached. The values are latencies as a percent of maximum allowed by the SLA. The columns indicate the load on the service, and the rows indicate different types of interference. LLC, DRAM, and Network are exactly what they sound like: custom tasks designed to compete only for that resource. HyperThread means that the interfering task is a spinloop running in the other hyperthread on the same core (running in the same hyperthread isn't even considered since OS context switches are too expensive). CPU power is a task that's designed to use a lot of power and induce thermal throttling. Brain is deep learning. All of the interference tasks are run in a container with low priority.</p> <p>There's a lot going on in this figure, but we can immediately see that the best effort (BE) task we'd like to schedule can't co-exist with any of the LC tasks when only container priorities are used -- all of the <code>brain</code> rows are red, and even at low utilization (the leftmost columns), latency is way above 100% of the SLA latency. It's also clear that the different LC tasks have different profiles and can handle different types of interference. For example, websearch and ml_cluster are neither network nor compute intensive, so they can handle network and power interference well. However, since memkeyval is both network and compute intensive, it can't handle either network or power interference. The paper goes into a lot more detail about what you can infer from the details of the table. I find this to be one of the most interesting parts of the paper; I'm going to skip over it, but I recommend reading the paper if you're interested in this kind of thing.</p> <p>A simplifying assumption the authors make is that these types of interference are basically independent. This means that independent mechanisms that isolate the LC task from “too much” of each individual type of resource should be sufficient to prevent overall interference. That is, we can set some cap for each type of resource usage, and just stay below each cap. However, this assumption isn't exactly true -- for example, the authors show this figure that relates the LLC cache size to the number of cores allocated to an LC task.</p> <p><img src="images/intel-cat/llc_vs_cores.png" width="277" height="238"></p> <p>The vertical axis is the max load the LC task can handle before violating its SLA when allocated some specific LLC and number of cores. 
We can see that it's possible to trade off cache vs cores, which means that we can actually go above a resource cap in one dimension and maintain our SLA by using less of another resource. In the general case, we might also be able to trade off other resources. However, the assumption that we can deal with each resource independently reduces a complex optimization problem to something that's relatively straightforward.</p> <p>Now, let's look at each type of shared resource interference and how Heracles allocates resources to prevent SLA-violating interference.</p> <h4 id="core">Core</h4> <p>Pinning the LC and BE tasks to different cores is sufficient to prevent same-core context switching interference and hyperthreading interference. For this, Heracles used <a href="https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt">cpuset</a>. Cpuset allows you to limit a process (and its children) to only run on a limited set of CPUs.</p> <h4 id="network">Network</h4> <p>On the local machines, Heracles used <a href="https://wiki.archlinux.org/index.php/Advanced_traffic_control">qdisc</a> to enforce quotas. For more on cpuset, qdisc, and other quota/partitioning mechanisms, <a href="https://lwn.net/Articles/604609/">this LWN series on cgroups by Neil Brown is a good place to start</a>. Cgroups are used by a lot of widely used software now (Docker, Kubernetes, Mesos, etc.); they're probably worth learning about even if you don't care about this particular application.</p> <h4 id="power">Power</h4> <p>Heracles uses <a href="https://01.org/blogs/tlcounts/2014/running-average-power-limit-%E2%80%93-rapl">Intel's running average power limit</a> to estimate power. This is a feature on Sandy Bridge (2011) and newer processors that uses some on-chip monitoring hardware to estimate power usage fairly precisely. <a href="http://www.cs.binghamton.edu/~millerti/cs680r/papers/DVFS/SystemLevelAnalysis.pdf">Per-core dynamic voltage and frequency scaling</a> is used to limit power usage by specific cores to keep them from going over budget.</p> <h4 id="cache">Cache</h4> <p>The previous isolation mechanisms have been around for a while, but this one is new to Broadwell chips (released in 2015). The problem here is that if the BE task needs 1MB of LLC and the LC task needs 4MB of LLC, a single large allocation from the BE task will scribble all over the LLC, which is shared, wiping out the 4MB of cached data the LC task needs.</p> <p>Intel's “<a href="http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf">Cache Allocation Technology</a>” (CAT) allows the LLC to limit which cores can access different parts of the cache. Since we often want to pin performance sensitive tasks to cores anyway, this allows us to divide up the cache on a per-task basis.</p> <p><img src="images/intel-cat/CAT_LLC.png" width="431" height="157"></p> <p>Intel's <a href="http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cache-allocation-technology-white-paper.pdf">April 2015 whitepaper on what they call Cache Allocation Technology (CAT)</a> has some simple benchmarks comparing CAT vs. no-CAT. 
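</p> <p>The core-isolation step from the &quot;Core&quot; section above is easy to sketch. This isn't how Heracles itself is implemented (it manages cpuset cgroups); it's just a minimal illustration of the idea using Linux's sched_setaffinity, with a hypothetical core split and assuming a machine with at least 16 cores. The cache partitioning is configured separately, via CAT's model-specific registers (or, on newer kernels, the resctrl filesystem):</p> <pre><code>import os

LC_CORES = set(range(0, 14))  # hypothetical: latency-critical task gets cores 0-13
BE_CORES = {14, 15}           # hypothetical: best-effort task gets cores 14-15

# Pin the current process as if it were the best-effort task; a real
# controller would apply LC_CORES and BE_CORES to the two tasks' pids.
os.sched_setaffinity(0, BE_CORES)
print(os.sched_getaffinity(0))  # the set of cores this process may now run on
</code></pre> <p>Coming back to the CAT benchmarks in the whitepaper mentioned above:</p> <p>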
<p>In the whitepaper's example, they measure the latency to respond to PCIe interrupts while another application has heavy CPU-to-memory traffic, with CAT on and off.</p> <table> <thead> <tr> <th>Condition</th> <th>Min</th> <th>Max</th> <th>Avg</th> </tr> </thead> <tbody> <tr> <td>no CAT</td> <td>1.66</td> <td>30.91</td> <td>4.53</td> </tr> <tr> <td>CAT</td> <td>1.22</td> <td>20.98</td> <td>1.62</td> </tr> </tbody> </table> <p>With CAT, average latency is 36% of latency without CAT. Tail latency doesn't improve as much, but there's also a substantial improvement there. That's interesting, but to me the more interesting question is how effective this is on real workloads, which we'll see when we put all of these mechanisms together.</p> <p>Another use of CAT that I'm not going to discuss at all is to prevent timing attacks, like <a href="https://eprint.iacr.org/2015/898.pdf">this attack, which can recover RSA keys across VMs via LLC interference</a>.</p> <h4 id="dram-bandwidth">DRAM bandwidth</h4> <p>Broadwell and newer Intel chips have memory bandwidth monitoring, but no control mechanism. To work around this, Heracles drops the number of cores allocated to the BE task if it's interfering with the LC task by using too much bandwidth. The coarse grained monitoring and control for this is inefficient in a number of ways that are detailed in the paper, but this still works despite the inefficiencies. However, having per-core bandwidth limiting would give better results with less effort.</p> <h4 id="putting-it-all-together">Putting it all together</h4> <p>This graph shows the effective utilization of LC websearch with other BE tasks scheduled with enough slack that the SLA for websearch isn't violated.</p> <p><img src="images/new-cpu-features/google_heracles.png" width="338" height="209"></p> <p>From barroom conversations with folks at other companies, the baseline (in red) here already looks pretty good: 80% utilization during peak times with a 7 hour trough when utilization is below 50%. With Heracles, the <em>worst case</em> utilization is 80%, and the average is 90%. This is amazing.</p> <p>Note that effective utilization can be greater than 100% since it's measured as throughput for the LC task on a single machine at 100% load plus throughput for the BE task on a single machine at 100% load. For example, if one task needs 100% of the DRAM bandwidth and 0% of the network bandwidth, and the other task needs the opposite, the two tasks would be able to co-locate on the same machine and achieve 200% effective utilization.</p> <p>In the real world, we might “only” get 90% average utilization out of a system like Heracles. Recalling our operating cost estimate of $4 billion for a large company, if the company already had a quite-good average utilization of 75%, using <a href="http://www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024">a standard model for datacenter operating costs</a>, we'd expect 15% more throughput per dollar, or $600 million in free compute. From talking to smaller companies that are on their way to becoming large (companies that spend in the range of $10 million to $100 million a year on compute), they often have utilization that's in the 20% range. 
Using the same total cost model again, they'd expect to get a 300% increase in compute per dollar, or $30 million to $300 million a year in free compute, depending on their size<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <h4 id="other-observations">Other observations</h4> <p>All of the papers we've looked at have a lot of interesting gems. I'm not going to go into all of them here, but there are a few that jumped out at me.</p> <h5 id="arm-atom-servers">ARM / Atom servers</h5> <p>It's been known for a long time that datacenter machines spend approximately half their time stalled, waiting on memory. In addition, the average number of instructions per clock that server chips are able to execute on real workloads is quite low.</p> <p><img src="images/intel-cat/server_ipc.png" width="335" height="207"></p> <p>The top rows (with horizontal bars) are internal Google workloads and the bottom rows (with green dots) are workstation benchmarks from SPEC, a standard benchmark suite. We can see that Google workloads are lucky to average 0.5 instructions per clock. We also previously saw that these workloads cause cores to be stalled on memory at least half the time.</p> <p>Despite spending most of their time waiting for memory and averaging something like half an instruction per clock cycle, high-end server chips do much better than Atom or ARM chips on real workloads (Reddi et al., ToCS 2011). This sounds a bit paradoxical -- if chips are just waiting on memory, why should you need a high-performance chip? A tiny ARM chip can wait just as effectively. In fact, it might even be better at waiting since having <a href="http://yosefk.com/blog/amdahls-law-in-reverse-the-wimpy-core-advantage.html">more cores waiting means it can use more bandwidth</a>. But it turns out that servers also spend a lot of their time exploiting instruction-level parallelism, executing multiple instructions at the same time.</p> <p><img src="images/intel-cat/server_ilp.png" width="162" height="154"></p> <p>This is a graph of how many execution units are busy at the same time. Almost a third of the time is spent with 3+ execution units busy. In between long stalls waiting on memory, high-end chips are able to get more computation done and start waiting for the next stall earlier. Something else that's curious is that server workloads have much higher instruction cache miss rates than traditional workstation workloads.</p> <h5 id="code-and-data-prioritization-technology">Code and Data Prioritization Technology</h5> <p><img src="images/intel-cat/icache_miss.png" width="332" height="191"></p> <p>Once again, the top rows (with horizontal bars) are internal Google workloads and the bottom rows (with green dots) are workstation benchmarks from SPEC, a standard benchmark suite. The authors attribute this increase in instruction misses to two factors. First, that it's normal to deploy large binaries (100MB) that overwhelm instruction caches. And second, that instructions have to compete with much larger data streams for space in the cache, which causes a lot of instructions to get evicted.</p> <p>In order to address this problem, Intel introduced what they call “Code and Data Prioritization Technology” (CDP). This is an extension of CAT that lets you separately limit which subsets of the LLC instructions and data can occupy. Since it's targeted at the last-level cache, it doesn't directly address the graph above, which shows L2 cache miss rates. 
However, the cost of an L2 cache miss that hits in the LLC is something like <a href="http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt">26ns on Broadwell vs. 86ns for an L2 miss that also misses the LLC and has to go to main memory</a>, which is a substantial difference.</p> <p>Kanev et al. propose going a step further and having a split icache/dcache hierarchy. This isn't exactly a radical idea -- l1 caches are already split, so why not everything else? My guess is that Intel and other major chip vendors have simulation results showing that this doesn't improve performance per dollar, but who knows? Maybe we'll see split L2 caches soon.</p> <h5 id="spec">SPEC</h5> <p>A more general observation is that SPEC is basically irrelevant as a benchmark now. It's somewhat dated as a workstation benchmark, and simply completely inappropriate as a benchmark for servers, office machines, gaming machines, dumb terminals, laptops, and mobile devices<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>. The market for which SPEC is designed is getting smaller every year, and SPEC hasn't even been really representative of that market for at least a decade. And yet, among chip folks, it's still the most widely used benchmark around.</p> <p><img src="images/intel-cat/search_query.png" width="640" height="352"></p> <p>This is what a search query looks like at Google. A query comes in, a wide fanout set of RPCs are issued to a set of machines (the first row). Each of those machines also does a set of RPCs (the second row), those do more RPCs (the third row), and there's a fourth row that's not shown because the graph has so much going on that it looks like noise. This is one quite normal type of workload for a datacenter, and there's nothing in SPEC that looks like this.</p> <p>There are a lot more fun tidbits in all of these papers, and I recommend reading them if you thought anything in this post was interesting. <strong>If you liked this post, you'll probably also like <a href="https://www.youtube.com/watch?v=QBu2Ae8-8LM">this talk by Dick Sites on various performance and profiling related topics</a>, <a href="//danluu.com/clwb-pcommit/">this post on Intel's new CLWB and PCOMMIT instructions</a>, and <a href="//danluu.com/new-cpu-features/">this post on other &quot;new&quot; CPU features</a></strong>.</p> <p><small> Thanks to Leah Hanson, David Kanter, Joe Wilder, Nico Erfurth, and Jason Davies for comments/corrections on this. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">I often hear people ask, why is company X so big? You could do that with 1/10th as many engineers! That's often <a href="http://scraps.benkuhn.net/2015/09/02/darkmatter.html">not true</a>. But even when it's true, it's usually the case that doing so would leave a lot of money on the table. As companies scale up, smaller and smaller optimizations are worthwhile. For a company with enough scale, something a small startup wouldn't spend 10 minutes on can pay itself back tenfold even if it takes a team of five people a year. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:M">When I did my last set of interviews, I asked a number of mobile chip vendors how they measure things I care about on my phone, like responsiveness. Do they have a full end-to-end test with a fake finger and a camera that lets you see the actual response time to a click? Or maybe they have some tracing framework that can fake a click to see the response time? 
As far as I can tell, no one except Apple has a handle on this at all, which might explain why a two generation old iPhone smokes my state of the art Android phone in actual tasks, even though the Android phone crushes workstation benchmarks like SPEC, and benchmarks of what people did in the early 80s, like Dhrystone (both of which are used by multiple mobile processor vendors). I don't know if I can convince anyone who doesn't already believe this, but choosing good benchmarks is extremely important. I use an Android phone because I got it for free. The next time I buy a phone, I'm buying one that does tasks I actually do quickly, not one that runs academic benchmarks well. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> Slowlock limplock/ Wed, 30 Sep 2015 00:00:00 +0000 limplock/ <p>Every once in a while, you hear a story like “there was a case of a 1-Gbps NIC card on a machine that suddenly was transmitting only at 1 Kbps, which then caused a chain reaction upstream in such a way that the performance of the entire workload of a 100-node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes”. The stories are interesting and the postmortems are fun to read, but it's not really clear how vulnerable systems are to this kind of failure or how prevalent these failures are.</p> <p>The situation reminds me of distributed systems failures before <a href="https://aphyr.com/tags/jepsen">Jepsen</a>. There are lots of anecdotal horror stories, but a common response to those is “works for me”, even when talking about systems that are now known to be fantastically broken. A handful of companies that are really serious about correctness have good tests and metrics, but they mostly don't talk about them publicly, and the general public has no easy way of figuring out if the systems they're running are sound.</p> <p></p> <p>Thanh Do et al. have tried to look at this systematically -- <a href="http://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf">what's the effect of hardware that's been crippled but not killed</a>, and <a href="http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf">how often does this happen in practice</a>? It turns out that a lot of commonly used systems aren't robust against “limping” hardware, but that these types of failures are rare (at least until you have unreasonably large scale).</p> <p>The effect of a single slow node can be <a href="http://pages.cs.wisc.edu/~thanhdo/pdf/talk-socc-limplock.pdf">quite dramatic</a>:</p> <p><img src="images/limpware/fb_slow_nic.png" alt="The effect of a single slow NIC on an entire cluster" width="484" height="328"></p> <p>The job completion rate slowed down from 172 jobs per hour to 1 job per hour, effectively killing the entire cluster. Facebook has mechanisms to deal with dead machines, but they apparently didn't have any way to deal with slow machines at the time.</p> <p>When Do et al. looked at widely used open source software (HDFS, Hadoop, ZooKeeper, Cassandra, and HBase), they found similar problems.</p> <p><img src="images/limpware/limpware.png" width="617" height="362"></p> <p>Each subgraph is a different failure condition. F is HDFS, H is Hadoop, Z is ZooKeeper, C is Cassandra, and B is HBase. The leftmost (white) bar is the baseline no-failure case. Going to the right, the next bar is a crash, and the subsequent bars are results with a single piece of hardware degraded but not crashed (further right means slower). 
In most (but not all) cases, having degraded hardware affected performance a lot more than having failed hardware. Note that these graphs are all log scale; going up one increment is a 10x difference in performance!</p> <p>Curiously, a failed disk can cause some operations to speed up. That's because there are operations that have less replication overhead if a replica fails. It seems a bit weird to me that there isn't more overhead, because the system has to both find a replacement replica and replicate data, but what do I know?</p> <p>Anyway, why is a slow node so much worse than a dead node? The authors define three failure modes and explain what causes each one. There's operation limplock, when an operation is slow because some subpart of the operation is slow (e.g., a disk read is slow because the disk is degraded), node limplock, when a node is slow even for seemingly unrelated operations (e.g., a read from RAM is slow because a disk is degraded), and cluster limplock, where the entire cluster is slow (e.g., a single degraded disk makes an entire 1000 machine cluster slow).</p> <p>How do these happen?</p> <h4 id="operation-limplock">Operation Limplock</h4> <p>This one is the simplest. If you try to read from disk, and your disk is slow, your disk read will be slow. In the real world, we'll see this when operations have a single point of failure, and when monitoring is designed to handle total failure and not degraded performance. For example, an HBase access to a region goes through the server responsible for that region. The data is replicated on HDFS, but this doesn't help you if the node that owns the data is limping. Speaking of HDFS, it has a 60 second timeout and reads are done in 64KB chunks, which means your reads can slow down to roughly 1KB/s (64KB per 60 seconds) before HDFS will fail over to a healthy node.</p> <h4 id="node-limplock">Node Limplock</h4> <p>How can it be the case that (for example) a slow disk causes memory reads to be slow? Looking at HDFS again, it uses a thread pool. If every thread is busy very slowly completing a disk read, memory reads will block until a thread gets free.</p> <p>This isn't only an issue when using limited thread pools or other bounded abstractions -- the reality is that machines have finite resources, and unbounded abstractions will run into machine limits if they aren't carefully designed to avoid the possibility. For example, ZooKeeper keeps a queue of operations, and a slow follower can cause the leader's queue to exhaust physical memory.</p> <h4 id="cluster-limplock">Cluster Limplock</h4> <p>An entire cluster can easily become unhealthy if it relies on a single primary and the primary is limping. Cascading failures can also cause this -- the first graph, where a cluster goes from completing 172 jobs an hour to 1 job an hour, is actually a Facebook workload on Hadoop. The thing that's surprising to me here is that Hadoop is supposed to be tail tolerant -- individual slow tasks aren't supposed to have a large impact on the completion of the entire job. So what happened? Unhealthy nodes infect healthy nodes and eventually lock up the whole cluster.</p> <p><img src="images/limpware/cascade.png" alt="An unhealthy node infects an entire cluster" width="211" height="147"></p> <p>Hadoop's tail tolerance comes from kicking off speculative computation when results are coming in slowly -- in particular, when stragglers come in unusually slowly compared to other results.</p>
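<p>Concretely, the rule Hadoop uses (described in more detail below) amounts to something like the following sketch. This isn't Hadoop's actual code; the task representation and names are made up for illustration:</p> <pre><code># Speculate on a task if it has run for at least a minute and its progress
# score is at least 0.2 below the average for tasks of the same category.
SPECULATIVE_LAG = 0.2
MIN_RUNTIME_SECS = 60

def progress_score(task):
    if task["kind"] == "map":
        return task["bytes_read"] / task["bytes_total"]
    # reduce tasks: the copy, sort, and reduce phases each contribute a third
    return (task["copy_frac"] + task["sort_frac"] + task["reduce_frac"]) / 3.0

def should_speculate(task, tasks_in_same_category):
    avg = sum(progress_score(t) for t in tasks_in_same_category) / len(tasks_in_same_category)
    return (task["runtime_secs"] &gt;= MIN_RUNTIME_SECS
            and progress_score(task) &lt; avg - SPECULATIVE_LAG)
</code></pre> <p>Note that the comparison is against the average progress of the task's own category, which is what matters in the failure described next.</p>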
<p>This rule works fine when a reduce node is limping (subgraph H2), but when a <a href="http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">map node</a> limps (subgraph H1), it can slow down all reducers in the same job, which defeats Hadoop's tail-tolerance mechanisms.</p> <p><img src="images/limpware/hadoop_deadlock.png" alt="A single bad map node effectively deadlocks hadoop" width="176" height="62"></p> <p>To see why, we have to look at <a href="https://www.usenix.org/legacy/event/osdi08/tech/full_papers/zaharia/zaharia_html/index.html">Hadoop's speculation algorithm</a>. Each task has a progress score, which is a number between 0 and 1 (inclusive). For a map, the score is the fraction of input data read. For a reduce, each of three phases (copying data from mappers, sorting, and reducing) gets 1/3 of the score. A speculative copy of a task will get run if the task has run for at least one minute and has a progress score that's less than the average for its category minus 0.2.</p> <p>In case H1, the map node's NIC is limping, so the map phase completes normally since results end up written to local disk. But when reduce nodes try to fetch data from the limping map node, they all stall, pulling down the average score for the category, which prevents speculative tasks from being run. Looking at the big picture, each Hadoop node has a limited number of map and reduce slots. If those fill up with limping tasks, the entire node will lock up. Since Hadoop isn't designed to avoid cascading failures, this eventually causes the entire cluster to lock up.</p> <p>One thing I find interesting is that this exact cause of failures was <a href="http://research.google.com/archive/mapreduce.html">described in the original MapReduce paper</a>, published in 2004. They even explicitly called out slow disk and network as causes of stragglers, which motivated their speculative execution algorithm. However, they didn't provide the details of the algorithm. The open source clone of MapReduce, Hadoop, attempted to avoid the same problem. Hadoop was initially released in 2008. Five years later, when the paper we're reading was published, its built-in mechanism for straggler detection not only failed to prevent multiple types of stragglers, it also failed to prevent stragglers from effectively deadlocking the entire cluster.</p> <h3 id="conclusion">Conclusion</h3> <p>I'm not going to go into details of how each system fared under testing. That's detailed quite nicely in the paper, which I recommend reading if you're curious. To summarize, Cassandra does quite well, whereas HDFS, Hadoop, and HBase don't.</p> <p>Cassandra seems to do well for two reasons. First, <a href="https://issues.apache.org/jira/browse/CASSANDRA-488">this patch from 2009</a> prevents queue overflows from infecting healthy nodes, which eliminates a major failure mode that causes cluster-wide failures in other systems. 
Second, the architecture used (<a href="http://www.eecs.harvard.edu/~mdw/proj/seda/">SEDA</a>) decouples different types of operations, which lets good operations continue to execute even when some operations are limping.</p> <p>My big questions after reading this paper are: how often do these kinds of failures happen, how can we catch them before they hit production, and shouldn't reasonable metrics/reporting catch this sort of thing anyway?</p> <p>For the answer to the first question, many of the same authors also have a paper where they looked at <a href="http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf">3000 failures in Cassandra, Flume, HDFS, and ZooKeeper</a> and determined which failures were hardware related and what the hardware failure was.</p> <p><img src="images/limpware/hardware_failure_causes.png" width="147" height="104"></p> <p>14 cases of degraded performance vs. 410 other hardware failures. In their sample, that's 3% of failures; rare, but not so rare that we can ignore the issue.</p> <p>If we can't ignore these kinds of errors, how can we catch them before they go into production? The paper uses the <a href="http://www.emulab.net/">Emulab testbed</a>, which is really cool. Unfortunately, the Emulab page reads “Emulab is a public facility, available without charge to most researchers worldwide. If you are unsure if you qualify for use, please see our policies document, or ask us. If you think you qualify, you can apply to start a new project.”. That's understandable, but that means it's probably not a great solution for most of us.</p> <p>The vast majority of limping hardware is due to network or disk slowness. Why couldn't a modified version of Jepsen, or something like it, simulate disk or network slowness? A naive implementation wouldn't get anywhere near the precision of Emulab, but since we're talking about order of magnitude slowdowns, having 10% (or even 2x) variance should be ok for testing the robustness of systems against degraded hardware. There are a number of ways you could imagine that working. For example, to simulate a slow network on Linux, you could try throttling <a href="http://linux-ip.net/articles/Traffic-Control-HOWTO/">via qdisc</a>, <a href="https://github.com/majek/fluxcapacitor">hooking syscalls via ptrace</a>, etc. For a slow CPU, you can rate-limit via cgroups and cpu.shares, or just map the process to UC memory (or maybe WT or WC if that's a bit too slow), and so on and so forth for disk and other failure modes.</p> <p>That leaves my last question: shouldn't systems already catch these sorts of failures even if they're not concerned about them in particular? As we saw above, systems with cripplingly slow hardware are rare enough that we can just treat them as dead without significantly impacting our total compute resources. And systems with crippled hardware can be detected pretty straightforwardly. Moreover, multi-tenant systems have to do <a href="//danluu.com/new-cpu-features/#partitioning">continuous monitoring of their own performance to get good utilization</a> anyway.</p> <p>So why should we care about designing systems that are robust against limping hardware? One part of the answer is defense in depth. Of course we should have monitoring, but we should also have systems that are robust when our monitoring fails, as it inevitably will. Another part of the answer is that by making systems more tolerant to limping hardware, we'll also make them more tolerant to interference from other workloads in a multi-tenant environment. 
That last bit is a somewhat speculative empirical question -- it's possible that it's more efficient to design systems that aren't particularly robust against interference from competing work on the same machine, while using <a href="//danluu.com/intel-cat/">better partitioning</a> to avoid interference.</p> <p><br> <br> <small> Thanks to Leah Hanson, Hari Angepat, Laura Lindzey, Julia Evans, and James M. Lee for comments/discussion. </small></p> Steve Yegge's prediction record yegge-predictions/ Mon, 31 Aug 2015 00:00:00 +0000 yegge-predictions/ <p></p> <p>I try to avoid making predictions<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. It's a no-win proposition: if you're right, <a href="https://en.wikipedia.org/wiki/Hindsight_bias">hindsight bias</a> makes it look like you're pointing out the obvious. And most predictions are wrong. Every once in a while when someone does a review of predictions from pundits, they're almost always wrong at least as much as you'd expect from random chance, and then hindsight bias makes each prediction look hilariously bad.</p> <p>But, occasionally, you run into someone who makes pretty solid non-obvious predictions. I was re-reading some of Steve Yegge's old stuff and it turns out that he's one of those people.</p> <p>His most famous prediction is probably <a href="http://steve-yegge.blogspot.com/2007/02/next-big-language.html">the rise of JavaScript</a>. This now seems incredibly obvious in hindsight, so much so that the future laid out in Gary Bernhardt's <a href="https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript">Birth and Death of JavaScript</a> seems at least a little plausible. But you can see how non-obvious Steve's prediction was at the time by reading both the comments on his blog, and comments from HN, reddit, and the other usual suspects.</p> <p>Steve was also crazy-brave enough to <a href="https://sites.google.com/site/steveyegge2/ten-predictions">post ten predictions about the future</a> in 2004. He says “Most of them are probably wrong. The point of the exercise is the exercise itself, not in what results.”, but the predictions are actually pretty reasonable.</p> <h4 id="prediction-1-xml-databases-will-surpass-relational-databases-in-popularity-by-2011">Prediction #1: XML databases will surpass relational databases in popularity by 2011</h4> <p>2011 might have been slightly too early and JSON isn't exactly XML, but NoSQL databases have done really well for pretty much the reason given in the prediction, “Nobody likes to do O/R mapping; everyone just wants a solution.”. Sure, <a href="https://www.mongodb.com/presentations/theres-monster-my-closet-architecture-mongodb-powered-event-processing-system">Mongo may lose your data, but it's easy to set up and use</a>.</p> <h4 id="prediction-2-someone-will-make-a-lot-of-money-by-hosting-open-source-web-applications">Prediction #2: Someone will make a lot of money by hosting open-source web applications</h4> <p>This depends on what you mean by “a lot”, but this seems basically correct.</p> <blockquote> <p>We're rapidly entering the age of hosted web services, and big companies are taking advantage of their scalable infrastructure to host data and computing for companies without that expertise.</p> </blockquote> <p>For reasons that seem baffling in retrospect, Amazon understood this long before any of its major competitors and was able to get a huge head start on everybody else. 
Azure didn't get started until 2009, and Google didn't get serious about public cloud hosting until even later.</p> <p>Now that everyone's realized what Steve predicted in 2004, it seems like every company is trying to spin up a public cloud offering, but the market is really competitive and hiring has become extremely difficult. Despite giving out a large number of offers an integer multiple above market rates, Alibaba still hasn't managed to put together a team that's been able to assemble a competitive public cloud, and companies that are trying to get into the game now without as much cash to burn as Alibaba are having an even harder time.</p> <blockquote> <p>For both bug databases and source-control systems, the obstacle to outsourcing them is trust. I think most companies would love it if they didn't have to pay someone to administer Bugzilla, Subversion, Twiki, etc. Heck, they'd probably like someone to outsource their email, too.</p> </blockquote> <p>A lot of companies have moved both issue tracking and source-control to GitHub or one of its competitors, and even more have moved if you just count source-control. Hosting your own email is also a thing of the past for all but the most paranoid (or most bogged down in legal compliance issues).</p> <h4 id="prediction-3-multi-threaded-programming-will-fall-out-of-favor-by-2012">Prediction #3: Multi-threaded programming will fall out of favor by 2012</h4> <p>Hard to say if this is right or not. Depends on who you ask. This seems basically right for applications that don't need the absolute best levels for performance, though.</p> <blockquote> <p>In the past, oh, 20 years since they invented threads, lots of new, safer models have arrived on the scene. Since 98% of programmers consider safety to be unmanly, the alternative models (e.g. CSP, fork/join tasks and lightweight threads, coroutines, Erlang-style message-passing, and other event-based programming models) have largely been ignored by the masses, including me.</p> </blockquote> <p>Shared memory concurrency is still where it's at for really high performance programs, but Go has popularized CSP; actors and futures are both “popular” on the JVM; etc.</p> <h4 id="prediction-4-java-s-market-share-on-the-jvm-will-drop-below-50-by-2010">Prediction #4: Java's &quot;market share&quot; on the JVM will drop below 50% by 2010</h4> <p>I don't think this was right in 2010, or even now, although we're moving in the right direction. There's a massive amount of dark matter -- programmers who do business logic and don't blog or give talks -- that makes this prediction unlikely to come true in the near future.</p> <p>It's impossible to accurately measure market share, but basically every language ranking you can find will put <a href="http://redmonk.com/sogrady/2015/07/01/language-rankings-6-15/">Java in the top 3, with Scala and Clojure not even in the top 10</a>. Given the near power-law distribution of measured language usage, Java must still be above 90% share (and that's probably a gross underestimate).</p> <h4 id="prediction-5-lisp-will-be-in-the-top-10-most-popular-programming-languages-by-2010">Prediction #5: Lisp will be in the top 10 most popular programming languages by 2010</h4> <p>Not even close. Depending on how you measure this, Clojure might be in the top 20 (it is if you believe the Redmonk rankings), but it's hard to see it making it into the top 10 in this decade. As with the previous prediction, there's just way too much inertia here. 
Breaking into the top 10 means joining the ranks of Java, JS, PHP, Python, Ruby, C, C++, and C#. Clojure just isn't boring enough. C# was able to sneak in by pretending to be boring, but Clojure's got no hope of doing that and there isn't really another Dylan on the horizon.</p> <h4 id="prediction-6-a-new-internet-community-hangout-will-appear-one-that-you-and-i-will-frequent">Prediction #6: A new internet community-hangout will appear. One that you and I will frequent</h4> <p>This seems basically right, at least for most values of “you”.</p> <blockquote> <p>Wikis, newsgroups, mailing lists, bulletin boards, forums, commentable blogs — they're all bullshit. Home pages are bullshit. People want to socialize, and create content, and compete lightly with each other at different things, and learn things, and be entertained: all in the same place, all from their couch. Whoever solves this — i.e. whoever creates AOL for real people, or whatever the heck this thing turns out to be — is going to be really, really rich.</p> </blockquote> <p>Facebook was founded the year that was written. Zuckerberg is indeed really, really rich.</p> <h4 id="prediction-7-the-mobile-wireless-handheld-market-is-still-at-least-5-years-out">Prediction #7: The mobile/wireless/handheld market is still at least 5 years out</h4> <p>Five years from Steve's prediction would have been 2009. Although the iPhone was released in 2007, it was a while before sales really took off. In 2009, the majority of phones were feature phones, and Android was barely off the ground.</p> <p><img src="images/yegge-predictions/phone_sales.jpeg" alt="Symbian is in the lead until Q4 2010!" width="640" height="434"></p> <p>Note that this graph only runs until 2013; if you graph things <a href="https://en.wikipedia.org/wiki/Mobile_operating_system">up to 2015</a> on a linear scale, sales are so low in 2009 that you basically can't even see what's going on.</p> <h4 id="prediction-8-someday-i-will-voluntarily-pay-google-for-one-of-their-services">Prediction #8: Someday I will voluntarily pay Google for one of their services</h4> <p>It's hard to tell if this is correct (Steve, feel free to <a href="https://twitter.com/danluu">let me know</a>), but it seems true in spirit. Google has more and more services that they charge for, and they're even experimenting with <a href="https://www.google.com/contributor/welcome/">letting people pay to avoid seeing ads</a>.</p> <h4 id="prediction-9-apple-s-laptop-sales-will-exceed-those-of-hp-compaq-ibm-dell-and-gateway-combined-by-2010">Prediction #9: Apple's laptop sales will exceed those of HP/Compaq, IBM, Dell and Gateway combined by 2010</h4> <p>If you include tablets, Apple hit #1 in the market by 2010, but I don't think they do better than all of the old workhorses combined. Again, this seems to underestimate the effect of dark matter, in this case, people buying laptops for boring reasons, e.g., corporate buyers and normal folks who want something under Apple's price range.</p> <h4 id="prediction-10-in-five-years-time-most-programmers-will-still-be-average">Prediction #10: In five years' time, most programmers will still be average</h4> <p>More of a throwaway witticism than a prediction, but sure.</p> <p>That's a pretty good set of predictions for 2004. 
With the exception of the bit about Lisp, all of the predictions seem directionally correct; the misses are mostly caused by underestimating the sheer amount of inertia it takes for a young/new solution to take over.</p> <p>Steve also has a number of posts that aren't explicitly about predictions that, nevertheless, make pretty solid predictions about how things are today, written way back in 2004. There's <a href="https://sites.google.com/site/steveyegge2/its-not-software">It's Not Software</a>, which was years ahead of its time about how people write “software”, how writing server apps is really different from writing shrinkwrap software in a way that obsoletes a lot of previously solid advice, like Joel's dictum against rewrites, as well as how service oriented architectures look; <a href="https://sites.google.com/site/steveyegge2/google-at-delphi">the Google at Delphi</a> (again from 2004) correctly predicts the importance of ML and AI as well as Google's very heavy investment in ML; an <a href="https://web.archive.org/web/20060814161212/http://sztywny.titaniumhosting.com/2006/07/23/stiff-asks-great-programmers-answers/">old interview where he predicts</a> &quot;web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/&quot;; etc. A number of Steve's internal Google blog posts also make interesting predictions, but AFAIK those are confidential. Of course all these things seem obvious in retrospect, but that's just part of Steve's plan to pass as a normal human being.</p> <p>In <a href="https://plus.google.com/110981030061712822816/posts/AaygmbzVeRq">a relatively recent post</a>, Steve throws Jeff Bezos under the bus, exposing him as one of a number of “hyper-intelligent aliens with a tangential interest in human affairs”. While the crowd focuses on Jeff, Steve is able to sneak out the back. But we're onto you, Steve.</p> <p><small> Thanks to Leah Hanson, Chris Ball, Mindy Preston, and Paul Gross for comments/corrections. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>When asked about a past prediction of his, Peter Thiel commented that writing is dangerous and mentioned that a professor once told him that writing a book is more dangerous than having a child -- you can always disown a child, but there's nothing you can do to disown a book.</p> <p>The only prediction I can recall publicly making is that I've been on <a href="https://news.ycombinator.com/item?id=3164452">the</a> <a href="https://news.ycombinator.com/item?id=4954170">record</a> for at least five years saying that, despite the hype, ARM isn't going to completely crush Intel in the near future, but that seems so obvious that it's not even worth calling it a prediction. Then again, this was a minority opinion up until pretty recently, so maybe it's not that obvious.</p> <p>I've also correctly predicted the failure of a number of chip startups, but since the vast majority of startups fail, that's expected. Predicting successes is much more interesting, and my record there is decidedly mixed. Based purely on who was involved, I thought that SiByte, Alchemy, and PA Semi were good bets. 
Of those, SiByte was a solid success, Alchemy didn't work out, and PA Semi was maybe break-even.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Reading postmortems postmortem-lessons/ Thu, 20 Aug 2015 00:00:00 +0000 postmortem-lessons/ <p>I love reading postmortems. They're educational, but unlike most educational docs, they tell an entertaining story. I've spent a decent chunk of time reading postmortems at both Google and Microsoft. I haven't done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.</p> <p></p> <h3 id="error-handling">Error Handling</h3> <p>Proper error handling code is hard. Bugs in error handling code are a major cause of <em>bad</em> problems. This means that the probability of having sequential bugs, where an error causes buggy error handling code to run, isn't just the independent probabilities of the individual errors multiplied. It's common to have cascading failures cause a serious outage. There's a sense in which this is obvious -- error handling is generally regarded as being hard. If I mention this to people they'll tell me how obvious it is that a disproportionate number of serious postmortems come out of bad error handling and cascading failures where errors are repeatedly not handled correctly. But despite this being “obvious”, it's not so obvious that sufficient test and static analysis effort are devoted to making sure that error handling works.</p> <p>For more on this, Ding Yuan et al. have a great paper and talk: <a href="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan">Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems</a>. The paper is basically what it says on the tin. The authors define a critical failure as something that can take down a whole cluster or cause data corruption, and then look at a couple hundred bugs in Cassandra, HBase, HDFS, MapReduce, and Redis, to find 48 critical failures. They then look at the causes of those failures and find that most bugs were due to bad error handling. 92% of those failures are actually from errors that are handled incorrectly.</p> <p><img src="images/postmortem-lessons/osdi_error.png" alt="Graphic of previous paragraph" width="530" height="142"></p> <p>Drilling down further, 25% of bugs are from simply ignoring an error, 8% are from catching the wrong exception, 2% are from incomplete TODOs, and another 23% are &quot;easily detectable&quot;, which are defined as cases where “the error handling logic of a non-fatal error was so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs”. By the way, this is one reason I don't mind Go style error handling, despite the common complaint that the error checking code is cluttering up the main code path. If you care about building robust systems, the error checking code is the main code!</p> <p><a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf">The full paper</a> has a lot of gems that I mostly won't describe here. For example, they explain the unreasonable effectiveness of <a href="https://aphyr.com/tags/jepsen">Jepsen</a> (98% of critical failures can be reproduced in a 3 node cluster).</p>
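<p>To make the largest bucket above -- errors that are simply ignored -- concrete, here's the shape of bug the authors are counting, sketched in Python with made-up function names (this isn't code from any of the studied systems):</p> <pre><code>def flush_to_disk(f, data):
    try:
        f.write(data)
        f.flush()
    except OSError:
        pass          # error 'handled' by ignoring it; data loss shows up much later

def flush_to_disk_checked(f, data):
    try:
        f.write(data)
        f.flush()
    except OSError as e:
        # treat the error path as the main code: fail loudly instead of silently
        raise RuntimeError("flush failed, refusing to ack the write") from e
</code></pre> <p>The first version looks harmless in review and will pass any test that never forces the write to fail; the second at least turns quiet data loss into a loud error.</p>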
<p>The authors also dig into what percentage of failures are non-deterministic (26% of their sample), as well as the causes of non-determinism, and create a static analysis tool that can catch many common error-caused failures.</p> <h3 id="configuration">Configuration</h3> <p>Configuration bugs, not code bugs, are the most common cause I've seen of really bad outages. When I looked at publicly available postmortems, searching for “global outage postmortem” turned up results where about 50% of the outages were caused by configuration changes. Publicly available postmortems aren't a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I'm often told that it's obvious that config changes are scary, but it's not so obvious that most companies test and stage config changes like they do code changes.</p> <p>Except in extreme emergencies, risky code changes are basically never simultaneously pushed out to all machines because of the risk of taking down a service company-wide. But it seems that every company has to learn the hard way that seemingly benign config changes can also cause a company-wide service outage. For example, this was the cause of the infamous November 2014 Azure outage. I don't mean to pick on MS here; their major competitors have also had serious outages for similar reasons, and they've all put processes into place to reduce the risk of that sort of outage happening again.</p> <p>I don't mean to pick on large cloud companies, either. If anything, the situation there is better than at most startups, even very well funded ones. Most of the “unicorn” startups that I know of don't have a proper testing/staging environment that lets them test risky config changes. I can understand why -- it's often hard to set up a good QA environment that mirrors prod well enough that config changes can get tested, and like driving without a seatbelt, nothing bad happens the vast majority of the time. If I had to make my own seatbelt before driving my car, I might not drive with a seatbelt either. Then again, if driving without a seatbelt were as scary as making config changes, I might consider it.</p> <p><a href="http://research.microsoft.com/en-us/um/people/gray/papers/TandemTR85.7_WhyDoComputersStop.pdf">Back in 1985, Jim Gray observed</a> that &quot;operator actions, system configuration, and system maintenance was the main source of failures -- 42%&quot;. Since then, there have been a variety of studies that have found similar results. For example, <a href="http://asrabkin.bitbucket.org/papers/software12.pdf">Rabkin and Katz</a> found the following causes for failures:</p> <p><img src="images/postmortem-lessons/rabkin_katz.png" alt="Causes in decreasing order: misconfig, bug, operational, system, user, install, hardware" width="303" height="234"></p> <h3 id="hardware">Hardware</h3> <p>Basically every part of a machine can fail. Many components can also cause data corruption, often at rates that are much higher than advertised. For example, <a href="http://research.google.com/pubs/pub35162.html">Schroeder, Pinheiro, and Weber</a> found DRAM error rates were more than an order of magnitude worse than advertised. The number of silent errors is staggering, and this actually caused problems for Google back before they switched to ECC RAM. 
Even with error detecting hardware, things can go wrong; relying on <a href="http://noahdavids.org/self_published/CRC_and_checksum.html">Ethernet checksums to protect against errors is unsafe</a> and I've personally seen malformed packets get passed through as valid packets. At scale, you can run into more undetected errors than you expect, if you expect hardware checks to catch hardware data corruption.</p> <p>Failover from bad components can also fail. <a href="https://aws.amazon.com/message/67457/">This AWS failure tells a typical story</a>. Despite taking reasonable sounding measures to regularly test the generator power failover process, a substantial fraction of AWS East went down when a storm took out power and a set of backup generators failed to correctly provide power when loaded.</p> <h3 id="humans">Humans</h3> <p>This section should probably be called process error and not human error since I consider having humans in a position where they can accidentally cause a catastrophic failure to be a process bug. It's generally accepted that, if you're running large scale systems, you have to have systems that are robust to hardware failures. If you do the math on how often machines die, it's obvious that systems that aren't robust to hardware failure cannot be reliable. But humans are even more error prone than machines. Don't get me wrong, I like humans. Some of my best friends are human. But if you repeatedly put a human in a position where they can cause a catastrophic failure, you'll eventually get a catastrophe. And yet, the following pattern is still quite common:</p> <blockquote> <p>Oh, we're about to do a risky thing! Ok, let's have humans be VERY CAREFUL about executing the risky operation. Oops! We now have a global outage.</p> </blockquote> <p>Postmortems that start with “Because this was a high risk operation, foobar high risk protocol was used” are ubiquitous enough that I now think of extra human-operated steps that are done to mitigate human risk as an ops smell. Some common protocols are having multiple people watch or confirm the operation, or having ops people standing by in case of disaster. Those are reasonable things to do, and they mitigate risk to some extent, but in many postmortems I've read, automation could have reduced the risk a lot more or removed it entirely. There are a lot of cases where the outage happened because a human was expected to flawlessly execute a series of instructions and failed to do so. That's exactly the kind of thing that programs are good at! In other cases, a human is expected to perform manual error checking. That's sometimes harder to automate, and a less obvious win (since a human might catch an error case that the program misses), but in most cases I've seen it's still a net win to automate that sort of thing.</p> <p><img src="images/postmortem-lessons/idc2013_downtime.png" alt="Causes in decreasing order: human error, system failure, out of IPs, natural disaster"></p> <p>In an IDC survey, respondents voted human error as the most troublesome cause of problems in the datacenter.</p> <p>One thing I find interesting is how underrepresented human error seems to be in public postmortems. As far as I can tell, Google and MS both have substantially more automation than most companies, so I'd expect their postmortem databases to contain proportionally fewer human error caused outages than I see in public postmortems, but in fact it's the opposite. 
My guess is that's because companies are less likely to write up public postmortems when the root cause was human error enabled by risky manual procedures. A prima facie plausible alternate reason is that improved technology actually increases the fraction of problems caused by humans, which is true in some industries, like flying. I suspect that's not the case here due to the sheer number of manual operations done at a lot of companies, but there's no way to tell for sure without getting access to the postmortem databases at multiple companies. If any company wants to enable this analysis (and others) to be done (possibly anonymized), please get in touch.</p> <h3 id="monitoring-alerting">Monitoring / Alerting</h3> <p>The lack of proper monitoring is never the sole cause of a problem, but it's often a serious contributing factor. As is the case for human errors, these seem underrepresented in public postmortems. When I talk to folks at other companies about their worst near disasters, a large fraction of them come from not having the right sort of alerting set up. They're often saved from having a disaster bad enough to require a public postmortem by some sort of ops heroism, but heroism isn't a scalable solution.</p> <p>Sometimes, those near disasters are caused by subtle coding bugs, which is understandable. But more often, it's due to blatant process bugs, like not having a clear escalation path for an entire class of failures, causing the wrong team to debug an issue for half a day, or not having a backup on-call, causing a system to lose or corrupt data for hours when (inevitably) the primary on-call person doesn't notice that something's going wrong.</p> <p>The <a href="https://en.wikipedia.org/wiki/Northeast_blackout_of_2003">Northeast blackout of 2003</a> is a great example of this. It could have been a minor outage, or even just a minor service degradation, but (among other things) a series of missed alerts caused it to become one of the worst power outages ever.</p> <h3 id="not-a-conclusion">Not a Conclusion</h3> <p>This is where the conclusion's supposed to be, but I'd really like to do some serious data analysis before writing some kind of conclusion or call to action. What should I look for? What other major classes of common errors should I consider? These aren't rhetorical questions and I'm genuinely interested in hearing about other categories I should think about. <a href="https://twitter.com/danluu">Feel free to ping me here</a>. I'm also trying to collect <a href="https://github.com/danluu/post-mortems">public postmortems here</a>.</p> <p>One day, I'll get around to the serious analysis, but even without going through and classifying thousands of postmortems, I'll probably do a few things differently as a result of having read a bunch of these. I'll spend relatively more time during my code reviews on errors and error handling code, and relatively less time on the happy path. I'll also spend more time checking for and trying to convince people to fix “obvious” process bugs.</p> <p>One of the things I find curious about these failure modes is that when I talked about what I found with other folks, at least one person told me that each process issue I found was obvious. But these “obvious” things still cause a lot of failures. In one case, someone told me that what I was telling them was obvious at pretty much the same time their company was having a global outage of a multi-billion dollar service, caused by the exact thing we were talking about. 
Just because something is obvious doesn't mean it's being done.</p> <h3 id="elsewhere">Elsewhere</h3> <p>Richard Cook's <a href="http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf">How Complex Systems Fail</a> takes a more general approach; his work inspired <a href="http://www.amazon.com/gp/product/0312430000/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0312430000&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=2VKSNUT3M4W7KUMR">The Checklist Manifesto</a>, which has saved lives.</p> <p>Allspaw and Robbins' <a href="http://www.amazon.com/gp/product/1449377440/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1449377440&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=U7MG5A2J6OLQG627">Web Operations: Keeping the Data on Time</a> talks about this sort of thing in the context of web apps. Allspaw also has a nice post about <a href="http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/">some related literature from other fields</a>.</p> <p>In areas that are a bit closer to what I'm used to, there's a long history of studying the causes of failures. Some highlights include <a href="http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf">Jim Gray's Why Do Computers Stop and What Can Be Done About It?</a> (1985), <a href="https://www.usenix.org/legacy/event/usits03/tech/full_papers/oppenheimer/oppenheimer_html/">Oppenheimer et al.'s Why Do Internet Services Fail, and What Can Be Done About It?</a> (2003), <a href="Understanding and Dealing with Operator Mistakes in Internet Services">Nagaraja et al.'s Understanding and Dealing with Operator Mistakes in Internet Services</a> (2004), part of <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024">Barroso et al.'s The Datacenter as a Computer</a> (2009), <a href="http://asrabkin.bitbucket.org/papers/software12.pdf">Rabkin and Katz's How Hadoop Clusters Break</a> (2013), and <a href="http://cseweb.ucsd.edu/~tixu/papers/sosp13.pdf">Xu et al.'s Do Not Blame Users for Misconfigurations</a> (2013).</p> <p>There's also a long history of trying to understand aircraft reliability, and the <a href="https://en.wikipedia.org/wiki/Reliability-centered_maintenance">story of how processes have changed over the decades</a> is fascinating, although I'm not sure how to generalize those lessons.</p> <p>Just as an aside, I find it interesting how hard it's been to eke out extra uptime and reliability. In 1974, <a href="http://web.eecs.umich.edu/~prabal/teaching/eecs582-w13/readings/ritchie74unix.pdf">Ritchie and Thompson</a> wrote about a system &quot;costing as little as $40,000&quot; with 98% uptime. A decade later, Jim Gray used 99.6% uptime as a reasonably good benchmark. We can do much better than that now, but the level of complexity required to do it is staggering.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p><small> Thanks to Leah Hanson, Anonymous, Marek Majkowski, Nat Welch, Joe Wilder, and Julia Hansbrough for providing comments on a draft of this. Anonymous, if you prefer to not be anonymous, send me a message on Zulip. For anyone keeping score, that's three folks from Google, one person from Cloudflare, and one anonymous commenter. I'm always open to comments/criticism, but I'd be especially interested in comments from folks who work at companies with less scale. Do my impressions generalize?</p> <p>Thanks to gwern and Dan Reif for taking me up on this and finding some bugs in this post. 
</small></p> Slashdot and Sourceforge slashdot-sourceforge/ Sun, 31 May 2015 00:00:00 +0000 slashdot-sourceforge/ <p></p> <p>If you've followed any tech news aggregator in the past week (the week of the 24th of May, 2015), you've probably seen the story about how <a href="https://plus.google.com/+gimp/posts/cxhB1PScFpe">SourceForge</a> is taking over admin accounts for existing projects and injecting adware in installers for packages like GIMP. For anyone not following the story, SourceForge has a long history of adware laden installers, but they used to be opt-in. It appears that the process is now mandatory for many projects.</p> <p>People have been wary of SourceForge ever since they added a feature to allow projects to opt-in to adware bundling, but you could at least claim that projects are doing it by choice. But now that SourceForge is clearly being malicious, they've wiped out all of the user trust that was built up over sixteen years of operating. No clueful person is going to ever download something from SourceForge again. If search engines start penalizing SourceForge for distributing adware, they won't even get traffic from people who haven't seen this story, wiping out basically all of their value.</p> <p>Whenever I hear about a story like this, I'm amazed at how quickly it's possible to destroy user trust, and how much easier it is to destroy a brand than to create one. In that vein, it's funny to see Slashdot (which is owned by the same company as SourceForge) also attempting to destroy their own brand. They're the only major tech news aggregator which hasn't had a story on this, and that's because they've <a href="http://www.reddit.com/r/programming/comments/37xbzt/goodbye_sourceforge/crqpnzo">buried every story</a> that someone submits. This has prompted people to start submitting comments about this on other stories.</p> <p><img src="images/slashdot-sourceforge/slashdot_sourceforge.png" alt="A highly upvoted comment about SourceForge on Slashdot" width="733" height="475"></p> <p>I find this to be pretty incredible. How is it possible that someone, somewhere, thinks that censoring SourceForge's adware bundling on Slashdot is a net positive for Slashdot Media, the holding company that owns Slashdot and SourceForge? A quick search on either Google or Google News shows that the story has already made it to a number of major tech publications, making the value of suppressing the story nearly zero in the best case. And in the worst case, this censorship will create another <a href="http://yro.slashdot.org/story/06/04/20/1538256/growing-censorship-concerns-at-digg">Digg moment</a><sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>, where readers stop trusting the moderators and move on to sites that aren't as heavily censored. There's basically no upside here and a substantial downside risk.</p> <p>I can see why DHI, the holding company that owns Slashdot Media, would want to do something. Their last earnings report indicated that Slashdot Media isn't doing well, and the last thing they need is bad publicity driving people away from Slashdot:</p> <blockquote> <p>Corporate &amp; Other segment revenues decreased 6% to $4.5 million for the quarter ended March 31, 2015, reflecting a decline in certain revenue streams at Slashdot Media.</p> </blockquote> <p>Compare that to their post-acquisition revenue from Q4 2012, which is the first quarter after DHI purchased Slashdot Media:</p> <blockquote> <p>Revenues totaled $52.7 . . . 
including $4.7 million from the Slashdot Media acquisition</p> </blockquote> <p>“Corporate &amp; Other” seems to encompass more than just Slashdot Media. And despite that, as well as milking SourceForge for all of the short-term revenue they can get, all of “Corporate &amp; Other” is doing worse than Slashdot Media alone in 2012<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. Their original stated plan for SourceForge and Slashdot was <a href="http://www.theverge.com/2012/9/18/3351970/dice-holdings-geeknet-slashdot-careers-news">&quot;to keep them pretty much the same as they are [because we] are very sensitive to not disrupting how users use them . . .&quot;</a>, but it didn't take long for them to realize that wasn't working; here's a snippet from their 2013 earnings report:</p> <blockquote> <p>advertising revenue has declined over the past year and there is no improvement expected in the future financial performance of Slashdot Media's underlying advertising business. Therefore, $7.2 million of intangible assets and $6.3 million of goodwill related to Slashdot Media were reduced to zero.</p> </blockquote> <p>I believe it was shortly afterwards that SourceForge started experimenting with adware/malware bundlers for projects that opted in, which somehow led us to where we are today.</p> <p>I can understand the desire to do something to help Slashdot Media, but it's hard to see how permanently damaging Slashdot's reputation is going to help. As far as I can tell, they've fallen back to this classic syllogism: “We must do something. This is something. We must do this.”</p> <p><em>Update:</em> The Sourceforge/GIMP story is now on Slashdot, the week after it appeared everywhere else and a day after this was written, with a note about how the editor just got back from the weekend to people &quot;freaking out that we're 'burying' this story&quot;, playing things down to make it sound like this would have been posted if it weren't the weekend. That's not a very convincing excuse when tens of stories were posted by various editors, including the one who ended up making the Sourceforge/GIMP post, since the Sourceforge/GIMP story broke last Wednesday. The &quot;weekend&quot; excuse seems especially flimsy since, when the Sourceforge/nmap story broke on the next Wednesday and Slashdot was under strict scrutiny for the previous delay, they were able to publish that story almost immediately on the same day, despite it having been the start of the &quot;weekend&quot; the last time a story broke on a Wednesday. Moreover, the Slashdot story is <a href="http://www.reddit.com/r/programming/comments/386c75/sourceforge_locked_in_projects_of_fleeing_users/crstuof">very careful</a> to use terms like &quot;modified binary&quot; and &quot;present third party offers&quot; instead of &quot;malware&quot; or &quot;adware&quot;.</p> <p>Of course this could all just be an innocent misunderstanding, and I doubt we'll ever have enough information to know for sure either way. But Slashdot's posted excuse certainly isn't very confidence inspiring.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Ironically, if you follow the link, you'll see that Slashdot's founder, CmdrTaco, is against “content getting removed for being critical of sponsors”. It's not that Slashdot wasn't biased back then; Slashdot used to be notorious for their pro-Linux, pro-open source, anti-MS, anti-commercial bias.
If you read through the comments in that link, you'll see that a lot of people lost their voting abilities after upvoting a viewpoint that runs against Slashdot's inherent bias. But it's Slashdot's bias that makes the omission of this story so remarkable. This is exactly the kind of thing Slashdot readers and moderators normally make hay about. But CmdrTaco has been gone for years, as has the old Slashdot. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">If you want to compare YoY results, Slashdot Media pulled in $4M in Q1 2013. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> The googlebot monopoly googlebot-monopoly/ Wed, 27 May 2015 00:00:00 +0000 googlebot-monopoly/ <p>TIL that Bell Labs and a whole lot of other websites block archive.org, not to mention most search engines. Turns out I have <a href="https://github.com/danluu/debugging-stories/issues/3">a broken website link</a> in a GitHub repo, caused by the deletion of an old webpage. When I tried to pull the original from archive.org, I found that it's not available because Bell Labs blocks the archive.org crawler in their robots.txt:</p> <p></p> <pre><code>User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /RealAudio/
Disallow: /bl-traces/
Disallow: /fast-os/
Disallow: /hidden/
Disallow: /historic/
Disallow: /incoming/
Disallow: /inferno/
Disallow: /magic/
Disallow: /netlib.depend/
Disallow: /netlib/
Disallow: /p9trace/
Disallow: /plan9/sources/
Disallow: /sources/
Disallow: /tmp/
Disallow: /tripwire/
Visit-time: 0700-1200
Request-rate: 1/5
Crawl-delay: 5

User-agent: *
Disallow: /
</code></pre> <p>In fact, Bell Labs not only blocks the Internet Archive bot, it blocks all bots except for Googlebot, msnbot, and their own corporate bot. And msnbot was superseded by bingbot <a href="http://en.wikipedia.org/wiki/Msnbot">five years ago</a>!</p> <p>A quick search using a term that's only found at Bell Labs<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>, e.g., “This is a start at making available some of the material from the Tenth Edition Research Unix manual.”, reveals that bing indexes the page; either bingbot follows some msnbot rules, or msnbot still runs independently and indexes sites like Bell Labs, which ban bingbot but not msnbot. Luckily, in this case, a lot of search engines (like Yahoo and DDG) use Bing results, so Bell Labs hasn't disappeared from the non-Google internet, but you're out of luck <a href="http://en.wikipedia.org/wiki/Yandex#cite_note-13">if you're one of the 55% of Russians who use Yandex</a>.</p> <p>And all that is a relatively good case, where one non-Google crawler is allowed to operate. It's not uncommon to see robots.txt files that ban everything but Googlebot. Running <a href="http://blog.nullspace.io/building-search-engines.html">a competing search engine</a> and preventing a Google monopoly is hard enough without having sites ban non-Google bots. We don't need to make it even harder, nor do we need to accidentally<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> ban the Internet Archive bot.</p>
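<p>If you're curious what a robots.txt like the one above actually allows, you can check it mechanically with Python's standard library. Below is a minimal sketch: the rules are an abridged copy of the Bell Labs file above, and the user agent strings for the non-Google crawlers are my assumptions about how those bots commonly identify themselves, not something taken from the file or from the crawlers' documentation.</p> <pre><code>import urllib.robotparser

# An abridged version of the Bell Labs rules quoted above,
# pasted in as a list of lines for illustration.
rules = """
User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /tmp/
Disallow: /plan9/sources/

User-agent: *
Disallow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# "ia_archiver" and "DuckDuckBot" are assumed names for the Internet
# Archive's and DuckDuckGo's crawlers; swap in whatever strings the
# crawlers you care about actually send.
for bot in ["Googlebot", "msnbot", "bingbot", "ia_archiver", "DuckDuckBot"]:
    print(bot, parser.can_fetch(bot, "/plan9/"))
</code></pre> <p>Running the equivalent of this against your own robots.txt is a cheap way to notice when a rule written to tame one crawler has quietly banned everyone else.</p>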
<p><small> P.S. While you're checking that your robots.txt doesn't ban everyone but Google, consider looking at your CPUID checks to make sure that you're using feature flags <a href="https://twitter.com/danluu/status/1203450367515615233">instead of banning everyone but Intel and AMD</a>.</p> <p>BTW, I do think there can be legitimate reasons to block crawlers, including archive.org, but I don't think that the common default many web devs have, of blocking everything but googlebot, is really intended to block competing search engines as well as archive.org. </small></p> <p>2021 Update: since this post was first published, archive.org started ignoring robots.txt and now archives pages even on sites that block it. I've heard that some competing search engines do the same thing, so this misuse of robots.txt, where sites ban everything but googlebot, is slowly making robots.txt effectively useless, much like browsers identify themselves as every browser in user-agent strings to work around sites that incorrectly block browsers they don't think are compatible.</p> <p>A related thing is that sites will sometimes ban competing search engines, like Bing, in a fit of pique, which they wouldn't do to Google since Google provides too much traffic for them to be able to get away with that, e.g., <a href="https://twitter.com/danluu/status/981992814824378369">Discourse banned Bing because they were upset that Bing was crawling Discourse at 0.46 QPS</a>.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">At least until this page gets indexed. Google has a turnaround time of minutes to hours on updates to this page, which I find pretty amazing. I actually find that more impressive than seeing stuff on CNN reflected in seconds to minutes. Of course search engines are going to want to update CNN in real time. But a blog like mine? If they're crawling a niche site like mine every hour, they must also be crawling millions or tens of millions of other sites on an hourly basis and updating their index appropriately. Either that or they pull updates off of RSS, but even that requires millions or tens of millions of index updates per hour for sites with my level of traffic. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:2">I don't object, in principle, to a robots.txt that prevents archive.org from archiving sites -- although the common opinion among programmers seems to be that it's a sin to block archive.org, I believe it's fine to do that if you don't want old versions of your site floating around. But it should be an informed decision, not an accident. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> A defense of boring languages boring-languages/ Mon, 25 May 2015 00:00:00 +0000 boring-languages/ <p>Boring languages are underrated. Many appear to be rated quite highly, at least if you look at market share. But even so, they're underrated. Despite the popularity of Dan McKinley's <a href="https://mcfunley.com/choose-boring-technology">&quot;choose boring technology&quot;</a> essay, boring languages are widely panned.
People who use them are too (e.g., they're a target of essays by Paul Graham and Joel Spolsky, <a href="https://news.ycombinator.com/item?id=2379259">and other people have picked up a similar attitude</a>).</p> <p>A commonly used pitch for interesting languages goes something like &quot;Sure, you can get by with writing blub for boring work, which almost all programmers do, but if you did interesting work, then you'd want to use an interesting language&quot;. My feeling is that this has it backwards. When I'm doing boring work that's basically bottlenecked on the speed at which I can write boilerplate, it feels much nicer to use an interesting language (like F#), which lets me cut down on the amount of time spent writing boilerplate. But when I'm doing interesting work, the boilerplate is a rounding error and I don't mind using a boring language like Java, even if that means a huge fraction of the code I'm writing is boilerplate.</p> <p>Another common pitch, similar to the above, is that learning interesting languages will teach you new ways to think that will make you a much more effective programmer<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>. I can't speak for anyone else, but I found that line of reasoning compelling when I was early in my career and learned ACL2 (a Lisp), Forth, F#, etc.; enough of it stuck that I still love F#. But, despite taking the advice that &quot;learning a wide variety of languages that support different programming paradigms will change how you think&quot; seriously, my experience has been that the things I've learned mostly let me crank through boilerplate more efficiently. While that's pretty great when I have a boilerplate-constrained problem, when I have a hard problem, I spend so little time on that kind of stuff that the skills I learned from writing a wide variety of languages don't really help me; instead, what helps me is having domain knowledge that gives me a good lever with which I can solve the hard problem. This explains something I'd wondered about when I finished grad school and arrived in the real world: why is it that the programmers who build the systems I find most impressive typically have deep domain knowledge rather than interesting language knowledge?</p> <p>Another perspective on this is Willie Sutton's response when asked why he robbed banks, &quot;because that's where the money is&quot;. Why do I work in boring languages? Because that's what the people I want to work with use, and what the systems I want to work on are written in. The vast majority of the systems I'm interested in are written in boring languages. Although that technically doesn't imply that the vast majority of people I want to work with primarily use and have their language expertise in boring languages, that also turns out to be the case in practice. That means that, for greenfield work, it's also likely that the best choice will be a boring language. I think F# is great, but I wouldn't choose it over working with the people I want to work with on the problems that I want to work on.</p> <p>If I look at the list of things I'm personally impressed with (things like Spanner, BigTable, Colossus, etc.), it's basically all C++, with almost all of the knockoffs in Java. When I think for a minute, the list of software written in C, C++, and Java is really pretty long.
Among the transitive closure of things I use and the libraries and infrastructure used by those things, those three languages are ahead by a country mile, with PHP, Ruby, and Python rounding out the top 6. Javascript should be in there somewhere if I throw in front-end stuff, but it's so ubiquitous that making a list seems a bit pointless.</p> <p>Below are some lists of software written in boring languages. These lists are long enough that I’m going to break them down into some arbitrary sublists. As is often the case, these aren’t really nice orthogonal categories and should be tags, but here we are. In the lists below, apps are categorized under “Backend” based on the main language used on the backend of a webapp. The other categories are pretty straightforward, even if their definitions a bit idiosyncratic and perhaps overly broad.</p> <h2 id="c">C</h2> <h3 id="operating-systems">Operating Systems</h3> <p>Linux, including variants like KindleOS<br> BSD<br> Darwin (with C++)<br> <a href="http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs">Plan 9</a><br> Windows (kernel in C, with some C++ elsewhere)</p> <h3 id="platforms-infrastructure">Platforms/Infrastructure</h3> <p><a href="http://en.wikipedia.org/wiki/Memcached">Memcached</a><br> <a href="https://www.sqlite.org/index.html">SQLite</a><br> <a href="http://en.wikipedia.org/wiki/Nginx">nginx</a><br> <a href="http://en.wikipedia.org/wiki/Apache_HTTP_Server">Apache</a><br> <a href="http://en.wikipedia.org/wiki/IBM_DB2">DB2</a><br> <a href="http://en.wikipedia.org/wiki/PostgreSQL">PostgreSQL</a><br> <a href="https://aphyr.com/posts/307-call-me-maybe-redis-redux">Redis</a><br> <a href="http://en.wikipedia.org/wiki/Varnish_%28software%29">Varnish</a><br> <a href="http://en.wikipedia.org/wiki/HAProxy">HAProxy</a> AWS Lambda workers (with most of the surrounding infrastructure written in Java), according to @jayachdee</p> <h3 id="desktop-apps">Desktop Apps</h3> <p>git<br> Gimp (with perl)<br> VLC<br> Qemu<br> OpenGL<br> <a href="http://en.wikipedia.org/wiki/FFmpeg">FFmpeg</a><br> Most GNU userland tools<br> Most BSD userland tools<br> <a href="http://lcamtuf.coredump.cx/afl/">AFL</a><br> Emacs<br> Vim</p> <h2 id="c-1">C++</h2> <h3 id="operating-systems-1">Operating Systems</h3> <p>BeOS/<a href="http://en.wikipedia.org/wiki/Haiku_%28operating_system%29">Haiku</a></p> <h3 id="platforms-infrastructure-1">Platforms/Infrastructure</h3> <p><a href="http://research.google.com/archive/gfs.html">GFS</a><br> <a href="http://www.wired.com/2012/07/google-colossus/">Colossus</a><br> <a href="http://en.wikipedia.org/wiki/Ceph_%28software%29">Ceph</a><br> <a href="http://research.google.com/pubs/pub36632.html">Dremel</a><br> <a href="http://research.google.com/archive/chubby.html">Chubby</a><br> <a href="http://research.google.com/archive/bigtable.html">BigTable</a><br> <a href="http://research.google.com/archive/spanner.html">Spanner</a><br> <a href="http://en.wikipedia.org/wiki/MySQL">MySQL</a><br> <a href="http://en.wikipedia.org/wiki/%C3%98MQ">ZeroMQ</a><br> <a href="https://github.com/scylladb/scylla">ScyllaDB</a><br> <a href="https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads">MongoDB</a><br> <a href="http://en.wikipedia.org/wiki/Apache_Mesos">Mesos</a><br> <a href="http://en.wikipedia.org/wiki/Java_virtual_machine">JVM</a><br> <a href="http://en.wikipedia.org/wiki/.NET_Framework">.NET</a></p> <h3 id="backend-apps">Backend Apps</h3> <p>Google Search<br> PayPal<br> Figma (<a 
href="https://www.figma.com/blog/building-a-professional-design-tool-on-the-web/">front-end written in C++ and cross-compiled to JS</a>)</p> <h3 id="desktop-apps-1">Desktop Apps</h3> <p>Chrome<br> MS Office<br> LibreOffice (with Java)<br> Evernote (originally in C#, converted to C++)<br> Firefox<br> Opera<br> Visual Studio (with C#)<br> Photoshop, Illustrator, InDesign, etc.<br> gcc<br> llvm/clang<br> Winamp<br> <a href="https://github.com/z3prover/z3/wiki">Z3</a><br> Most AAA games<br> Most pro audio and video production apps</p> <h3 id="elsewhere">Elsewhere</h3> <p>Also see <a href="http://www.stroustrup.com/applications.html">this list</a> and <a href="https://isocpp.org/wiki/faq/big-picture#who-uses-cpp">some of the links here</a>.</p> <h2 id="java">Java</h2> <h3 id="platforms-infrastructure-2">Platforms/Infrastructure</h3> <p><a href="http://en.wikipedia.org/wiki/Apache_Hadoop">Hadoop</a><br> <a href="http://www.aosabook.org/en/hdfs.html">HDFS</a><br> <a href="http://en.wikipedia.org/wiki/Apache_ZooKeeper">Zookeeper</a><br> <a href="http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html">Presto</a><br> <a href="https://aphyr.com/posts/294-call-me-maybe-cassandra/">Cassandra</a><br> <a href="https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0">Elasticsearch</a><br> <a href="https://en.wikipedia.org/wiki/Lucene">Lucene</a><br> <a href="http://en.wikipedia.org/wiki/Apache_Tomcat">Tomcat</a><br> <a href="http://en.wikipedia.org/wiki/Jetty_(web_server)">Jetty</a></p> <h3 id="backend-apps-1">Backend Apps</h3> <p>Gmail<br> LinkedIn<br> <a href="http://www.theregister.co.uk/2007/11/12/ebay_glitches/">Ebay</a><br> <a href="http://silvaetechnologies.eu/blg/50/the-majority-of-netflix-services-are-built-on-java">Most of Netflix</a><br> A large fraction of Amazon services</p> <h3 id="desktop-apps-2">Desktop Apps</h3> <p><a href="http://en.wikipedia.org/wiki/IBM_VisualAge">Eclipse</a><br> JetBrains IDEs<br> SmartGit<br> <a href="http://en.wikipedia.org/wiki/Minecraft">Minecraft</a></p> <h1 id="vhdl-verilog">VHDL/Verilog</h1> <p>I'm not even going to make a list because basically every major microprocessor, NIC, switch, etc. is made in either VHDL or Verilog. For existing projects, you might say that this is because you have a large team that's familiar with some boring language, but I've worked on greenfield hardware/software co-design for deep learning and networking virtualization, both with teams that are hired from scratch for the project, and we still used Verilog, despite one of the teams having one of the larger collections of bluespec proficient hardware engineers anywhere outside of Arvind's group at MIT.</p> <p><a href="https://twitter.com/danluu">Please suggest</a> other software that you think belongs on this list; it doesn't have to be software that I personally use. Also, does anyone know what EC2, S3, and Redshift are written in? I suspect C++, but I couldn't find a solid citation for that. This post was last updated 2021-08.</p> <h2 id="appendix-meta">Appendix: meta</h2> <p>One thing I find interesting is that, in personal conversations with people, the vast majority of experienced developers I know think that most mainstream languages are basically fine, modulo performance constraints, and this is even more true among people who've built systems that are really impressive to me. Online discussion of what someone might want to learn is very different, with learning interesting/fancy languages being generally high up on people's lists. 
When I talk to new programmers, they're often pretty influenced by this (e.g., at Recurse Center, before ML became trendy, learning fancy languages was the most popular way people tried to become better as a programmer, and I'd say that's now #2 behind ML). While I think learning a fancy language does work for some people, I'd say that's overrated in that there are many other techniques that seem to click with at least the same proportion of people who try it that are much less popular.</p> <p>A question I have is, why is online discussion about this topic so one-sided while the discussions I've had in real life are so oppositely one-sided. Of course, neither people who are loud on the internet nor people I personally know are representative samples of programmers, but I still find it interesting.</p> <p><small> Thanks to Leah Hanson, James Porter, Waldemar Q, Nat Welch, Arjun Sreedharan, Rafa Escalante, @matt_dz, Bartlomiej Filipek, Josiah Irwin, @jayachdee, Larry Ogrondek, Miodrag Milic, Presto, Matt Godbolt, Leah Hanson, Noah Haasis, Lifan Zeng, @chozu@fedi.absturztau.be, and Josiah Irwin for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">a variant of this argument goes beyond teaching you techniques and says that the languages you know determine what you think via the Sapir-Whorf hypothesis. I don't personally find this compelling since, when I'm solving hard problems, I don't think about things in a programming language. YMMV if you think in a programming language, but I think of an abstract solution and then translate the solution to a language, so having another language in my toolbox can, at most, help me think of better translations and save on translation. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> Advantages of monorepos monorepo/ Sun, 17 May 2015 00:00:00 +0000 monorepo/ <p>Here's a conversation I keep having:</p> <blockquote> <p><strong>Someone</strong>: Did you hear that Facebook/Google uses a giant monorepo? WTF!<br> <strong>Me</strong>: Yeah! It's really convenient, don't you think?<br> <strong>Someone</strong>: That's THE MOST RIDICULOUS THING I've ever heard. Don't FB and Google know what a terrible idea it is to put all your code in a single repo?<br> <strong>Me</strong>: I think engineers at FB and Google are probably familiar with using smaller repos (doesn't <a href="http://en.wikipedia.org/wiki/Junio_Hamano">Junio Hamano</a> work at Google?), and they still prefer a single huge repo for [reasons].<br> <strong>Someone</strong>: Oh that does sound pretty nice. I still think it's weird but I could see why someone would want that.<br></p> </blockquote> <p>“[reasons]” is pretty long, so I'm writing this down in order to avoid repeating the same conversation over and over again.</p> <p></p> <h3 id="simplified-organization">Simplified organization</h3> <p>With multiple repos, you typically either have one project per repo, or an umbrella of related projects per repo, but that forces you to define what a “project” is for your particular team or company, and it sometimes forces you to split and merge repos for reasons that are pure overhead. For example, having to split a project because it's too big or has too much history for your VCS is not optimal.</p> <p>With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way. 
Using a single repo also reduces overhead from managing dependencies.</p> <p>A side effect of the simplified organization is that it's easier to navigate projects. The monorepos I've used let you essentially navigate as if everything is on a networked file system, re-using the idiom that's used to navigate within projects. Multi repo setups usually have two separate levels of navigation -- the filesystem idiom that's used inside projects, and then a meta-level for navigating between projects.</p> <p>A side effect of that side effect is that, with monorepos, it's often the case that it's very easy to get a dev environment set up to run builds and tests. If you expect to be able to navigate between projects with the equivalent of <code>cd</code>, you also expect to be able to do <code>cd; make</code>. Since it seems weird for that to not work, it usually works, and whatever tooling effort is necessary to make it work gets done<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>. While it's technically possible to get that kind of ease in multiple repos, it's not as natural, which means that the necessary work isn't done as often.</p> <h3 id="simplified-dependencies">Simplified dependencies</h3> <p>This probably goes without saying, but with multiple repos, you need to have some way of specifying and versioning dependencies between them. That sounds like it ought to be straightforward, but in practice, most solutions are cumbersome and involve a lot of overhead.</p> <p>With a monorepo, it's easy to have one universal version number for all projects. Since atomic cross-project commits are possible (though these tend to split into many parts for practical reasons at large companies), the repository can always be in a consistent state -- at commit #X, all project builds should work. Dependencies still need to be specified in the build system, but whether that's a make Makefiles or bazel BUILD files, those can be checked into version control like everything else. And since there's just one version number, the Makefiles or BUILD files or whatever you choose don't need to specify version numbers.</p> <h3 id="tooling">Tooling</h3> <p>The simplification of navigation and dependencies makes it much easier to write tools. Instead of having tools that must understand relationships between repositories, as well as the nature of files within repositories, tools basically just need to be able to read files (including some file format that specifies dependencies between units within the repo).</p> <p>This sounds like a trivial thing but, <a href="https://github.com/chrisvana/repobuild/wiki/Motivation">take this example by Christopher Van Arsdale</a> on how easy builds can become:</p> <blockquote> <p><a href="http://bazel.io/">The build system inside of Google</a> makes it incredibly easy to build software using large modular blocks of code. You want a crawler? Add a few lines here. You need an RSS parser? Add a few more lines. A large distributed, fault tolerant datastore? Sure, add a few more lines. These are building blocks and services that are shared by many projects, and easy to integrate. … This sort of Lego-like development process does not happen as cleanly in the open source world. … As a result of this state of affairs (more speculation), there is a complexity barrier in open source that has not changed significantly in the last few years. 
This creates a gap between what is easily obtainable at a company like Google versus a[n] open sourced project.</p> </blockquote> <p>The system that Van Arsdale is referring to is so convenient that, before it was open sourced, ex-Google engineers at <a href="https://facebook.github.io/buck/">Facebook</a> and <a href="https://pantsbuild.github.io/">Twitter</a> wrote their own versions of bazel in order to get the same benefits.</p> <p>It's theoretically possible to create a build system that makes building anything, with any dependencies, simple without having a monorepo, but it's more effort, enough effort that I've never seen a system that does it seamlessly. Maven and sbt are pretty nice, in a way, but it's not uncommon to lose a lot of time tracking down and fixing version dependency issues. Systems like rbenv and virtualenv try to sidestep the problem, but they result in a proliferation of development environments. Using a monorepo where HEAD always points to a consistent and valid version removes the problem of tracking multiple repo versions entirely<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">2</a></sup>.</p> <p>Build systems aren't the only thing that benefits from running on a monorepo. Just for example, static analysis can run across project boundaries without any extra work. Many other things, like cross-project integration testing and <a href="https://github.com/google/codesearch">code search</a>, are also greatly simplified.</p> <h3 id="cross-project-changes">Cross-project changes</h3> <p>With lots of repos, making cross-repo changes is painful. It typically involves tedious manual coordination across each repo or hack-y scripts. And even if the scripts work, there's the overhead of correctly updating cross-repo version dependencies. Refactoring an API that's used across tens of active internal projects will probably take a good chunk of a day. Refactoring an API that's used across thousands of active internal projects is hopeless.</p> <p>With a monorepo, you just <a href="http://research.google.com/pubs/pub41342.html">refactor the API and all of its callers</a> in one commit. That's not always trivial, but it's much easier than it would be with lots of small repos. I've seen APIs with thousands of usages across hundreds of projects get refactored and with a monorepo setup it's so easy that no one even thinks twice.</p> <p>Most people now consider it absurd to use a version control system like CVS, RCS, or ClearCase, where it's impossible to do a single atomic commit across multiple files, forcing people to either manually look at timestamps and commit messages or keep meta information around to determine if some particular set of cross-file changes are “really” atomic. SVN, hg, git, etc. solve the problem of atomic cross-file changes; monorepos solve the same problem across projects.</p> <p>This isn't just useful for large-scale API refactorings. David Turner, who worked on Twitter's migration from many repos to a monorepo, gives this example of a small cross-cutting change and the overhead of having to do releases for those:</p> <blockquote> <p>I needed to update [Project A], but to do that, I needed my colleague to fix one of its dependencies, [Project B]. The colleague, in turn, needed to fix [Project C]. If I had had to wait for C to do a release, and then B, before I could fix and deploy A, I might still be waiting.
But since everything's in one repo, my colleague could make his change and commit, and then I could immediately make my change.</p> <p>I guess I could do that if everything were linked by git versions, but my colleague would still have had to do two commits. And there's always the temptation to just pick a version and &quot;stabilize&quot; (meaning, stagnate). That's fine if you just have one project, but when you have a web of projects with interdependencies, it's not so good.</p> <p>[In the other direction,] Forcing <em>dependees</em> to update is actually another benefit of a monorepo.</p> </blockquote> <p>It's not just that making cross-project changes is easier, tracking them is easier, too. To do the equivalent of <code>git bisect</code> across multiple repos, you must be disciplined about using another tool to track meta information, and most projects simply don't do that. Even if they do, you now have two really different tools where one would have sufficed.</p> <p>Ironically, there's a sense in which this benefit decreases as the company gets larger. At Twitter, which isn't exactly small, David Turner got a lot of value out of being able to ship cross-project changes. But at a Google-sized company, large commits can be large enough that it makes sense to split them into many smaller commits for a variety of reasons, which necessitates tooling that can effectively split up large conceptually atomic changes into many non-atomic commits.</p> <h3 id="mercurial-and-git-are-awesome-it-s-true">Mercurial and git are awesome; it's true</h3> <p>The most common response I've gotten to these points is that switching to either git or hg from either CVS or SVN is a huge productivity win. That's true. But a lot of that is because git and hg are superior in multiple respects (e.g., better merging), not because having small repos is better per se.</p> <p>In fact, Twitter has been patching git and <a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/">Facebook has been patching Mercurial</a> in order to support giant monorepos.</p> <h3 id="downsides">Downsides</h3> <p>Of course, there are downsides to using a monorepo. I'm not going to discuss them because the downsides are already widely discussed. Monorepos aren't strictly superior to manyrepos. They're not strictly worse, either. My point isn't that you should definitely switch to a monorepo; it's merely that using a monorepo isn't totally unreasonable, that folks at places like Google, Facebook, Twitter, Digital Ocean, and Etsy might have good reasons for preferring a monorepo over hundreds or thousands or tens of thousands of smaller repos.</p> <h3 id="other-discussion">Other discussion</h3> <p><a href="http://gregoryszorc.com/blog/2014/09/09/on-monolithic-repositories/">Gregory</a> <a href="http://gregoryszorc.com/blog/2015/02/17/lost-productivity-due-to-non-unified-repositories/">Szorc</a>. <a href="https://developers.facebooklive.com/videos/561/big-code-developer-infrastructure-at-facebook-s-scale">Face</a><a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/">book</a>. <a href="http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/">Benjamin Pollack</a> (one of the co-creators of Kiln). <a href="https://qafoo.com/resources/presentations/froscon_2015/monorepos.html">Benjamin Eberlei</a>. <a href="http://blog.rocketpoweredjetpants.com/2015/04/monorepo-one-source-code-repository-to.html">Simon Stewart</a>. 
<a href="https://www.digitalocean.com/company/blog/taming-your-go-dependencies/">Digital Ocean</a>. <a href="http://www.infoq.com/presentations/Development-at-Google">Goo</a><a href="https://www.youtube.com/watch?v=W71BTkUbdqE">gle</a>. <a href="http://git-merge.com/videos/scaling-git-at-twitter-wilhelm-bierbaum.html">Twitter</a>. <a href="http://www.reddit.com/r/programming/comments/1unehr/scaling_mercurial_at_facebook/cek9nkq">thedufer</a>. <a href="http://paulhammant.com/categories.html#trunk_based_development">Paul Hammant</a>.</p> <p><small> Thanks to Kamal Marhubi, David Turner, Leah Hanson, Mindy Preston, Chris Ball, Daniel Espeset, Joe Wilder, Nicolas Grilly, Giovanni Gherdovich, Paul Hammant, Juho Snellman, and Simon Thulbourn for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">This was even true at a hardware company I worked at which created a monorepo by versioning things in RCS over NFS. Of course you can't let people live edit files in the central repository so someone wrote a number of scripts that basically turned this into perforce. I don't recommend this system, but even with an incredibly hacktastic monorepo, you still get a lot of the upsides of a monorepo. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:V">At least as long as you have some mechanism for <a href="https://news.ycombinator.com/item?id=10604168">vendoring upstream dependencies</a>. While this works great for Google because Google writes a large fraction of the code it relies on, and has enough employees that tossing all external dependencies into the monorepo has a low cost amortized across all employees, I could imagine this advantage being too expensive to take advantage of for smaller companies. <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> We used to build steel mills near cheap power. Now that's where we build datacenters datacenter-power/ Mon, 04 May 2015 00:00:00 +0000 datacenter-power/ <p>Why are people so concerned with hardware power consumption nowadays? Some common answers to this question are that <a href="http://www.vox.com/2015/3/9/8178213/apple-macbook-all-batteries">power is critically important for phones, tablets, and laptops</a> and that <a href="ftp://ftp.cs.utexas.edu/pub/dburger/papers/ISCA11.pdf">we can put more silicon on a modern chip than we can effectively use</a>. In 2001 <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=912412&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D912412">Patrick Gelsinger observed that if scaling continued at then-current rates</a>, chips would have the power density of a nuclear reactor by 2005, a rocket nozzle by 2010, and the surface of the sun by 2015, implying that power density couldn't continue on its then-current path. Although this was already fairly obvious at the time, now that it's 2015, we can be extra sure that power density didn't continue to grow at unbounded rates. Anyway, the importance of portables and scaling limits are both valid and important reasons, but since they're widely discussed, I'm going to talk about an underrated reason.</p> <p>People often focus on the portable market because it's cannibalizing desktop market, but that's not the only growth market -- servers are also becoming more important than desktops, and power is really important for servers. 
To see why power is important for servers, let's look at some calculations about what it costs to run a datacenter from <a href="http://www.amazon.com/gp/product/012383872X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=012383872X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=Y6Z2OBCUCR72ALEB">Hennessy &amp; Patterson</a>.</p> <p></p> <p>One of the issues is that you pay for power multiple times. Some power is lost at the substation, although we might not have to pay for that directly. Then we lose more storing energy in a UPS. The figure below shows a 6% loss there, but smaller-scale datacenters can easily lose twice that. After that, we lose more power stepping down the power to a voltage that a server can accept. That's over a 10% loss for a setup that's pretty efficient.</p> <p>After that, we lose more power in the server's power supply, stepping down the voltage to levels that are useful inside a computer, which is often about another 10% loss (not pictured in the figure below).</p> <p><img src="images/datacenter-power/power_conversion.png" alt="Power conversion figure from Hennessy & Patterson, which reproduced the figure from Hamilton" width="640" height="383"></p> <p>And then once we get the power into servers, it gets turned into waste heat. To keep the servers from melting, we have to pay for power to cool them. <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006">Barroso and Holzle</a> estimated that 30%-50% of the power drawn by a datacenter is used for chillers, and that an additional 10%-20% is for the <a href="http://www.dchuddle.com/2011/crac-v-crah/">CRAC</a> (air circulation). That means for every watt of power used in the server, we pay for another 1-2 watts of support power.</p> <p>And to actually get all this power, we have to pay for the infrastructure required to get the power into and throughout the datacenter. Hennessy &amp; Patterson estimate that of the $90M cost of an example datacenter (just the facilities -- not the servers), 82% is associated with power and cooling<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">1</a></sup>. The servers in the datacenter are estimated to only cost $70M. It's not fair to compare those numbers directly since servers need to get replaced more often than datacenters; once you take into account the cost over the entire lifetime of the datacenter, the amortized cost of power and cooling comes out to be 33% of the total cost, when servers have a 3 year lifetime and infrastructure has a 10-15 year lifetime.</p> <p>If we look at all the costs, the breakdown is:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}td.l {text-align:left;}</style> <table> <tr> <th>category</th><th>%</th></tr> <tr> <td>server machines</td><td>53</td></tr> <tr> <td>power &amp; cooling infra</td><td>20</td></tr> <tr> <td>power use</td><td>13</td></tr> <tr> <td>networking</td><td>8</td></tr> <tr> <td>other infra</td><td>4</td></tr> <tr> <td>humans</td><td>2</td></tr> </table></p> <p>Power use and people are the cost of operating the datacenter (OPEX), whereas server machines, networking, power &amp; cooling infra, and other infra are capital expenditures that are amortized across the lifetime of the datacenter (CAPEX).</p>
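<p>To make the &quot;pay for power multiple times&quot; point concrete, here's a back-of-the-envelope sketch using the rough figures above. The percentages are the illustrative numbers quoted above, not measurements, and the sketch deliberately ignores details like where in the distribution chain the cooling power is metered:</p> <pre><code># Rough figures from the discussion above; illustrative, not measured.
upstream_loss = 0.11   # UPS plus voltage step-down ("over a 10% loss")
psu_loss = 0.10        # the server's own power supply, roughly another 10%

# Fraction of each purchased watt that actually reaches the server's components.
delivered = (1 - upstream_loss) * (1 - psu_loss)

# Per the Barroso and Holzle estimate quoted above, every watt dissipated in a
# server needs roughly 1-2 extra watts for chillers and air handling.
for support_watts in (1.0, 2.0):
    purchased = 1.0 / delivered + support_watts
    print(f"~{delivered:.0%} of a purchased watt reaches the server; "
          f"with {support_watts:.0f} W of support power, you buy "
          f"~{purchased:.2f} W per watt of useful server power")
</code></pre> <p>With these rough numbers, you end up buying somewhere between two and three watts for every watt that does useful work inside a server, which is part of why shaving watts at the chip level is worth so much at datacenter scale.</p>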
<a href="http://www.datacenterknowledge.com/archives/2010/05/19/microsoft-building-new-data-center-in-quincy/">We used to build steel mills near cheap sources of power</a>, but <a href="http://www.datacenterknowledge.com/archives/2010/05/19/microsoft-building-new-data-center-in-quincy/">now that's where we build datacenters</a>. As companies start considering the full cost of applications, we're seeing a lot more power optimized solutions<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>. Unfortunately, this is really hard. On the software side, with the exceptions of toy microbenchmark examples, <a href="http://arcade.cs.columbia.edu/energy-oopsla14.pdf">best practices for writing power efficient code still aren't well understood</a>. On the hardware side, Intel recently released a new generation of chips with significantly improved performance per watt that doesn't have much better absolute performance than the previous generation. On the hardware accelerator front, some large companies are building dedicated power-efficient hardware for specific computations. But with existing tools, hardware accelerators are costly enough that dedicated hardware only makes sense for the largest companies. There isn't an easy answer to this problem.</p> <p><small> If you liked this post, you'd probably like chapter 6 of <a href="http://www.amazon.com/gp/product/012383872X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=012383872X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=Y6Z2OBCUCR72ALEB">Hennessy &amp; Patterson</a>, which walks through not only the cost of power, but a number of related back of the envelope calculations relating to datacenter performance and cost.</p> <p>Apologies for the quickly scribbled down post. I jotted this down shortly before signing an NDA for an interview where I expected to learn some related information and I wanted to make sure I had my thoughts written down before there was any possibility of being contaminated with information that's under NDA.</p> <p>Thanks to Justin Blank for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:E"><p>Although this figure is widely cited, I'm unsure about the original source. This is probably the most suspicious figure in this entire post. Hennessy &amp; Patterson cite “Hamilton 2010”, which appears to be a reference to <a href="http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_GenomicsCloud20100608.pdf">this presentation</a>. That presentation doesn't make the source of the number obvious, although <a href="http://perspectives.mvdirona.com/2008/11/cost-of-power-in-large-scale-data-centers/">this post by Hamilton does cite a reference for that figure</a>, but the citation points to <a href="http://blogs.msdn.com/b/the_power_of_software/archive/2008/09/19/intense-computing-or-in-tents-computing.aspx">this post</a>, which seems to be about putting datacenters in tents, not the fraction of infrastructure that's dedicated to power and cooling.</p> <p>Some other works, <a href="http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/mar09/Reed.pdf">such as this one</a> cite <a href="http://www.electronics-cooling.com/2007/02/in-the-data-center-power-and-cooling-costs-more-than-the-it-equipment-it-supports/">this article</a>. 
However, that article doesn't directly state 82% anywhere, and it makes a number of estimates that the authors acknowledge are very rough, with qualifiers like “While, admittedly, the authors state that there is a large error band around this equation, it is very useful in capturing the magnitude of infrastructure cost.”</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:1">That being said, power isn't everything -- Reddi et al. looked at replacing conventional chips with low-power chips for a real workload (MS Bing) and found that while they got an improvement in power use per query, tail latency increased significantly, especially when servers were heavily loaded. Since Bing has a mechanism that causes query-related computations to terminate early if latency thresholds are hit, the result is both higher latency and degraded search quality. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Reading citations is easier than most people think dunning-kruger/ Sun, 29 Mar 2015 00:00:00 +0000 dunning-kruger/ <p>It's really common to see claims that some meme is backed by “studies” or “science”. But when I look at the actual studies, it usually turns out that the data are opposed to the claim. Here are the last few instances of this that I've run across. </p> <h2 id="dunning-kruger">Dunning-Kruger</h2> <p>A pop-sci version of Dunning-Kruger, the most common one I see cited, is that, the less someone knows about a subject, the more they think they know. Another pop-sci version is that people who know little about something overestimate their expertise because their lack of knowledge fools them into thinking that they know more than they do. The actual claim Dunning and Kruger make is much weaker than the first pop-sci claim and, IMO, the evidence is weaker than the second claim. The original paper isn't much longer than most of the incorrect pop-sci treatments of the paper, and we can get a pretty good idea of the claims by looking at the four figures included in the paper. In the graphs below, “perceived ability” is a subjective self-rating, and “actual ability” is the result of a test.</p> <p><img src="images/dunning-kruger/dunning_1.png" alt="Dunning-Kruger graph" width="333" height="301"> <img src="images/dunning-kruger/dunning_2.png" alt="Dunning-Kruger graph" width="334" height="302"> <img src="images/dunning-kruger/dunning_3.png" alt="Dunning-Kruger graph" width="335" height="304"> <img src="images/dunning-kruger/dunning_4.png" alt="Dunning-Kruger graph" width="331" height="299"></p> <p>In two of the four cases, there's an obvious positive correlation between perceived skill and actual skill, which is the opposite of the first pop-sci conception of Dunning-Kruger that we discussed. As for the second, we can see that people at the top end also don't rate themselves correctly, so the explanation that Dunning-Kruger's results come from people who don't know much about a subject being fooled by their own ignorance (an easy interpretation to have of the study, given its title, <cite>Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments</cite>) is insufficient, because that doesn't explain why people at the top of the charts have what appears to be, at least under the conditions of the study, a symmetrically incorrect guess about their skill level.
One could argue that there's a completely different effect that just happens to cause the same, roughly linear, slope in perceived ability that people who are &quot;unskilled and unaware of it&quot; have. But, if there's any plausible simpler explanation, then positing a completely different effect seems overly complicated without additional evidence (which, if any exists, is not provided in the paper)<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup>.</p> <p>A plausible explanation of why perceived skill is compressed, especially at the low end, is that few people want to rate themselves as below average or as the absolute best, shrinking the scale but keeping a roughly linear fit. The crossing point of the scales is above the median, indicating that people, on average, overestimate themselves, but that's not surprising given the population tested (more on this later). In the other two cases, the correlation is very close to zero. It could be that the effect is different for different tasks, or it could be just that the sample size is small and that the differences between the different tasks are noise. It could also be that the effect comes from the specific population sampled (students at Cornell, who are probably actually above average in many respects). If you look up Dunning-Kruger on Wikipedia, it claims that a replication of Dunning-Kruger on East Asians shows the opposite result (perceived skill is lower than actual skill, and the greater the skill, the greater the difference), and that the effect is possibly just an artifact of American culture, but the citation is actually a link to an editorial which mentions a meta-analysis on East Asian confidence, so that might be another example of a false citation. Or maybe it's just a link to the wrong source. In any case, the effect certainly isn't that the more people know, the less they think they know.</p>
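<p>The compressed-self-rating explanation is easy to sanity check with a toy simulation. In the sketch below, everyone's self-rating is pulled toward &quot;a bit above average&quot; and only weakly tracks their actual score; the offset, slope, and noise level are made-up numbers chosen for illustration, not values fit to Dunning and Kruger's data:</p> <pre><code>import random
import statistics

random.seed(0)

# Actual ability: percentile score, uniform over the population.
# Perceived ability: pulled toward "a bit above average" (the 60th
# percentile), with a weak dependence on actual ability plus noise.
# The 60, 0.3, and 12 are made-up illustrative parameters.
people = []
for _ in range(10_000):
    actual = random.uniform(0, 100)
    perceived = 60 + 0.3 * (actual - 50) + random.gauss(0, 12)
    perceived = min(100, max(0, perceived))
    people.append((actual, perceived))

# Bin by actual score into quartiles, as in the paper's figures.
people.sort()
quartiles = [people[i * 2500:(i + 1) * 2500] for i in range(4)]
for name, group in zip(["bottom", "2nd", "3rd", "top"], quartiles):
    actual_mean = statistics.mean(a for a, _ in group)
    perceived_mean = statistics.mean(p for _, p in group)
    print(f"{name:6} quartile: actual {actual_mean:5.1f}, "
          f"perceived {perceived_mean:5.1f}")
</code></pre> <p>Binned this way, the bottom quartile appears to overestimate itself and the top quartile appears to underestimate itself, even though every simulated person is applying the exact same mediocre self-assessment rule, i.e., you get the shape of the published graphs without any special failure of self-insight at the bottom.</p>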
<h2 id="income-happiness">Income &amp; Happiness</h2> <p>It's become common knowledge that money doesn't make people happy. As of this writing, a Google search for <code>happiness income</code> returns a knowledge card claiming that making more than $75k/year has no impact on happiness. Other top search results claim the happiness ceiling occurs at $10k/year, $30k/year, $40k/year and $75k/year.</p> <p><img src="images/dunning-kruger/google_happiness.png" alt="Google knowledge card says that $75k should be enough for anyone" width="603" height="479"></p> <p>Not only is that wrong, the wrongness <a href="https://web.archive.org/web/20160610202901/http://www.brookings.edu/~/media/research/files/papers/2013/04/subjective%20well%20being%20income/subjective%20well%20being%20income.pdf">is robust across every country studied, too</a>.</p> <p><img src="images/dunning-kruger/happiness_within_country.png" alt="People with more income are happier" width="629" height="506"></p> <p>That happiness is correlated with income doesn't come from cherry-picking one study. That result holds across five iterations of the World Values Survey (1981-1984, 1989-1993, 1994-1999, 2000-2004, and 2005-2009), three iterations of the Pew Global Attitudes Survey (2002, 2007, 2010), five iterations of the International Social Survey Program (1991, 1998, 2001, 2007, 2008), and a large-scale Gallup survey.</p> <p>The graph above has income on a log scale; if you pick a country and graph the results on a linear scale, you get <a href="https://gravityandlevity.wordpress.com/2013/05/06/how-we-measure-our-happiness/">something like this</a>.</p> <p><img src="images/dunning-kruger/happiness_income_linear.gif" alt="Best fit log to happiness vs. income" width="576" height="335"></p> <p>As with all graphs of a log function, it looks like the graph is about to level off, which results in interpretations like the following:</p> <p><img src="images/dunning-kruger/bad_log_1.png" alt="Distorted log graph" width="291" height="393"></p> <p>That's an actual graph from an article that claims that income doesn't make people happy. These vaguely log-like graphs that level off are really common. If you want to see more of these, try an <a href="https://duckduckgo.com/?q=happiness+income&amp;iax=1&amp;ia=images">image search for “happiness income”</a>. My favorite is the one where people who make enough money literally hit the top of the scale. Apparently, there's a dollar value which not only makes you happy, it makes you as happy as it is possible for humans to be.</p> <p>As with Dunning-Kruger, you can look at the graphs in the papers to see what's going on. It's a little easier to see why people would pass along the wrong story here, since it's easy to misinterpret the data when it's plotted against a linear scale, but it's still pretty easy to see what's going on by taking a peek at the actual studies.</p>
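<p>The &quot;it looks like it's leveling off&quot; illusion is easy to reproduce with a few numbers. In the sketch below, satisfaction is a made-up linear-in-log(income) function (the coefficients are arbitrary); the point is just that each doubling of income buys the same increment no matter how far out you go, even though a plot against a linear income axis looks nearly flat past the first few points:</p> <pre><code>import math

# Hypothetical satisfaction = a + b * log2(income); a and b are made up.
a, b = 2.0, 0.5

def satisfaction(income):
    return a + b * math.log2(income)

prev = None
for income in [10_000, 20_000, 40_000, 80_000, 160_000, 320_000]:
    s = satisfaction(income)
    delta = "" if prev is None else f"  (+{s - prev:.2f} vs. previous)"
    print(f"${income:9,}: {s:.2f}{delta}")
    prev = s
</code></pre> <p>Plot those points against income on a linear axis and the curve looks like it's flattening out; plot them against log(income) and they're a straight line, which is what the within-country charts above show.</p>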
<h2 id="hedonic-adaptation-happiness">Hedonic Adaptation &amp; Happiness</h2> <p>The idea that people bounce back from setbacks (as well as positive events) and return to a fixed level of happiness entered the popular consciousness after <a href="http://www.amazon.com/gp/product/1400077427/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1400077427&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=3ERCGEMU57L2FDF7">Daniel Gilbert wrote about it in a popular book</a>.</p> <p>But even without looking at the literature on adaptation to adverse events, the previous section on wealth should cast some doubt on this. If people rebound from both bad events and good, how is it that making more money causes people to be happier?</p> <p>Turns out, the idea that people adapt to negative events and return to their previous set-point is a myth. Although the exact effects vary depending on the bad event, disability<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>, divorce<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">3</a></sup>, loss of a partner<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">4</a></sup>, and unemployment<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">5</a></sup> all have long-term negative effects on happiness. Unemployment is the one event that can be undone relatively easily, but the effects persist even after people become reemployed. I'm only citing four studies here, but <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3289759/">a meta-analysis of the literature</a> shows that the results are robust across existing studies.</p> <p>The same thing applies to positive events. While it's “common knowledge” that winning the lottery doesn't make people happier, <a href="http://www.nytimes.com/2014/05/27/science/how-to-win-the-lottery-happily.html">it turns out that isn't true, either</a>.</p> <p>In both cases, early cross-sectional results indicated that it's plausible that extreme events, like winning the lottery or becoming disabled, don't have long-term effects on happiness. But the longitudinal studies that follow individuals and measure the happiness of the same person over time as events happen show the opposite result -- events do, in fact, affect happiness. For the most part, these aren't new results (some of the initial results predate Daniel Gilbert's book), but the older results based on less rigorous studies continue to propagate faster than the corrections.</p> <h2 id="chess-position-memorization">Chess position memorization</h2> <p>I frequently see citations claiming that, while experts can memorize chess positions better than non-experts, the advantage completely goes away when positions are randomized. When people refer to a specific citation, it's generally Chase and Simon's 1973 paper Perception in Chess, a &quot;classic&quot; which has been cited a whopping 7449 times in the literature, which says:</p> <blockquote> <p>De Groot did, however, find an intriguing difference between masters and weaker players in his short-term memory experiments. Masters showed a remarkable ability to reconstruct a chess position almost perfectly after viewing it for only 5 sec. There was a sharp dropoff in this ability for players below the master level. This result could not be attributed to the masters’ generally superior memory ability, for when chess positions were constructed by placing the same numbers of pieces randomly on the board, the masters could then do no better in reconstructing them than weaker players. Hence, the masters appear to be constrained by the same severe short-term memory limits as everyone else (Miller, 1956), and their superior performance with &quot;meaningful&quot; positions must lie in their ability to perceive structure in such positions and encode them in chunks. Specifically, if a chess master can remember the location of 20 or more pieces on the board, but has space for only about five chunks in short-term memory, then each chunk must be composed of four or five pieces, organized in a single relational structure.</p> </blockquote> <p>The paper then runs an experiment which &quot;proves&quot; that master-level players actually do worse than beginners when memorizing random mid-game positions even though they do much better memorizing real mid-game positions (and, in end-game positions, they do about the same as beginners when positions are randomized). Unfortunately, the paper used an absurdly small sample size of one chess player at each skill level.</p> <p>A quick search indicates that this result does not reproduce with larger sample sizes, e.g., Gobet and Simon, in &quot;Recall of rapidly presented random chess positions is a function of skill&quot;, say:</p> <blockquote> <p>A widely cited result asserts that experts’ superiority over novices in recalling meaningful material from their domain of expertise vanishes when they are confronted with random material.
A review of recent chess experiments in which random positions served as control material (presentation time between 3 and 10 sec) shows, however, that strong players generally maintain some superiority over weak players even with random positions, although the relative difference between skill levels is much smaller than with game positions. The implications of this finding for expertise in chess are discussed and the question of the recall of random material in other domains is raised.</p> </blockquote> <p>They find this scales with skill level and, e.g., for &quot;real&quot; positions, 2350+ ELO players memorized ~2.2x the number of correct pieces that 1600-2000 ELO players did, but the difference was ~1.6x for random positions (these ratios are from eyeballing a graph and may be a bit off). 1.6x is smaller than 2.2x, but it's certainly not the claimed 1.0.</p> <p>I've also seen this result cited to claim that it applies to other fields, but in a quick search of applying this result to other fields, results either show something similar (a smaller but still observable difference on randomized positions) or don't reproduce, e.g., McKeithen did this for programmers and found that, on trying to memorize programs, on &quot;normal&quot; program experts were ~2.5x better than beginners on the first trial and 3x better by the 6th trial, whereas on the &quot;scrambled&quot; program, experts were 3x better on the first trial and progressed to being only ~1.5x better by the 6th trial. Despite this result contradicting Chase and Simon, I've seen people cite this result to claim the same thing as Chase and Simon, presumably from people who didn't read what McKeithen actually wrote.</p> <h2 id="type-systems">Type Systems</h2> <p>Unfortunately, false claims about studies and evidence aren't limited to pop-sci memes; they're everywhere in both software and hardware development. For example, see this comment from a Scala/FP &quot;thought leader&quot;:</p> <p><img src="images/empirical-pl/pl_godwin.png" alt="Tweet claiming that any doubt that type systems are helpful is equivalent to being an anti-vaxxer" width="524" height="214"></p> <p>I see something like this at least once a week. I'm picking this example not because it's particularly egregious, but because it's typical. If you follow a few of the big time FP proponents on twitter, you'll see regularly claims that there's very strong empirical evidence and extensive studies backing up the effectiveness of type systems.</p> <p>However, <a href="empirical-pl/">a review of the empirical evidence</a> shows that the evidence is mostly incomplete, and that it's equivocal where it's not incomplete. Of all the false memes, I find this one to be the hardest to understand. In the other cases, I can see a plausible mechanism by which results could be misinterpreted. “Relationship is weaker than expected” can turn into “relationship is opposite of expected”, log can look a lot like an asymptotic function, and preliminary results using inferior methods can spread faster than better conducted follow-up studies. But I'm not sure what the connection between the evidence and beliefs are in this case.</p> <h2 id="is-this-preventable">Is this preventable?</h2> <p>I can see why false memes might spread quickly, even when they directly contradict reliable sources. Reading papers sounds like a lot of work. It sometimes is. But it's often not. Reading a pure math paper is usually a lot of work. 
Reading an empirical paper to determine if the methodology is sound can be a lot of work. For example, biostatistics and econometrics papers tend to apply completely different methods, and it's a lot of work to get familiar enough with the set of methods used in any particular field to understand precisely when they're applicable and what holes they have. But reading empirical papers just to see what claims they make is usually pretty easy.</p> <p>If you read the abstract and conclusion, and then skim the paper for interesting bits (graphs, tables, telling flaws in the methodology, etc.), that's enough to see if popular claims about the paper are true in most cases. In my ideal world, you could get that out of just reading the abstract, but it's not uncommon for papers to make claims in the abstract that are much stronger than the claims made in the body of the paper, so you need to at least skim the paper.</p> <p>Maybe I'm being naive here, but I think a major reason behind false memes is that checking sources sounds much harder and more intimidating than it actually is. A striking example of this is when Quartz published its article on how there isn't a gender gap in tech salaries, <a href="gender-gap/">which cited multiple sources that showed the exact opposite</a>. Twitter was abuzz with people proclaiming that the gender gap has disappeared. When I published <a href="gender-gap/">a post which did nothing but quote the actual cited studies</a>, many of the same people then proclaimed that their original proclamation was mistaken. It's great that they were willing to tweet a correction<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">6</a></sup>, but as far as I can tell no one actually went and read the source data, even though the graphs and tables make it immediately obvious that the author of the original Quartz article was pushing an agenda, not even with cherry picked citations, but citations that showed the opposite of their thesis.</p> <p>Unfortunately, it's in the best interests of non-altruistic people who do read studies to make it seem like reading studies is difficult. For example, when I talked to the founder of a widely used pay-walled site that reviews evidence on supplements and nutrition, he claimed that it was ridiculous to think that &quot;normal people&quot; could interpret studies correctly and that experts are needed to read and summarize studies for the masses. But he's just a serial entrepreneur who realized that you can make a lot of money by reading studies and summarizing the results! A more general example is how people sometimes try to maintain an authoritative air by saying that you need certain credentials or markers of prestige to really read or interpret studies.</p> <p>There are certainly fields where you need some background to properly interpret a study, but even then, the amount of knowledge that a degree contains is quite small and can be picked up by anyone. For example, excluding lab work (none of which contained critical knowledge for interpreting results), I was within a small constant factor of spending one hour of time per credit hour in school. At that conversion rate, an engineering degree from my alma mater costs a bit more than 100 hours and almost all non-engineering degrees land at less than 40 hours, with a large amount of overlap between them because a lot of degrees will require the same classes (e.g., calculus). 
Gatekeeping reading and interpreting a study based on whether or not someone has a credential like a degree is absurd when someone can spend a week's worth of time gaining the knowledge that a degree offers.</p> <p><strong>If you liked this post, you'll probably enjoy <a href="discontinuities/">this post on odd discontinuities</a>, <a href="tech-discrimination/">this post on how the effect of markets on discrimination is more nuanced than it's usually made out to be</a>, and <a href="percentile-latency/">this other post discussing some common misconceptions</a>.</strong></p> <h3 id="2021-update">2021 update</h3> <p>In retrospect, I think the mystery of the &quot;type systems&quot; example is simple: it's a different kind of fake citation than the others. In the first three examples, a clever, contrarian, but actually wrong idea got passed around. This makes sense because people love clever, contrarian ideas and don't care very much if they're wrong, so clever, contrarian ideas relatively frequently become viral relative to their correctness.</p> <p>For the type systems example, it's just that people commonly fabricate evidence and then appeal to authority to support their position. In the post, I was confused because I couldn't see how anyone could look at the evidence and then make the claims that type system advocates do but, after reading thousands of discussions from people advocating for their pet tool/language/practice, I can see that it was naive of me to think that these advocates would even consider looking for evidence as opposed to just pretending that evidence exists without ever having looked.</p> <p><small> Thanks to Leah Hanson, Lindsey Kuper, Jay Weisskopf, Joe Wilder, Scott Feeney, Noah Ennis, Myk Pono, Heath Borders, Nate Clark, and Mateusz Konieczny for comments/corrections/discussion.</p> <p>BTW, if you're going to send me a note to tell me that I'm obviously wrong, please make sure that I'm actually wrong. In general, I get great feedback and I've learned a lot from the feedback that I've gotten, but the feedback I've gotten on this post has been unusually poor. Many people have suggested that the studies I've referenced have been debunked by some other study I clearly haven't read, but in every case so far, I've already read the other study. </small></p> <p><link rel="prefetch" href="empirical-pl/"> <link rel="prefetch" href="gender-gap/"> <link rel="prefetch" href="discontinuities/"> <link rel="prefetch" href="tech-discrimination/"> <link rel="prefetch" href="percentile-latency/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D">Dunning and Kruger claim, without what I'd consider strong evidence, that this is because people who perform well overestimate how well other people perform. While that may be true, one could also say that the explanation for people who are &quot;unskilled&quot; is that they underestimate how well other people perform. &quot;Phase 2&quot; attempts to establish that's not the case, but I don't find the argument convincing for a number of reasons. To pick one example, at the end of the section, they say &quot;Despite seeing the superior performances of their peers, bottom-quartile participants continued to hold the mistaken impression that they had performed just fine.&quot;, but we don't know that the participants believed that they performed fine; we just know what their perceived percentile is. 
It's possible to believe that you're performing poorly while also being in a high percentile (<a href="p95-skill/">and I frequently have this belief for activities I haven't seriously practiced or studied</a>, which seems likely to be the case for the participants of the Dunning-Kruger study who scored poorly on tasks with respect to those tasks). <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:1"><p><a href="http://www.ncbi.nlm.nih.gov/pubmed/17469954">Long-term disability is associated with lasting changes in subjective well-being: evidence from two nationally representative longitudinal studies.</a></p> <blockquote> <p>Hedonic adaptation refers to the process by which individuals return to baseline levels of happiness following a change in life circumstances. Two nationally representative panel studies (Study 1: N = 39,987; Study 2: N = 27,406) were used to investigate the extent of adaptation that occurs following the onset of a long-term disability. In Study 1, 679 participants who acquired a disability were followed for an average of 7.18 years before and 7.39 years after onset of the disability. In Study 2, 272 participants were followed for an average of 3.48 years before and 5.31 years after onset. Disability was associated with moderate to large drops in happiness (effect sizes ranged from 0.40 to 1.27 standard deviations), followed by little adaptation over time.</p> </blockquote> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2"><p><a href="http://www.ncbi.nlm.nih.gov/pubmed/16313658">Time does not heal all wounds</a></p> <blockquote> <p>Cross-sectional studies show that divorced people report lower levels of life satisfaction than do married people. However, such studies cannot determine whether satisfaction actually changes following divorce. In the current study, data from an 18-year panel study of more than 30,000 Germans were used to examine reaction and adaptation to divorce. Results show that satisfaction drops as one approaches divorce and then gradually rebounds over time. However, the return to baseline is not complete. In addition, prospective analyses show that people who will divorce are less happy than those who stay married, even before either group gets married. Thus, the association between divorce and life satisfaction is due to both preexisting differences and lasting changes following the event.</p> </blockquote> <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3"><p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.9139&amp;rep=rep1&amp;type=pdf">Reexamining adaptation and the set point model of happiness: Reactions to changes in marital status.</a></p> <blockquote> <p>According to adaptation theory, individuals react to events but quickly adapt back to baseline levels of subjective well-being. To test this idea, the authors used data from a 15-year longitudinal study of over 24,000 individuals to examine the effects of marital transitions on life satisfaction. On average, individuals reacted to events and then adapted back toward baseline levels. However, there were substantial individual differences in this tendency. Individuals who initially reacted strongly were still far from baseline years later, and many people exhibited trajectories that were in the opposite direction to that predicted by adaptation theory. 
Thus, marital transitions can be associated with long-lasting changes in satisfaction, but these changes can be overlooked when only average trends are examined.</p> </blockquote> <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4"><p><a href="https://ideas.repec.org/p/bru/bruedp/02-16.html">Unemployment Alters the Set-Point for Life Satisfaction</a></p> <blockquote> <p>According to set-point theories of subjective well-being, people react to events but then return to baseline levels of happiness and satisfaction over time. We tested this idea by examining reaction and adaptation to unemployment in a 15-year longitudinal study of more than 24,000 individuals living in Germany. In accordance with set-point theories, individuals reacted strongly to unemployment and then shifted back toward their baseline levels of life satisfaction. However, on average, individuals did not completely return to their former levels of satisfaction, even after they became reemployed. Furthermore, contrary to expectations from adaptation theories, people who had experienced unemployment in the past did not react any less negatively to a new bout of unemployment than did people who had not been previously unemployed. These results suggest that although life satisfaction is moderately stable over time, life events can have a strong influence on long-term levels of subjective well-being.</p> </blockquote> <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> <li id="fn:L"><p>One thing I think it's interesting to look at is how you can see the opinions of people who are cagey about revealing their true opinions in which links they share. For example, Scott Alexander and Tyler Cowen both linked to the bogus gender gap article as something interesting to read and tend to link to things that have the same view.</p> <p>If you naively read their writing, it appears as if they're impartially looking at evidence about how the world works, which they then share with people. But when you observe that they regularly share evidence that supports one narrative, regardless of quality, and don't share evidence that supports the opposite narrative, it would appear that they have a strong opinion on the issue that they reveal via what they link to.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Given that we spend little on testing, how should we test software? testing/ Tue, 10 Mar 2015 00:00:00 +0000 testing/ <p>I've been reading a lot about software testing, lately. Coming from a hardware background (CPUs and hardware accelerators), it's interesting how different software testing is. Bugs in software are much easier to fix, so it makes sense to spend a lot less effort spent on testing. Because less effort is spent on testing, methodologies differ; software testing is biased away from methods with high fixed costs, towards methods with high variable costs. But that doesn't explain all of the differences, or even most of the differences. Most of the differences come from a cultural <a href="http://en.wikipedia.org/wiki/Path_dependence">path dependence</a>, which shows how non-optimally test effort is allocated in both hardware and software.</p> <p></p> <p>I don't really know anything about software testing, but here are some notes from what I've seen at Google, on a few open source projects, and in a handful of papers and demos. 
Since I'm looking at software, I'm going to avoid talking about how hardware testing isn't optimal, but I find that interesting, too.</p> <h3 id="manual-test-generation">Manual Test Generation</h3> <p>From what I've seen, most test effort on most software projects comes from handwritten tests. On the hardware projects I know of, writing tests by hand consumed somewhere between 1% and 25% of the test effort and was responsible for a much smaller percentage of the actual bugs found. Manual testing is considered ok for sanity checking, and sometimes ok for really dirty corner cases, but it's not scalable and too inefficient to rely on.</p> <p>It's true that there's some software that's difficult to do automated testing on, but the software projects I've worked on have relied pretty much totally on manual testing despite being in areas that are among the easiest to test with automated testing. As far as I can tell, that's not because someone did a calculation of the tradeoffs and decided that manual testing was the way to go; it's because it didn't occur to people that there were alternatives to manual testing.</p> <p>So, what are the alternatives?</p> <h3 id="random-test-generation">Random Test Generation</h3> <p>The good news is that random testing is easy to implement. You can spend <a href="https://github.com/danluu/Fuzz.jl">an hour implementing a random test generator and find tens of bugs</a>, or you can spend more time and find <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=jsfunfuzz">thousands of bugs</a>.</p> <p>You can start with something that's almost totally random and generates incredibly dumb tests. As you spend more time on it, you can add constraints and generate smarter random tests that find more complex bugs. Some good examples of this are <a href="http://www.squarefree.com/2007/08/02/introducing-jsfunfuzz/">jsfunfuzz</a>, which started out relatively simple and gained smarts as time went on, and Jepsen, which originally checked some relatively simple constraints and can now check linearizability.</p> <p>While you can generate random tests pretty easily, it still takes some time to write a powerful framework or collection of functions. Luckily, this space is well covered by existing frameworks.</p> <h3 id="random-test-generation-framework">Random Test Generation, Framework</h3> <p>Here's an example of how simple it is to write a JavaScript test using Scott Feeney's <a href="https://github.com/graue/gentest">gentest</a>, taken from the gentest readme.</p> <p>You want to test something like</p> <pre><code>function add(x, y) { return x + y; } </code></pre> <p>To check that addition commutes, you'd write</p> <pre><code>var t = gentest.types;

forAll([t.int, t.int], 'addition is commutative', function(x, y) {
  return add(x, y) === add(y, x);
});
</code></pre> <p>Instead of checking the values by hand, or writing the code to generate the values, the framework handles that and generates tests for you when you specify the constraints. QuickCheck-like generative test frameworks tend to be simple enough that they're no harder to learn how to use than any other unit test or mocking framework.</p> <p>You'll sometimes hear objections about how random testing can only find shallow bugs because random tests are too dumb to find really complex bugs. For one thing, that assumes that you don't specify constraints that allow the random generator to generate intricate test cases. 
But even then, <a href="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan">this paper</a> analyzed production failures in distributed systems, looking for &quot;critical&quot; bugs, bugs that either took down the entire cluster or caused data corruption, and found that 58% could be caught with very simple tests. Turns out, generating “shallow” random tests is enough to catch most production bugs. And that's on projects that are unusually serious about testing and static analysis, projects that have much better test coverage than the average project.</p> <p>A specific example of the effectiveness of naive random testing is the story John Hughes tells <a href="http://www.cse.chalmers.se/edu/year/2012/course/DIT848/files/13-GL-QuickCheck.pdf">in this talk</a>. It starts out when some people came to him with a problem.</p> <blockquote> <p>We know there is a lurking bug somewhere in the dets code. We have got 'bad object' and 'premature eof' every other month the last year. We have not been able to track the bug down since the dets files is repaired automatically next time it is opened.</p> </blockquote> <p><img src="images/testing/hughes_bug.png" alt="Stack: Application on top of Mnesia on top of Dets on top of File system" width="293" height="487"></p> <p>An application that ran on top of Mnesia, a distributed database, was somehow causing errors a layer below the database. There were some guesses as to the cause. Based on when they'd seen the failures, maybe something to do with rehashing something or other in files that are bigger than 1GB? But after more than a month of effort, no one was really sure what was going on.</p> <p>In less than a day, with QuickCheck, they found five bugs. After fixing those bugs, they never saw the problem again. Each of the five bugs was reproducible on a database with one record, with at most five function calls. It is very common for bugs that have complex real-world manifestations to be reproducible with really simple test cases, if you know where to look.</p> <p>In terms of developer time, using some kind of framework that generates random tests is a huge win over manually writing tests in a lot of circumstances, and it's so trivially easy to try out that there's basically no reason not to do it. The ROI of using more advanced techniques may or may not be worth the extra investment to learn how to implement and use them.</p> <p>While dumb random testing works really well in a lot of cases, it has limits. Not all bugs are shallow. I know of a hardware company that's very good at finding deep bugs by having people with years or decades of domain knowledge write custom test generators, which then run on N-thousand machines. That works pretty well, but it requires a lot of test effort, much more than makes sense for almost any software.</p> <p>The other option is to build more smarts into the program doing the test generation. There are a ridiculously large number of papers on how to do that, but very few of those papers have turned into practical, robust software tools. The sort of simple coverage-based test generation used in AFL doesn't have that many papers on it, but it seems to be effective.</p> <h3 id="random-test-generation-coverage-based">Random Test Generation, Coverage Based</h3> <p>If you're using an existing framework, coverage-based testing isn't much harder than using any other sort of random testing. In theory, at least. 
There are often a lot of knobs you can turn to adjust different settings, as well as other complexity.</p> <p>If you're writing a framework, there are a lot of decisions. Chief among them are what coverage metric to use and how to use that coverage metric to drive test generation.</p> <p>For the first choice, which coverage metric, there are coverage metrics that are tractable, but too simplistic, like function coverage, or line coverage (a.k.a. <a href="http://en.wikipedia.org/wiki/Basic_block">basic block</a> coverage). It's easy to track those, but it's also easy to get 100% coverage while missing very serious bugs. And then there are metrics that are great, but intractable, like state coverage or path coverage. Without some kind of magic to collapse equivalent paths or states together, it's impossible to track those for non-trivial programs.</p> <p>For now, let's assume we're not going to use magic, and use some kind of approximation instead. Coming up with good approximations that work in practice often takes a lot of trial and error. Luckily, Michal Zalewski has experimented with a wide variety of different strategies for <a href="http://lcamtuf.coredump.cx/afl/">AFL</a>, a testing tool that instruments code with some coverage metrics that allow the tool to generate smart tests.</p> <p><a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">AFL does the following</a>. Each branch gets something like the following injected, which approximates tracking edges between basic blocks, i.e., which branches are taken and how many times:</p> <pre><code>cur_location = &lt;UNIQUE_COMPILE_TIME_RANDOM_CONSTANT&gt;;
shared_mem[prev_location ^ cur_location]++;
prev_location = cur_location &gt;&gt; 1;
</code></pre> <p>shared_mem happens to be a 64kB array in AFL, but the size is arbitrary.</p> <p>The non-lossy version of this would be to have <code>shared_mem</code> be a map of <code>(prev_location, cur_location) -&gt; int</code>, and increment that. That would track how often each edge (prev_location, cur_location) is taken in the basic block graph.</p> <p>Using a fixed sized array and xor'ing prev_location and cur_location provides lossy compression. To keep from getting too much noise out of trivial changes, for example, running a loop 1200 times vs. 1201 times, AFL only considers a bucket to have changed when it crosses one of the following boundaries: <code>1, 2, 3, 4, 8, 16, 32, or 128</code>. That's one of the two things that AFL tracks to determine coverage.</p> <p>The other is a global set of all (prev_location, cur_location) tuples, which makes it easy to quickly determine if a tuple/transition is new.</p> <p>Roughly speaking, AFL keeps a queue of “interesting” test cases it's found and generates mutations of things in the queue to test. If something changes the coverage stat, it gets added to the queue. There's also some logic to avoid adding test cases that are too slow, and to remove test cases that are relatively uninteresting.</p> <p>AFL is about 13k lines of code, so there's clearly a lot more to it than that, but, conceptually, it's pretty simple. Zalewski explains why he's kept AFL so simple <a href="http://lcamtuf.coredump.cx/afl/related_work.txt">here</a>. 
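Before getting to his comments, here's a minimal, illustrative sketch in Go of the coverage bookkeeping described above (this is not AFL's actual code; the names <code>onBranch</code>, <code>bucket</code>, and <code>seenEdges</code> are invented for this example):</p> <pre><code>package fuzzsketch

// Illustrative sketch of AFL-style edge coverage bookkeeping, not AFL's
// actual implementation.

const mapSize = 1 &lt;&lt; 16 // 64kB, matching the array size mentioned above

var (
	sharedMem    [mapSize]byte       // lossy per-edge hit counts
	prevLocation uint16              // previous branch constant, pre-shifted
	seenEdges    = map[uint32]bool{} // global set of (prev, cur) tuples ever seen
)

// onBranch stands in for the instrumentation injected at each branch;
// cur is that branch's unique compile-time random constant.
func onBranch(cur uint16) {
	sharedMem[prevLocation^cur]++
	seenEdges[uint32(prevLocation)&lt;&lt;16|uint32(cur)] = true
	prevLocation = cur &gt;&gt; 1
}

// bucket collapses a raw hit count into the coarse buckets described above,
// so that, e.g., running a loop 1200 times vs. 1201 times doesn't register
// as new coverage.
func bucket(n byte) byte {
	switch {
	case n &lt;= 3:
		return n
	case n &lt; 8:
		return 4
	case n &lt; 16:
		return 8
	case n &lt; 32:
		return 16
	case n &lt; 128:
		return 32
	default:
		return 128
	}
}
</code></pre> <p>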
Zalewski's comments are short enough that they're worth reading in their entirety if you're at all interested, but I'll excerpt a few bits anyway.</p> <blockquote> <p>In the past six years or so, I've also seen a fair number of academic papers that dealt with smart fuzzing (focusing chiefly on symbolic execution) and a couple papers that discussed proof-of-concept application of genetic algorithms. I'm unconvinced how practical most of these experiments were … Effortlessly getting comparable results [from AFL] with state-of-the-art symbolic execution in equally complex software still seems fairly unlikely, and hasn't been demonstrated in practice so far.</p> </blockquote> <h3 id="test-generation-other-smarts">Test Generation, Other Smarts</h3> <p>While Zalewski is right that it's hard to write a robust and generalizable tool that uses more intelligence, it's possible to get a lot of mileage out of domain specific tools. For example, <a href="http://db.cs.berkeley.edu/papers/dbtest12-bloom.pdf">BloomUnit</a>, a test framework for distributed systems, helps you test non-deterministic systems by generating a subset of valid orderings, using a SAT solver to avoid generating equivalent re-orderings. The authors don't provide benchmark results the same way Zalewski does with AFL, but even without benchmarks it's at least plausible that a SAT solver can be productively applied to test case generation. If nothing else, distributed system tests are often slow enough that you can do a lot of work without severely impacting test throughput.</p> <p>Zalewski says “If your instrumentation makes it 10x more likely to find a bug, but runs 100x slower, your users [are] getting a bad deal.”, which is a great point -- gains in test smartness have to be balanced against losses in test throughput, but if you're testing with something like Jepsen, where your program under test actually runs on multiple machines that have to communicate with each other, the test is going to be slow enough that you can spend a lot of computation generating smarter tests before getting a 10x or 100x slowdown.</p> <p>This same effect makes it difficult to port smart hardware test frameworks to software. It's not unusual for a “short” hardware test to take minutes, and for a long test to take hours or days. As a result, spending a massive amount of computation to generate more efficient tests is worth it, but naively porting a smart hardware test framework<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup> to software is a recipe for overly clever inefficiency.</p> <h3 id="why-not-coverage-based-unit-testing">Why Not Coverage-Based Unit Testing?</h3> <p>QuickCheck and the tens or hundreds of QuickCheck clones are pretty effective for random unit testing, and AFL is really amazing at coverage-based pseudo-random end-to-end test generation to find crashes and security holes. How come there isn't a tool that does coverage-based unit testing?</p> <p>I often assume that if there isn't an implementation of a straightforward idea, there must be some reason, like maybe it's much harder than it sounds, but <a href="http://www.somerandomidiot.com/">Mindy</a> convinced me that there's often no reason something hasn't been done before, so I tried making the simplest possible toy implementation.</p> <p>Before I looked at AFL's internals, I created this really dumb function to test. 
The function takes an array of arbitrary length as input and is supposed to return a non-zero int.</p> <pre><code>// Checks that a number has its bottom 16 bits set
func some_filter(x int) bool {
	for i := 0; i &lt; 16; i = i + 1 {
		if !(x&amp;1 == 1) {
			return false
		}
		x &gt;&gt;= 1
	}
	return true
}

// Takes an array and returns a non-zero int
func dut(a []int) int {
	if len(a) != 4 {
		return 1
	}
	if some_filter(a[0]) {
		if some_filter(a[1]) {
			if some_filter(a[2]) {
				if some_filter(a[3]) {
					return 0 // A bug! We failed to return non-zero!
				}
				return 2
			}
			return 3
		}
		return 4
	}
	return 5
}
</code></pre> <p>dut stands for device under test, a commonly used term in the hardware world. This code is deliberately contrived to make it easy for a coverage based test generator to make progress. Since the code does as little work as possible per branch and per loop iteration, the coverage metric changes every time we do a bit of additional work<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">2</a></sup>. It turns out that <a href="http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">a lot of software acts like this, despite not being deliberately built this way</a>.</p> <p>Random testing is going to have a hard time finding cases where <code>dut</code> incorrectly returns 0. Even if you set the correct array length, a total of 64 bits have to be set to particular values, so there's a 1 in 2^64 chance of any particular random input hitting the failure.</p> <p>But a test generator that uses something like AFL's fuzzing algorithm hits this case almost immediately. Turns out, with reasonable initial inputs, it even finds a failing test case before it really does any coverage-guided test generation because the heuristics AFL uses for generating random tests generate an input that covers this case.</p> <p>That brings up the question of why QuickCheck and most of its clones don't use heuristics to generate random numbers. The QuickCheck paper mentions that it uses random testing because it's nearly as good as <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=5010257&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5010257">partition testing</a> and much easier to implement. That may be true, but it doesn't mean that generating some values using simple heuristics can't generate better results with the same amount of effort. Since Zalewski has <a href="http://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html">already done the work of figuring out, empirically, what heuristics are likely to exercise more code paths</a>, it seems like a waste to ignore that and just generate totally random values.</p> <p>Whether or not it's worth it to use coverage guided generation is a bit iffier; it doesn't prove anything that a toy coverage-based unit testing prototype can find a bug in a contrived function that's amenable to coverage based testing. But that wasn't the point. The point was to see if there was some huge barrier that should prevent people from doing coverage-driven unit testing. As far as I can tell, there isn't.</p> <p>It helps that the implementation of golang is very well commented and has good facilities for manipulating go code, which makes it really easy to modify its coverage tools to generate whatever coverage metrics you want, but most languages have some kind of coverage tools that can be hacked up to provide the appropriate coverage metrics so it shouldn't be too painful for any mature language. 
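To give a feel for the shape of the thing, here's one way the core loop of such a tool could look; this is an illustrative sketch, not the toy prototype described above, and <code>runWithCoverage</code> and <code>mutate</code> are hypothetical helpers standing in for hacked-up coverage tooling and AFL-style mutation heuristics:</p> <pre><code>package covsketch

import "math/rand"

// runWithCoverage is a hypothetical hook: it runs the property on one input
// and reports whether the property held, plus a fingerprint of the coverage
// that run produced.
type runWithCoverage func(input []int) (ok bool, coverage string)

// mutate produces a variant of an existing input: flip a bit, grow the
// slice, etc. Real tools use much richer mutation heuristics.
func mutate(rng *rand.Rand, in []int) []int {
	out := append([]int(nil), in...)
	if len(out) == 0 || rng.Intn(4) == 0 {
		return append(out, rng.Int())
	}
	out[rng.Intn(len(out))] ^= 1 &lt;&lt; uint(rng.Intn(63))
	return out
}

// fuzz keeps a queue of inputs that produced new coverage, mutates them,
// and returns a failing input if it finds one.
func fuzz(run runWithCoverage, seeds [][]int, iterations int) ([]int, bool) {
	rng := rand.New(rand.NewSource(0))
	queue := append([][]int(nil), seeds...)
	seen := map[string]bool{}
	for i := 0; i &lt; iterations &amp;&amp; len(queue) &gt; 0; i++ {
		input := mutate(rng, queue[rng.Intn(len(queue))])
		ok, cov := run(input)
		if !ok {
			return input, true // found a counterexample
		}
		if !seen[cov] { // new coverage: keep this input around to mutate later
			seen[cov] = true
			queue = append(queue, input)
		}
	}
	return nil, false
}
</code></pre> <p>For comparison, a plain QuickCheck-style check of the same kind of property, something like <code>quick.Check(func(a []int) bool { return dut(a) != 0 }, nil)</code> with Go's testing/quick package, draws each input independently at random, which is why it has essentially no chance of hitting the 1 in 2^64 case above. 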
And once you've got the coverage numbers, generating coverage-guided tests isn't much harder than generating random QuickCheck like tests. There are some cases where it's pretty difficult to generate good coverage-guided tests, like when generating functions to test a function that uses higher-order functions, but even in those cases you're no worse off than you would be with a QuickCheck clone<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">3</a></sup>.</p> <h3 id="test-time">Test Time</h3> <p>It's possible to run software tests much more quickly than hardware tests. One side effect of that is that it's common to see people proclaim that all tests should run in time bound X, and you're doing it wrong if they don't. I've heard various values of X from 100ms to 5 minutes. Regardless of the validity of those kinds of statements, a side effect of that attitude is that people often think that running a test generator for a few hours is A LOT OF TESTING. I overheard one comment about how a particular random test tool had found basically all the bugs it could find because, after a bunch of bug fixes, it had been run for a few hours without finding any additional bugs.</p> <p>And then you have hardware companies, which will dedicate thousands of machines to generating and running tests. That probably doesn't make sense for a software company, but considering the relative cost of a single machine compared to the cost of a developer, it's almost certainly worth dedicating at least one machine to generating and running tests. And for companies with their own machines, or dedicated cloud instances, generating tests on idle machines is pretty much free.</p> <h3 id="attitude">Attitude</h3> <p>In &quot;Lessons Learned in Software Testing&quot;, the authors mention that QA shouldn't be expected to find all bugs and that QA shouldn't have veto power over releases because it's impossible to catch most important bugs, and thinking that QA will do so leads to sloppiness. That's a pretty common attitude on the software teams I've seen. But on hardware teams, it's expected that all “bad” bugs will be caught before the final release and QA will shoot down a release if it's been inadequately tested. Despite that, devs are pretty serious about making things testable by avoiding unnecessary complexity. If a bad bug ever escapes (e.g., the Pentium FDIV bug or the Haswell STM bug), there's a post-mortem to figure out how the test process could have gone so wrong that a significant bug escaped.</p> <p>It's hard to say how much of the difference in bug count between hardware and software is attitude, and how much is due to the difference in the amount of effort expended on testing, but I think attitude is a significant factor, in addition to the difference in resources.</p> <p>It affects everything<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">4</a></sup>, down to what level of tests people write. There's a lot of focus on unit testing in software. In hardware, people use the term unit testing, but it usually refers to what would be called an integration test in software. It's considered too hard to thoroughly test every unit; it's much less total effort to test “units” that lie on clean API boundaries (which can be internal or external), so that's where test effort is concentrated.</p> <p>This also drives test generation. If you accept that bad bugs will occur frequently, manually writing tests is ok. 
But if your goal is to never release a chip with a bad bug, there's no way to do that when writing tests by hand, so you'll rely on some combination of random testing, manual testing for tricky edge cases, and formal methods. If you then decide that you don't have the resources to avoid bad bugs all the time, and you have to scale things back, you'll be left with the most efficient bug finding methods, which isn't going to leave a lot of room for writing tests by hand.</p> <h3 id="conclusion">Conclusion</h3> <p>A lot of projects could benefit from more automated testing. Basically every language has a QuickCheck-like framework available, but most projects that are amenable to QuickCheck still rely on manual tests. For all but the tiniest companies, dedicating at least one machine for that kind of testing is probably worth it.</p> <p>I think QuickCheck-like frameworks could benefit from using a coverage driven approach. It's certainly easy to implement for functions that take arrays of ints, but that's also pretty much the easiest possible case for something that uses AFL-like test generation (other than, maybe, an array of bytes). It's possible that this is much harder than I think, but if so, I don't see why.</p> <p>My background is primarily in hardware, so I could be totally wrong! If you have a software testing background, I'd be really interested in <a href="https://twitter.com/danluu">hearing what you think</a>. Also, I haven't talked about the vast majority of the topics that testing covers. For example, figuring out what should be tested is really important! So is figuring out how where nasty bugs might be hiding, and having a good regression test setup. But those are pretty similar between hardware and software, so there's not much to compare and contrast.</p> <h3 id="resources">Resources</h3> <p><a href="http://www.exampler.com/testing-com/writings/coverage.pdf">Brian Marick on code coverage, and how it can be misused</a>.</p> <blockquote> <p>If a part of your test suite is weak in a way that coverage can detect, it's likely also weak in a way coverage can't detect.</p> </blockquote> <p>I'm used to bugs being thought of in the same way -- if a test generator takes a month to catch a bug in an area, there are probably other subtle bugs in the same area, and more work needs to be done on the generator to flush them out.</p> <p><a href="http://www.amazon.com/gp/product/0122007514/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0122007514&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=DNSHYGZA2USRTHCU">Lessons Learned in Software Testing: A Context-Driven Approach, by Kaner, Bach, &amp; Pettichord</a>. This book is too long to excerpt, but I find it interesting because it reflects a lot of conventional wisdom.</p> <p><a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">AFL whitepaper</a>, <a href="http://lcamtuf.coredump.cx/afl/historical_notes.txt">AFL historical notes</a>, and <a href="http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz">AFL code tarball</a>. All of it is really readable. One of the reasons I spent so much time looking at AFL is because of how nicely documented it is. 
Another reason is, of course, that it's been very effective at finding bugs on a wide variety of projects.</p> <p><em>Update: Dmitry Vyukov's <a href="https://speakerdeck.com/filosottile/automated-testing-with-go-fuzz">Go-fuzz</a>, which looks like it was started a month after this post was written, uses the approach from the proof of concept in this post of combining the sort of logic seen in AFL with a QuickCheck-like framework, and has been shown to be quite effective. I believe David R. MacIver is also planning to use this approach in the next version of <a href="https://hypothesis.readthedocs.org/en/latest/">hypothesis</a>.</em></p> <p>And here's some testing related stuff of mine: <a href="//danluu.com/everything-is-broken/">everything is broken</a>, <a href="//danluu.com/broken-builds/">builds are broken</a>, <a href="//danluu.com/julialang/">julia is broken</a>, and <a href="//danluu.com/new-cpu-features/">automated bug finding using analytics</a>.</p> <h3 id="terminology">Terminology</h3> <p>I use the term random testing a lot, in a way that I'm used to using it among hardware folks. I probably mean something broader than what most software folks mean when they say random testing. For example, <a href="http://www.sqlite.org/testing.html">here's how sqlite describes their testing</a>. There's one section on fuzz (random) testing, but it's much smaller than the sections on, say, I/O error testing or OOM testing. But as a hardware person, I'd also put I/O error testing or OOM testing under random testing because I'd expect to use randomly generated tests to test those.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p><small> I've gotten great feedback from a lot of software folks! Thanks to Leah Hanson, Mindy Preston, Allison Kaptur, Lindsey Kuper, Jamie Brandon, John Regehr, David Wragg, and Scott Feeney for providing comments/discussion/feedback. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S"><p>This footnote is a total tangent about a particular hardware test framework! You may want to skip this!</p> <p>SixthSense does a really good job of generating smart tests. It takes as input, some unit or collection of units (with assertions), some checks on the outputs, and some constraints on the inputs. If you don't give it any constraints, it assumes that any input is legal.</p> <p>Then it runs for a while. For units with<em>out</em> “too much” state, it will either find a bug or tell you that it formally proved that there are no bugs. For units with “too much” state, it's still pretty good at finding bugs, using some combination of random simulation and exhaustive search.</p> <p><img src="images/testing/sixth_sense_search.png" alt="Combination of exhaustive search and random execution" width="559" height="151"></p> <p>It can issue formal proofs for units with way too much state to brute force. How does it reduce the state space and determine what it's covered?</p> <p>I basically don't know. There are at least <a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2989">thirty-seven papers on SixthSense</a>. 
Apparently, it uses <a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2990">a combination of combinational rewriting, sequential redundancy removal, min-area retiming, sequential rewriting, input reparameterization, localization, target enlargement, state-transition folding, isomoprhic property decomposition, unfolding, semi-formal search, symbolic simulation, SAT solving with BDDs, induction, interpolation, etc.</a>.</p> <p>My understanding is that SixthSense has had a multi-person team working on it for over a decade. Considering the amount of effort IBM puts into finding hardware bugs, investing tens or hundreds of person years to create a tool like SixthSense is an obvious win for them, but it's not really clear that it makes sense for any software company to make the same investment.</p> <p>Furthermore, SixthSense is really slow by software test standards. Because of the massive overhead involved in simulating hardware, SixthSense actually runs faster than a lot of simple hardware tests normally would, but running SixthSense on a single unit can easily take longer than it takes to run all of the tests on most software projects.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:G">Among other things, it uses nested <code>if</code> statements instead of <code>&amp;&amp;</code> because go's coverage tool doesn't create separate coverage points for <code>&amp;&amp;</code> and <code>||</code>. <a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:Q">Ok, you're slightly worse off due to the overhead of generating and looking at coverage stats, but that's pretty small for most non-trivial programs. <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:A"><p>This is another long, skippable, footnote. This difference in attitude also changes how people try to write correct software. I've had &quot;<a href="//danluu.com/tests-v-reason/">testing is hopelessly inadequate….(it) can be used very effectively to show the presence of bugs but never to show their absence.</a>&quot; quoted at me tens of times by software folks, along with an argument that we have to reason our way out of having bugs. But the attitude of most hardware folks is that while the back half of that statement is true, testing (and, to some extent, formal verification) is the least bad way to assure yourself that something is probably free of bad bugs.</p> <p>This is even true not just on a macro level, but on a micro level. When I interned at Micron in 2003, I worked on flash memory. I read &quot;the green book&quot;, and the handful of papers that were new enough that they weren't in the green book. After all that reading, it was pretty obvious that we (humans) didn't understand all of the mechanisms behind the operation and failure modes of flash memory. There were plausible theories about the details of the exact mechanisms, but proving all of them was still an open problem. Even one single bit of flash memory was beyond human understanding. And yet, we still managed to build reliable flash devices, despite building them out of incompletely understood bits, each of which would eventually fail due to some kind of random (in the quantum sense) mechanism.</p> <p>It's pretty common for engineering to advance faster than human understanding of the underlying physics. 
When you work with devices that aren't understood and assembly them to create products that are too complex for any human to understand or for any known technique to formally verify, there's no choice but to rely on testing. With software, people often have the impression that it's possible to avoid relying on testing because it's possible to just understand the whole thing.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> What happens when you load a URL? navigate-url/ Sat, 07 Mar 2015 00:00:00 +0000 navigate-url/ <p>I've been hearing this question a lot lately, and when I do, it reminds me how much I don't know. Here are some questions this question brings to mind.</p> <p></p> <ol> <li>How does a keyboard work? Why can’t you press an arbitrary combination of three keys at once, except on fancy gaming keyboards? That implies something about how key presses are detected/encoded.</li> <li>How are keys debounced? Is there some analog logic, or is there a microcontroller in the keyboard that does this, or what? How do membrane switches work?</li> <li>How is the OS notified of the keypress? I could probably answer this for a 286, but nowadays it's somehow done through x2APIC, right? How does that work?</li> <li>Also, USB, PS/2, and AT keyboards are different, somehow? How does USB work? And what about laptop keyboards? Is that just a USB connection?</li> <li>How does a USB connector work? You have this connection that can handle 10Gb/s. That surely won't work if there's any gap at all between the physical doodads that are being connected. How do people design connectors that can withstand tens of thousands of insertions and still maintain their tolerances?</li> <li>How does the OS tell the program something happened? How does it know which program to talk to?</li> <li>How does the browser know to try to load a webpage? I guess it sees an &quot;http://&quot; or just assumes that anything with no prefix is a URL?</li> <li>Assume we don't have the webpage cached, so we have to do DNS queries and stuff.</li> <li>How does DNS work? How does DNS caching work? Let's assume it isn't cached at anywhere nearby and we have to go find some far away DNS server.</li> <li>TCP? We establish a connection? Do we do that for DNS or does it have to be UDP?</li> <li>How does the OS decide if an outgoing connection should be allowed? What if there's a software firewall? How does that work?</li> <li>For TCP, without TLS/SSL, we can just do slow-start followed by some standard congestion protocol, right? Is there some deeper complexity there?</li> <li>One level down, how does a network card work?</li> <li>For what matter, how does the network card know what to do? Is there a memory region we write to that the network card can see or does it just monitor bus transactions directly?</li> <li>Ok, say there's a memory region. How does that work? How do we write memory?</li> <li>Some things happen in the CPU/SoC! This is one of the few areas where I know something, so, I'll skip over that. A signal eventually comes out on some pins. What's that signal? Nowadays, people use DDR3, but we didn't always use that protocol. Presumably DDR3 lets us go faster than DDR2, which was faster than DDR, and so on, but why?</li> <li>And then the signal eventually goes into a DRAM module. 
As with the CPU, I'm going to mostly ignore what's going on inside, but I'm curious if DRAM modules still either trench capacitors or stacked capacitors, or has this technology moved on?</li> <li>Going back to our network card, what happens when the signal goes out on the wire? Why do you need a cat5 and not a cat3 cable for 100Mb Ethernet? Is that purely a signal integrity thing or do the cables actually have different wiring?</li> <li>One level below that the wires are surely long enough that they can act like transmission lines / waveguides. How is termination handled? Is twisted pair sufficient to prevent inductive coupling or is there more fancy stuff going on?</li> <li>Say we have a local Ethernet connection to a cable modem. How do cable modems work? Isn't cable somehow multiplexed between different customers? How is it possible to get so much bandwidth through a single coax cable?</li> <li>Going back up a level, the cable connection eventually gets to the ISP. How does the ISP know where to route things? How does internet routing work? Some bits in the header decide the route? How do routing tables get adjusted?</li> <li>Also, the 8.8.8.8 DNS stuff is anycast, right? How is that different from routing &quot;normal&quot; traffic? Ditto for anything served from a Cloudflare CDN. What do they need to do to prevent route flapping and other badness?</li> <li>What makes anycast hard enough to do that very few companies use it?</li> <li>IIRC, the Stanford/Coursera algorithms course mentioned that it's basically a distributed Bellman-Ford calculation. But what prevents someone from putting bogus routes up?</li> <li>If we can figure out where to go our packets go from our ISP through some edge router, some core routers, another edge router, and then go through their network to get into the “meat” of a datacenter.</li> <li>What's the difference between core and edge routers?</li> <li>At some point, our connection ends up going into fiber. How does that happen?</li> <li>There must be some kind of laser. What kind? How is the signal modulated? Is it WDM or TDM? Is it single-mode or multi-mode fiber?</li> <li>If it's WDM, how is it muxed/demuxed? It would be pretty weird to have a prism in free space, right? This is the kind of thing an AWG could do. Is that what's actually used?</li> <li>There must be repeaters between links. How do repeaters work? Do they just boost the signal or do they decode it first to avoid propagating noise? If the latter, there must be DCF between repeaters.</li> <li>Something that just boosts the signal is the simplest case. How does an EDFA work? Is it basically just running current through doped fiber, or is there something deeper going on there?</li> <li>Below that level, there's the question of how standard single mode fiber and DCF work.</li> <li>Why do we need DCF, anyway? I guess it's cheaper to have a combination of standard fiber and DCF than to have fiber with very low dispersion. Why is that?</li> <li>How does fiber even work? I mean, ok, it's probably a waveguide that uses different dielectrics to keep the light contained, but what's the difference between good fiber and bad fiber?</li> <li>For example, hasn't fiber changed over the past couple decades to severely reduce PMD? How is that possible? 
Is that just more precise manufacturing, or is there something else involved?</li> <li>Before PMD became a problem and was solved, there were decades of work that went into increasing fiber bandwidth, vaguely analogous to the way there were decades of work that went into increasing processor performance but also completely different. What was that work and what were the blockers that work was clearing? You'd have to actually know a good deal about fiber engineering to answer this, and I don't.</li> <li>Going back up a few levels, we go into a datacenter. What's up there? Our packets go through a switching network to TOR to machine? What's a likely switch topology? <a href="https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/">Facebook's</a> isn't quite something straight out of <a href="http://www.amazon.com/gp/product/0122007514/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0122007514&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=DNSHYGZA2USRTHCU">Dally and Towles</a>, but it's the kind of thing you could imagine building with that kind of knowledge. It hasn't been long enough since FB published their topology for people to copy them, but is the idea obvious enough that you'd expect it to be independently &quot;copied&quot;?</li> <li>Wait, is that even right? Should we expect a DNS server to sit somewhere in some datacenter?</li> <li>In any case, after all this, our DNS query resolves to an IP. We establish a connection, and then what?</li> <li>HTTP GET? How are HTTP 1.0 and 1.1 different? 2.0?</li> <li>And then we get some files back and the browser has to render them somehow. There's a request for the HTML and also for the CSS and js, and separate requests for images? This must be complicated, since browsers are complicated. I don't have any idea of the complexity of this, so there must be a lot I'm missing.</li> <li>After the browser renders something, how does it get to the GPU and what does the GPU do?</li> <li>For 2d graphics, we probably just notify the OS of... something. How does that work?</li> <li>And how does the OS talk to the GPU? Is there some memory mapped region where you can just paint pixels, or is it more complicated than that?</li> <li>How does an LCD display work? How does the connection between the monitor and the GPU work?</li> <li>VGA is probably the simplest possibility. How does that work?</li> <li>If it's a static site, I guess we're done?</li> <li>But if the site has ads, isn't that stuff pretty complicated? How do targeted ads and ad auctions work? A bunch of stuff somehow happens in maybe 200ms?</li> </ol> <p>Where can I get answers to this stuff<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">1</a></sup>? That's not a rhetorical question! <a href="https://twitter.com/danluu">I'm really interested in hearing about other resources</a>!</p> <p><a href="https://github.com/alex/what-happens-when">Alex Gaynor set up a GitHub repo that attempts to answer this entire question</a>. It answers some of the questions, and has answers to some questions it didn't even occur to me to ask, but it's missing answers to the vast majority of these questions.</p> <p>For high-level answers, here's <a href="http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/">Tali Garsiel and Paul Irish on how a browser works</a> and <a href="http://pyvideo.org/video/1677/how-the-internet-works">Jessica McKellar on how the Internet works</a>. 
For how a simple OS does things, <a href="http://pdos.csail.mit.edu/6.828/2014/xv6.html">Xv6</a> has good explanations. For how Linux works, <a href="http://duartes.org/gustavo/blog/">Gustavo Duarte has a series of explanations here</a><a href="http://www.linusakesson.net/programming/tty/index.php">For TTYs, this article by Linus Akesson is a nice supplement</a> to Duarte's blog.</p> <p>One level down from that, <a href="http://www.jmarshall.com/easy/http/">James Marshall has a concise explanation of HTTP 1.0 and 1.1</a>, and <a href="http://www.sans.org/reading-room/whitepapers/protocols/ssl-tls-beginners-guide-1029">SANS has an old but readable guide on SSL and TLS</a>. <a href="https://url.spec.whatwg.org/">This isn't exactly smooth prose, but this spec for URLs explains in great detail what a URL is</a>.</p> <p>Going down another level, <a href="https://technet.microsoft.com/en-us/library/cc786128.aspx">MS TechNet has an explanation of TCP</a>, which also includes a short explanation of UDP.</p> <p>One more level down, <a href="http://www.informit.com/articles/printerfriendly/21320">Kyle Cassidy has a quick primer on Ethernet</a>, <a href="http://arstechnica.com/gadgets/2011/07/ethernet-how-does-it-work/">Iljitsch van Beijnum has a lengthier explanation with more history</a>, and <a href="http://www.ciscopress.com/articles/printerfriendly/357103">Matthew J Castelli has an explanation of LAN switches</a>. And then we have <a href="http://support.usr.com/support/6000/6000-ug/two.html">DOCSIS and cable modems</a>. <a href="http://www.olson-technology.com/AppNotes/long-haul-communications-systems.pdf">This gives a quick sketch of how long haul fiber is set up</a>, but there must be a better explanation out there somewhere. And here's <a href="//danluu.com/new-cpu-features/">a quick sketch of modern CPUs</a>. <a href="http://www.waitingforfriday.com/index.php/C64_VICE_Front-End">For an answer to the keyboard specific questions, Simon Inns explains keypress decoding and why you can't press an arbitrary combination of keys on a keyboard</a>.</p> <p>Down one more level, <a href="http://electriciantraining.tpub.com/14182/">this explains how wires work</a>, <a href="https://www.nanog.org/meetings/nanog57/presentations/Monday/mon.tutorial.Steenbergen.Optical.39.pdf">Richard A. Steenbergen explains fiber</a>, and <a href="http://www.amazon.com/gp/product/0201543931/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0201543931&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=MXV5K7IJXWXJD446">Pierret</a> <a href="http://www.amazon.com/gp/product/013061792X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=013061792X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=N7YLCANVDPI3M35R">explains</a> transistors.</p> <p>P.S. As an interview question, this is pretty much the antithesis of the <a href="http://sockpuppet.org/blog/2015/03/06/the-hiring-post/">tptacek strategy</a>. From what I've seen, my guess is that tptacek-style interviews are much better filters than open ended questions like this.</p> <p><small></p> <p>Thanks to Marek Majkowski, Allison Kaptur, Mindy Preston, Julia Evans, Marie Clemessy, and Gordon P. Hemsley for providing answers and links to resources with answers! 
Also, thanks to Julia Evans and Sumana Harihareswara for convincing me to turn these questions into a blog post.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:E">I mostly don't have questions about stuff that happens inside a PC listed, but I'm pretty curious about how modern high-speed busses work and how high-speed chips deal with the massive inductance they must have to deal with getting signals to and from the chip. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Goodhearting IQ, cholesterol, and tail latency percentile-latency/ Thu, 05 Mar 2015 00:00:00 +0000 percentile-latency/ <p>Most real-world problems are big enough that you can't just head for the end goal, you have to break them down into smaller parts and set up intermediate goals. For that matter, most games are that way too. “Win” is too big a goal in chess, so you might have a subgoal like <a href="http://www.chesstactics.org/index.php?Type=page&amp;Action=none&amp;From=2,1,1,1">don't get forked</a>. While creating subgoals makes intractable problems tractable, it also creates the problem of determining the relative priority of different subgoals and whether or not a subgoal is relevant to the ultimate goal at all. In chess, there are libraries worth of books written on just that.</p> <p>And chess is really simple compared to a lot of real world problems. 64 squares. 32 pieces. Pretty much any analog problem you can think of contains more state than chess, and so do a lot of discrete problems. Chess is also relatively simple because you can directly measure whether or not you succeeded (won). Many real-world problems have the additional problem of not being able to measure your goal directly.</p> <p></p> <h2 id="iq-early-childhood-education">IQ &amp; Early Childhood Education</h2> <p>In 1962, what's now known as the Perry Preschool Study started in Ypsilanti, a blue-collar town near Detroit. It was a randomized trial, resulting in students getting either no preschool or two years of free preschool. After two years, students in the preschool group showed a 15 point bump in IQ scores; other early education studies showed similar results.</p> <p>In the 60s, these promising early results spurred the creation of Head Start, a large scale preschool program designed to help economically disadvantaged children. Initial results from Head Start were also promising; children in the program got a 10 point IQ boost.</p> <p>The next set of results was disappointing. By age 10, the difference in test scores and IQ between the trial and control groups wasn't statistically significant. The much larger scale Head Start study showed similar results; the authors of the <a href="http://eric.ed.gov/?id=ED036321">first major analysis of Head Start</a> concluded that</p> <blockquote> <p>(1) Summer programs are ineffective in producing lasting gains in affective and cognitive development, (2) full-year programs are ineffective in aiding <a href="http://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-1698-3_1718">affective development</a> and only marginally effective in producing lasting cognitive gains, (3) all Head Start children are still considerably below national norms on tests of language development and scholastic achievement, while school readiness at grade one approaches the national norm, and (4) parents of Head Start children voiced strong approval of the program. 
Thus, while full-year Head Start is somewhat superior to summer Head Start, neither could be described as satisfactory.</p> </blockquote> <p>Education in the U.S. isn't cheap, and these early negative results caused calls for reductions in funding and even the abolishment of the program. Turns out, it's quite difficult to cut funding for a program designed to help disadvantaged children, and the program lives on despite repeated calls to cripple or kill the program.</p> <p>Well after the initial calls to shut down Head Start, long-term results started coming in from the Perry preschool study. As adults, people in the experimental (preschool) group were less likely to have been arrested, less likely to have spent time in prison, and more likely to have graduated from high school. Unfortunately, due to methodological problems in the study design, it's not 100% clear where these effects come from. Although the goal was to do a randomized trial, the experimental design necessitated home visits for the experimental group. As a result, children in the experimental group whose mothers were employed swapped groups with children in the control group whose mothers were unemployed. The positive effects on the preschool group could have been caused by having at-home mothers. Since the Head Start studies weren't randomized and using <a href="https://www.aeaweb.org/articles.php?doi=10.1257/jep.15.4.69">instrumental variables</a> (IVs) to tease out causation in “natural experiments” didn't become trendy until relatively recently, it took a long time to get plausible causal results from Head Start.</p> <p>The goal of analyses with an instrumental variable is to extract causation, the same way you'd be able to in a randomized trial. A classic example is determining the effect of putting kids into school a year earlier or later. Some kids naturally start school a year earlier or later, but there are all sorts of factors that can cause that happen, which means that a correlation between an increased likelihood of playing college sports in kids who started school a year later could just as easily be from the other factors that caused kids to start a year later as it could be from actually starting school a year later.</p> <p>However, date of birth can be used as an instrumental variable that isn't correlated with those other factors. For each school district, there's an arbitrary cutoff that causes kids on one side of the cutoff to start school a year later than kids on the other side. With the not-unreasonable assumption that being born one day later doesn't cause kids to be better athletes in college, you can see if starting school a year later seems to have a causal effect on the probability of playing sports in college.</p> <p>Now, back to Head Start. One IV analysis used a funding discontinuity across counties to generate a quasi experiment. The idea is that there are discrete jumps in the level of Head Start funding across regions that are caused by variations in a continuous variable, which gives you something like a randomized trial. Moving 20 feet across the county line doesn't change much about kids or families, but it moves kids into an area with a significant change in Head Start funding.</p> <p>The results of other IV analyses on Head Start are similar. 
Improvements in test scores faded out over time, but there were significant long-term effects on graduation rate (high school and college), crime rate, health outcomes, and other variables that are more important than test scores.</p> <p>There's no single piece of incredibly convincing evidence. The randomized trial has methodological problems, and IV analyses nearly always leave some lingering questions, but the weight of the evidence indicates that even though scores on standardized tests, including IQ tests, aren't improved by early education programs, people's lives are substantially improved by early education programs. However, if you look at the <a href="http://query.nytimes.com/gst/abstract.html?res=9F07E2DE1F39E63ABC4C52DFB2668382679EDE">early commentary on programs like Head Start</a>, there's no acknowledgment that intermediate targets like IQ scores might not perfectly correlate with life outcomes. Instead you see declarations like <a href="http://archives.chicagotribune.com/1969/04/14/page/62/article/head-starts-effect-is-nil-study-shows">“poor children have been so badly damaged in infancy by their lower-class environment that Head Start cannot make much difference”</a>.</p> <p>The funny thing about all this is that it's well known that IQ doesn't correlate perfectly to outcomes. In the range of environments that you see in typical U.S. families, to correlation to outcomes you might actually care about has an r value in the range of .3 to .4. That's incredibly strong for something in the social sciences, but even that incredibly strong statement is a statement IQ isn't responsible for &quot;most&quot; of the effect on real outcomes, even ignoring possible confounding factors.</p> <h2 id="cholesterol-myocardial-infarction">Cholesterol &amp; Myocardial Infarction</h2> <p>There's a long history of population studies showing a correlation between cholesterol levels and an increased risk of heart attack. A number of early studies found that lifestyle interventions that made cholesterol levels more favorable also decreased heart attack risk. And then statins were invented. Compared to older drugs, statins make cholesterol levels dramatically better and have a large effect on risk of heart attack.</p> <p>Prior to the invention of statins, the standard intervention was a combination of diet and pre-statin drugs. There's a lot of literature on this; <a href="http://www.nejm.org/doi/full/10.1056/NEJM199010183231606">here's one typical review</a> that finds, in randomized trials, a combination of dietary changes and drugs has a modest effect on both cholesterol levels and heart attack risk.</p> <p>Given that narrative, it certainly sounds reasonable to try to develop new drugs that improve cholesterol levels, but when Pfizer spent $800 million doing exactly that, developing torcetrapib, they found that they created a drug which <a href="http://en.wikipedia.org/wiki/Torcetrapib">substantially increased heart attack risk despite improving cholesterol levels</a>. Hoffman-La Roche's attempt fared a bit better because it <a href="http://en.wikipedia.org/wiki/Dalcetrapib">improved cholesterol without killing anyone, but it still failed to decrease heart attack risk</a>. <a href="http://en.wikipedia.org/?title=Ezetimibe/simvastatin">Merck</a> and <a href="http://pipeline.corante.com/archives/2010/03/15/tricors_troubles.php">Tricor</a> have also had the same problem.</p> <p>What happened? 
Some interventions that affected cholesterol levels also affected real health outcomes, prompting people to develop drugs that affect cholesterol. But it turns out that improving cholesterol isn't an inherent good, and like many intermediate targets, it's possible to improve <a href="http://pipeline.corante.com/archives/2010/03/15/tricors_troubles.php">without affecting the end goal</a>.</p> <h2 id="99-ile-latency-latency">99%-ile Latency &amp; Latency</h2> <p>It's pretty common to see latency measurements and benchmarks nowadays. It's well understood that poor latency in applications costs you money, as it causes people to stop using the application. It's also well understood that average latency (mean, median, or mode), by itself, isn't a great metric. It's common to use 99%-ile, 99.9%-ile, 99.99%-ile, etc., in order to capture some information about the distribution and make sure that bad cases aren't too bad.</p> <p>What happens when you use the 99%-iles as intermediate targets? If you require 99%-ile latency to be under 0.5 millisec and 99.99% to be under 5 millisecond you might get a latency distribution that looks something like this.</p> <p><img src="images/percentile-latency/tene_latency.png" alt="Latency graph with kinks at 99%-ile, 99.9%-ile, and 99.99%-ile" width="399" height="294"></p> <p>This is a graph of an actual application that Gil Tene has been showing off in <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">his talks about latency</a>. If you specify goals in terms of 99%-ile, 99.9%-ile, and 99.99%-ile, you'll optimize your system to barely hit those goals. Those optimizations will often push other latencies around, resulting in a funny looking distribution that has kinks at those points, with latency that's often nearly as bad as possible everywhere else.</p> <p>It's is a bit odd, but there's nothing sinister about this. If you try a series of optimizations while doing nothing but looking at three numbers, you'll choose optimizations that improve those three numbers, even if they make the rest of the distribution much worse. In this case, latency rapidly degrades above the 99.99%-ile because the people optimizing literally had no idea how much worse they were making the 99.991%-ile when making changes. It's like the video game solving AI that presses pause before its character is about to get killed, because <a href="http://www.cs.cmu.edu/~tom7/mario/">pausing the game prevents its health from decreasing</a>. If you have very narrow optimization goals, and your measurements don't give you any visibility into anything else, everything but your optimization goals is going to get thrown out the window.</p> <p>Since the end goal is usually to improve the user experience and not just optimize three specific points on the distribution, targeting a few points instead of using some kind of weighted integral can easily cause anti-optimizations that degrade the actual user experience, while producing great slideware.</p> <p>In addition to the problem of optimizing just the 99%-ile to the detriment of everything else, there's the question of how to measure the 99%-ile. 
One method of measuring latency, used by multiple commonly used benchmarking frameworks, is to do something equivalent to</p> <pre><code>for (int i = 0; i &lt; NUM; ++i) { auto a = get_time(); do_operation(); auto b = get_time(); measurements[i] = b - a; } </code></pre> <p>If you optimize the 99%-ile of that measurement, you're optimizing the 99%-ile for when all of your users get together and decide to use your app sequentially, coordinating so that no one requests anything until the previous user is finished.</p> <p>Consider a contrived case where you measure for 20 seconds. For the first 10 seconds, each response takes 1ms. For the 2nd 10 seconds, the system is stalled, so the last request takes 10 seconds, resulting in 10,000 measurements of 1ms and 1 measurement of 10s. With these measurements, the 99%-ile is 1ms, as is the 99.9%-ile, for the matter. Everything looks great!</p> <p>But if you consider a “real” system where users just submit requests, uniformly at random, the 75%-ile latency should be &gt;= 5 seconds because if any query comes during the 2nd half, it will get jammed up, for an average of 5 seconds and as much as 10 seconds, in addition to whatever queuing happens because requests get stuck behind other requests.</p> <p>If this example sounds contrived, it is; if you'd prefer a real world example, <a href="http://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html">see this post by Nitsan Wakart</a>, which finds shows how YCSB (Yahoo Cloud Serving Benchmark) has this problem, and how different the distributions look before and after the fix.</p> <p><img src="images/percentile-latency/ycsb_coordinated.png" alt="Order of magnitude latency differences between YCSB's measurement and the truth" width="640" height="247"></p> <p>The red line is YCSB's claimed latency. The blue line is what the latency looks like after Wakart fixed the coordination problem. There's more than an order of magnitude difference between the original YCSB measurement and Wakart's corrected version.</p> <p>It's important to not only consider the whole distribution, to make make sure you're measuring a distribution that's relevant. Real users, which can be anything from a human clicking something on a web app, to an app that's waiting for an RPC, aren't going to coordinate to make sure they don't submit overlapping requests; they're not even going to obey a uniform random distribution.</p> <h2 id="conclusion">Conclusion</h2> <p>This is the point in a blog post where you're supposed to get the one weird trick that solves your problem. But the only trick is that there is no trick, that you have to constantly check that your map is somehow connected to the territory<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <h3 id="resources">Resources</h3> <p><a href="http://aspe.hhs.gov/daltcp/reports/headstar.htm">1990 HHS Report on Head Start</a>. <a href="http://aspe.hhs.gov/daltcp/reports/headstar.htm">2012 Review of Evidence on Head Start</a>.</p> <p><a href="https://www.aeaweb.org/articles.php?doi=10.1257/jep.15.4.69">A short article on instrumental variables</a>. 
<a href="http://www.amazon.com/gp/product/0691120358/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0691120358&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=6WFT6E7PJBSVW47N">A book on econometrics and instrumental variables</a>.</p> <p><a href="https://www.youtube.com/watch?v=XmImGiVuJno">Aysylu Greenberg video on benchmarking pitfalls</a>; it's not latency specific, but it covers a wide variety of common errors. <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">Gil Tene video on latency</a>; covers many more topics than this post. <a href="http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html">Nitsan Wakart on measuring latency</a>; has code examples and links to libraries.</p> <p><small></p> <h3 id="acknowledgments">Acknowledgments</h3> <p>Thanks to Leah Hanson for extensive comments on this, and to Scott Feeney and Kyle Littler for comments that resulted in minor edits.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">Unless you're in school and your professor likes to give problems where the answers are nice, simple, numbers, maybe the weird trick is that you know you're off track if you get an intermediate answer with a <code>170/23</code> in front of it. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> AI doesn't have to be very good to displace humans customer-service/ Sun, 15 Feb 2015 00:00:00 +0000 customer-service/ <p>There's an ongoing debate over whether &quot;AI&quot; will ever be good enough to displace humans and, if so, when it will happen. In this debate, the optimists tend to focus on how much AI is improving and the pessimists point to all the ways AI isn't as good as an ideal human being. I think this misses two very important factors.</p> <p>One, is that jobs that are on the potential chopping block, such as first-line customer service, customer service for industries that are either low margin or don't care about the customer, etc., tend to be filled by apathetic humans in a poorly designed system, and <a href="p95-skill/">humans aren't even very good at simple tasks</a> they <a href="bad-decisions/">care a lot about</a>. When we're apathetic, we're absolutely terrible; it's not going to take a nearly-omniscient sci-fi level AI to perform at least somewhat comparably.</p> <p>Two, companies are going to replace humans with AI in many roles even if AI is significantly worse as if the AI is much cheaper. One place this has already happened (though perhaps this software is too basic to be considered an AI) is with phone trees. Phone trees are absolutely terrible compared to the humans they replaced, but they're also orders of magnitude cheaper. Although there are high-margin high-touch companies that won't put you through a phone tree, at most companies, for a customer looking for customer service, a huge number of work hours have been replaced by phone trees, and were then replaced again by phone trees with poor AI voice recognition that I find worse than old school touch pad phone trees. It's not a great experience, and may get much worse when AI automates even more of the process.</p> <p>But on the other hand, here's a not-too-atypical customer service interaction I had last week with a human who was significantly worse than a mediocre AI. I scheduled an appointment for an MRI. The MRI is for a jaw problem which makes it painful to talk. 
I was hoping that the scheduling would be easy, so I wouldn't have to spend a lot of time talking on the phone. But, as is often the case when dealing with bureaucracy, it wasn't easy.</p> <p>Here are the steps it took.</p> <p></p> <ol> <li>Have jaw pain.</li> <li>See dentist. Get referral for MRI when dentist determines that it's likely to be a joint problem.</li> <li>Dentist gets referral form from UW Health, faxes it to them according to the instructions on the form, and emails me a copy of the referral.</li> <li>Call UW Health.</li> <li>UW Health tries to schedule me for an MRI of my pituitary.</li> <li>Ask them to make sure there isn't an error.</li> <li>UW Health looks again and realizes that's a referral for something else. They can't find anything for me.</li> <li>Ask UW Health to call dentist to work it out. UW Health claims they cannot make phone calls.</li> <li>Talk to dentist again. Ask dentist to fax form again.</li> <li>Call UW Health again. Ask them to check again.</li> <li>UW Health says form is illegally filled out.</li> <li>Ask them to call dentist to work it out, again.</li> <li>UW Health says that's impossible.</li> <li>Ask why.</li> <li>UW Health says, “for legal reasons”.</li> <li>Realize that's probably a vague and unfounded fear of HIPAA regulations. Try asking again nicely for them to call my dentist, using different phrasing.</li> <li>UW Health agrees to call dentist. Hangs up.</li> <li>Look at referral, realize that it's actually impossible for someone outside of UW Health (like my dentist) to fill out the form legally given the instructions on the form.</li> <li>Talk to dentist again.</li> <li>Dentist agrees form is impossible, talks to UW Health to figure things out.</li> <li>Call UW Health to see if they got the form.</li> <li>UW Health acknowledges receipt of valid referral.</li> <li>Ask to schedule earliest possible appointment.</li> <li>UW Health isn't sure they can accept referrals from dentists. Goes to check.</li> <li>UW Health determines it is possible to accept a referral from a dentist.</li> <li>UW Health suggests a time on 2/17.</li> <li>I point out that I probably can't make it because of a conflicting appointment, also with UW Health, which I know about because I can see it on my profile with I log into the UW Health online system.</li> <li>UW Health suggests a time on 2/18.</li> <li>I point out another conflict that is in the UW Health system.</li> <li>UW Health starts looking for times on later dates.</li> <li>I ask if there are any other times available on 2/17.</li> <li>UW Health notices that there are other times available on 2/17 and schedules me later on 2/17.</li> </ol> <p>I present this not because it's a bad case, but because it's a representative one<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>. In this case, my dentist's office was happy to do whatever was necessary to resolve things, but UW Health refused to talk to them without repeated suggestions that talking to my dentist would be the easiest way to resolve things. Even then, I'm not sure it helped much. This isn't even all that bad, since I was able to convince the intransigent party to cooperate. The bad cases are when both parties refuse to talk to each other and both claim that the situation can only be resolved when the other party contacts them, resulting in a deadlock. The good cases are when both parties are willing to talk to each other and work out whatever problems are necessary. 
Having a non-AI phone tree or web app that exposes simple scheduling would be far superior to the human customer service experience here. An AI chatbot that's a light wrapper around the API a web app would use would be worse than being able to use a normal website, but still better than human customer service. An AI chatbot that's more than a just a light wrapper would blow away the humans who do this job for UW Health.</p> <p>The case against using computers instead of humans is that computers are bad at handling error conditions, can't adapt to unusual situations, and behave according to mechanical rules, which can often generate ridiculous outcomes, but that's precisely the situation we're in right now with humans. It already feels like dealing with a computer program. Not a modern computer program, but a compiler from the 80s that tells you that there's at least one error, with no other diagnostic information.</p> <p>UW Health sent a form with impossible instructions to my dentist. That's not great, but it's understandable; mistakes happen. However, when they got the form back and it wasn't correctly filled out, instead of contacting my dentist they just threw it away. Just like an 80s compiler. Error! The second time around, they told me that the form was incorrectly filled out. Error! There was a human on the other end who could have noted that the form was impossible to fill out. But like an 80s compiler, they stopped at the first error and gave it no further thought. This eventually got resolved, but the error messages I got along the way were much worse than I'd expect from a modern program. Clang (and even gcc) give me much better error messages than I got here.</p> <p>Of course, as we saw with healthcare.gov, outsourcing interaction to computers doesn't guarantee good results. <a href="tech-discrimination/">There are some claims that market solutions will automatically fix any problem</a>, but those <a href="https://news.ycombinator.com/item?id=9051049">claims</a> <a href="https://news.ycombinator.com/item?id=5523992">don't</a> <a href="https://news.ycombinator.com/item?id=5208988">always</a> <a href="https://news.ycombinator.com/item?id=453507">work</a> <a href="https://news.ycombinator.com/item?id=4216898">out</a>.</p> <p><img id="dentist" src="images/customer-service/seeking_googler.png" alt="Seeking AdSense Googler. Need AdSense help. My emails remain unanswered. Are you the special Googler who will help?" width="273" height="148"></p> <p>That's an ad someone was running for a few months on Facebook in order to try to find a human at Google to help them because every conventional technique they had at their disposal failed. Google has perhaps the most advanced ML in the world, they're as market driven as any other public company, and they've mostly tried to automate away service jobs like first-level support because support doesn't scale. As a result, the most reliable methods of getting support at Google are</p> <ol> <li>Be famous enough that a blog post or tweet will get enough attention to garner a response.</li> <li>Work at Google or know someone who works at Google and is willing to not only file an internal bug, but to drive it to make sure it gets handled.</li> </ol> <p>If you don't have direct access to one of these methods, running an ad is actually a pretty reasonable solution. 
(1) and (2) don't always work, but they're more effective than not being famous and hoping a blog post will hit HN, or <a href="https://news.ycombinator.com/item?id=526688">being a paying customer</a>. The point here isn't to rag on Google, it's just that automated customer service solutions aren't infallible, even when you've got an AI that can beat the strongest go player in the world and multiple buildings full of people applying that same technology to practical problems.</p> <p>While replacing humans with computers doesn't always create a great experience, good computer based systems for things like scheduling and referrals can already be much better than the average human at a bureaucratic institution<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">2</a></sup>. With the right setup, a computer-based system can be better at escalating thorny problems to someone who's capable of solving them than a human-based system. And computers will only get better at this. There will be bugs. And there will be bad systems. But there are already bugs in human systems. And there are already bad human systems.</p> <p>I'm not sure if, in my lifetime, technology will advance to the point where computers can be as good as helpful humans in a well designed system. But we're already at the point where computers can be as helpful as apathetic humans in a poorly designed system, which describes a significant fraction of service jobs.</p> <h3 id="2023-update">2023 update</h3> <p>When ChatGPT was released in 2022, the debate described above in 2015 happened again, with the same arguments on both sides. People are once again saying that AI (this time, ChatGPT and LLMs) can't replace humans because a great human is better than ChatGPT. They'll often pick a couple examples of ChatGPT saying something extremely silly, &quot;hallucinating&quot;, but if you <a href="https://mastodon.social/@danluu/109458271351706514.">ask a human to explain something, even a world-class expert, they often hallucinate a totally fake explanation as well</a></p> <p>Many people on the pessimist side argued that it would be decades before LLMs can replace humans for the exact reasons we noted were false in 2015. Everyone made this argument <a href="https://mastodon.social/@danluu/109579276227658596">after multiple industries had massive cuts in the number of humans they need to employ due to pre-LLM &quot;AI&quot; automation</a> and many of these people even made this argument after companies had already laid people off and replaced people with LLMs. <a href="https://mastodon.social/@danluu/109458264830891897">I commented on this at the time, using the same reasoning I used in this 2015 post</a> before realizing that <a href="https://mastodon.social/@danluu/109458901726543386">I'd already written down this line of reasoning in 2015</a>. But, cut me some slack; I'm just a human, not a computer, so I have a fallible memory.</p> <p>Now that it's been a year ChatGPT was released, the AI pessimists who argued that LLMs would displace human jobs for a very long time have been proven even more wrong by layoff after layoff where customer service orgs were cut to the bone and mostly replaced by AI, AI customer service seems quite poor, just like human customer service. But human customer service isn't improving, while AI customer service is. 
For example, here are some recent customer service interactions I had as a result of bringing my car in to get the oil changed, rotate the tires, and do a third thing (long story).</p> <ol> <li>I call my local tire shop and oil change place<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">3</a></sup> and ask if they can do the three things I want with my car</li> <li>They say yes</li> <li>I ask if I can just drop by or if I need to make an appointment</li> <li>They say yes, I can just drop by to get the world done</li> <li>I ask if I can talk to the service manager directly to get some more info</li> <li>After being transferred to the service manager, I describe what I want again and ask when I can come in</li> <li>They say that will take a lot of time and I'll need to make an appointment. They can get me in next week. If I listened to the first guy, I would've had a completely pointless one-hour round trip drive since they couldn't, in fact, do the work I wanted as a drop-in</li> <li>A week later, I bring the car in and talk to someone at the desk, who asks me what I need done</li> <li>I describe what I need and notice that he only writes down about 1/3 of what I said, so I follow up and</li> <li>ask what oil they're going to use</li> <li>The guy says &quot;we'll use the right oil&quot;</li> <li>I tell him that I want 0W-20 synthetic because my car has a service bulletin indicating that this is recommended, which is different from the label on the car, so could they please note this.</li> <li>The guy repeats &quot;we'll use the right oil&quot;.</li> <li>(12) again, with slightly different phrasing</li> <li>(13) again, with slightly different phrasing</li> <li>(12) again, with slightly different phrasing</li> <li>The guy says, &quot;it's all in the computer, the computer has the right oil&quot;.</li> <li>I ask him what oil the computer says to use</li> <li>Annoyed, the guy walks over to the computer and pull up my car, telling me that my car should use 5W-30</li> <li>I tell him that's not right for my vehicle due to the service bulletin and I want 0W-20 synthetic</li> <li>The guy, looking shocked, says &quot;Oh&quot;, and then looks at the computer and says &quot;oh, it says we can also use 0W-20&quot;</li> <li>The guy writes down 0W-20 on the sheet for my car</li> <li>I leave, expecting that the third thing I asked for won't be done or won't completely be done since it wasn't really written down</li> <li>The next day, I pick up my car and they fully didn't do the third thing.</li> </ol> <p>Overall, how does an LLM compare? It's probably significantly better than this dude, who acted like an archetypical stoner who doesn't want to be there and doesn't want to do anything, and the LLM will be cheaper as well. However, the LLM will be worse than a web interface that lets me book the exact work I want and write a note to the tech who's doing the work. 
For better or for worse, I don't think my local tire / oil change place is going to give me a nice web interface that lets me book the exact work I want any time soon, so this guy is going to be replaced by an LLM and not a simple web app.</p> <h3 id="elsewhere">Elsewhere</h3> <ul> <li><a href="https://srconstantin.github.io/2019/02/25/humans-who-are-not-concentrating.html">Sarah Constantin's, 2019: Humans Who Are Not Concentrating Are Not General Intelligences</a></li> <li><a href="https://bsky.app/profile/vickiboykis.com/post/3kewppylblx2g">Vicki Boykis, 2023:</a>: Probably 40-50% of LLM use-cases would be made redundant if just 10-20% of the top sites on the internet had better internal search engines. <ul> <li><a href="https://bsky.app/profile/hillelwayne.com/post/3kex4cjtjeu2v">Hillel Wayne in response</a>: 45-55% if they had bulk searching</li> </ul></li> </ul> <p><small> Thanks to Leah Hanson and Josiah Irwin for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">Representative of my experience in Madison, anyway. The absolute worst case of this I encountered in Austin isn't even as bad as the median case I've seen in Madison. YMMV. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:L"><p>I wonder if a deranged version of the <a href="http://en.wikipedia.org/wiki/Law_of_one_price">law of one price</a> applies, the law of one level of customer service. However good or bad an organization is at customer service, they will create or purchase automated solutions that are equally good or bad.</p> <p>At Costco, the checkout clerks move fast and are helpful, so you don't have much reason to use the automated checkout. But then the self-checkout machines tend to be well-designed; they're physically laid out to reduce the time it takes to feed a large volume of stuff through them, and they rarely get confused and deadlock, so there's not much reason not to use them. At a number of other grocery chains, the checkout clerks are apathetic and move slowly, and will make mistakes unless you remind them of what's happening. It makes sense to use self-checkout at those places, except that the self-checkout machines aren't designed particularly well and are often configured so that they often get confused and require intervention from an overloaded checkout clerk.</p> <p>The same thing seems to happen with automated phone trees, as well as both of the examples above. Local Health has an online system to automate customer service, but they went with Epic as the provider, and as a result it's even worse than dealing with their phone support. And it's possible to get a human on the line if you're a customer on some Google products, but that human <a href="http://successfulsoftware.net/2015/03/04/google-bans-hyperlinks/">is often no more helpful</a> <a href="https://news.ycombinator.com/item?id=3803568">than the automated system</a> <a href="http://zoekeating.tumblr.com/post/108898194009/what-should-i-do-about-youtube">you'd otherwise deal with</a>.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:O">BTW, this isn't a knock against my local tire shop. I used my local tire shop because they're actually above average! 
I've also tried the local dealership, which is fine but super expensive, and a widely recommended independent Volvo specialist, which was much worse — they did sloppy work and missed important issues and were sloppy elsewhere as well; they literally forgot to order parts for the work they were going to do (a mistake an AI probably wouldn't have made), so I had to come back another day to finish the work on my car! <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> </ol> </div> CPU backdoors cpu-backdoors/ Tue, 03 Feb 2015 00:00:00 +0000 cpu-backdoors/ <p>It's generally accepted that any piece of software could be compromised with a backdoor. Prominent examples include the <a href="http://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal">Sony/BMG installer</a>, which had a backdoor built-in to allow Sony to keep users from copying the CD, which also allowed malicious third-parties to take over any machine with the software installed; the <a href="http://redmine.replicant.us/projects/replicant/wiki/SamsungGalaxyBackdoor">Samsung Galaxy</a>, which has a backdoor that allowed the modem to access the device's filesystem, which also allows anyone running <a href="https://www.youtube.com/watch?v=RXqQioV_bpo">a fake base station to access files on the device</a>; <a href="http://www.cypherspace.org/adam/hacks/lotus-nsa-key.html">Lotus Notes</a>, which had a backdoor which allowed encryption to be defeated; and <a href="https://forums.lenovo.com/t5/Lenovo-P-Y-and-Z-series/Lenovo-Pre-instaling-adware-spam-Superfish-powerd-by/td-p/1726839">Lenovo laptops</a>, which pushed all web traffic through a proxy (including HTTPS, via a trusted root certificate) in order to push ads, which allowed anyone with the correct key (which was distributed on every laptop) to intercept HTTPS traffic.</p> <p>Despite sightings of backdoors in <a href="http://www.cl.cam.ac.uk/~sps32/sec_news.html#Assurance">FPGAs</a> and <a href="https://github.com/elvanderb/TCP-32764">networking gear</a>, whenever someone brings up the possibility of CPU backdoors, it's still common for people to claim that it's <a href="https://news.ycombinator.com/item?id=6147767">impossible</a>. I'm not going to claim that CPU backdoors exist, but I will claim that the implementation is easy, if you've got the right access.</p> <p></p> <p>Let's say you wanted to make a backdoor. How would you do it? There are three parts to this: what could a backdoored CPU do, how could the backdoor be accessed, and what kind of compromise would be required to install the backdoor?</p> <p>Starting with the first item, what does the backdoor do? There are a lot of possibilities. The simplest is to allow privilege escalation: make the CPU to <a href="http://en.wikipedia.org/wiki/Protection_ring">transition from ring3 to ring0 or SMM</a>, giving the running process kernel-level privileges. Since it's the CPU that's doing it, this can punch through both hardware and software virtualization. There are a lot of subtler or more invasive things you could do, but privilege escalation is both simple enough and powerful enough that I'm not going to discuss the other options.</p> <p>Now that you know what you want the backdoor to do, how should it get triggered? Ideally, it will be something that no one will run across by accident, or even by brute force, while looking for backdoors. 
Even with that limitation, the state space of possible triggers is huge.</p> <p>Let's look at a particular instruction, <code>fyl2x</code><sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">1</a></sup>. Under normal operation, it takes two floating point registers as input, giving you <code>2*80=160</code> bits to hide a trigger in. If you trigger the backdoor off of a specific pair of values, that's probably safe against random discovery. If you're really worried about someone stumbling across the backdoor by accident, or brute forcing a suspected backdoor, you can check more than the two normal input registers (after all, you've got control of the CPU).</p> <p>This trigger is nice and simple, but the downside is that hitting the trigger probably requires executing native code since you're unlikely to get chrome or Firefox to emit an <code>fyl2x</code> instruction. You could try to work around that by triggering off an instruction you can easily get a JavaScript engine to emit (like an <code>fadd</code>). The problem with that is that if you patch an add instruction and add some checks to it, it will become noticeably slower (although, if you can edit the hardware, you should be able to do it with no overhead). It might be possible to create something hard to detect that's triggerable through JavaScript by patching a <a href="http://www.csc.depauw.edu/~bhoward/asmtut/asmtut7.html">rep string</a> instruction and doing some stuff to set up the appropriate “key” followed by a block copy, or maybe <code>idiv</code>. Alternately, if you've managed to get a copy of the design, you can probably figure out a way to use debug logic triggers<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup> or performance counters to set off a backdoor when some arbitrary JavaScript gets run.</p> <p>Alright, now you've got a backdoor. How do you insert the backdoor? In software, you'd either edit the source or the <a href="//danluu.com/edit-binary/">binary</a>. In hardware, if you have access to the source, you can edit it as easily as you can in software. The hardware equivalent of recompiling the source, creating physical chips, has tremendously high fixed costs; if you're trying to get your changes into the source, you'll want to either compromise the design<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup> and insert your edits before everything is sent off to get manufactured, or compromise the manufacturing process and sneak in your edits at the last second<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup>.</p> <p>If that sounds too hard, you could try compromising the patch mechanism. Most modern CPUs come with a built-in patch mechanism to allow bug fixes after the fact. It's likely that the CPU you're using has been patched, possibly from day one, and possibly as part of a firmware update. The details of the patch mechanism for your CPU are a closely guarded secret. It's likely that the CPU has a public key etched into it, and that it will only accept a patch that's been signed by the right private key.</p> <p>Is this actually happening? I have no idea. Could it be happening? Absolutely. What are the odds? Well, the primary challenge is non-technical, so I'm not the right person to ask about that. 
If I had to guess, I'd say no, if for no other reason than the <a href="https://github.com/elvanderb/TCP-32764">ease of subverting other equipment</a>.</p> <p><small> I haven't discussed how to make a backdoor that's hard to detect even if someone has access to software you've used to trigger a backdoor. That's harder, but it should be possible once chips start coming with built-in <a href="http://en.wikipedia.org/wiki/Trusted_Platform_Module">TPMs</a>.</p> <p><strong>If you liked this post, you'll probably enjoy <a href="//danluu.com/cpu-bugs/">this post on CPU bugs</a> and might be interested in <a href="//danluu.com/new-cpu-features/">this post about new CPU features over the past 35 years</a>.</strong></p> <h4 id="updates">Updates</h4> <p>See <a href="https://twitter.com/danluu/status/562962211782815746">this twitter thread</a> for much more discussion, some of which is summarized below.</p> <p>I'm not going to provide individual attributions because there are too many comments, but here's a summary of comments from @hackerfantastic, Arrigo Triulzi, David Kanter, @solardiz, @4Dgifts, Alfredo Ortega, Marsh Ray, and Russ Cox. Mistakes are my own, of course.</p> <p><a href="http://www.realworldtech.com/forum/?threadid=35566&amp;curpostid=35566">AMD's K7 and K8 had their microcode patch mechanisms compromised</a>, allowing for the sort of attacks mentioned in this post. Turns out, AMD didn't encrypt updates or validate them with a checksum, which lets you easily modify updates until you get one that does what you want.</p> <p><a href="http://es.slideshare.net/ortegaalfredo/deep-submicronbackdoorsortegasyscan2014slides">Here's an example of a backdoor that was created for demonstration purposes</a>, by Alfredo Ortega.</p> <p>For folks without a hardware background, <a href="http://media.ccc.de/browse/congress/2013/30C3_-_5443_-_en_-_saal_g_-_201312281830_-_introduction_to_processor_design_-_byterazor.html#video">this talk on how to implement a CPU in VHDL is nice, and it has a section on how to implement a backdoor</a>.</p> <p>Is it possible to backdoor RDRAND by providing bad random results? Yes. I mentioned that in my first draft of this post, but I got rid of it since my impression was that people don't trust RDRAND and mix the results other sources of entropy. That doesn't make a backdoor useless, but it significantly reduces the value.</p> <p>Would it be possible to store and dump AES-NI keys? It's probably infeasible to sneak flash memory onto a chip without anyone noticing, but modern chips have logic analyzer facilities that let you store and dump data. However, access to those is through some secret mechanism and it's not clear how you'd even get access to binaries that would let you reverse engineer their operation. That's in stark contrast to the K8 reverse engineering, which was possible because microcode patches get included in firmware updates.</p> <p>It would be possible to check instruction prefixes for the trigger. x86 lets you put redundant (and contradictory) instruction prefixes on instructions. Which prefixes get used are well defined, so you can add as many prefixes as you want without causing problems (up to the prefix length limit). 
The issues with this are that it's probably hard to do without sacrificing performance with a microcode patch, the limited number of prefixes and the length limit mean that your effective key size is relatively small if you don't track state across multiple instructions, and that you can only generate the trigger with native code.</p> <p>As far as anyone knows, this is all speculative, and no one has seen an actual CPU backdoor being used in the wild.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Leah Hanson for extensive comments, to Aleksey Shipilev and Joe Wilder for suggestions/corrections, and to the many participants in the twitter discussion linked to above. Also, thanks to Markus Siemens for noticing that a bug in some RSS readers was causing problems, and for providing the workaround. That's not really specific to this post, but it happened to come up here. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:I"><p>This choice of instruction is somewhat, but not completely, arbitrary. You'll probably want an instruction that's both slow and microcoded, to make it easy to patch with a microcode patch without causing a huge performance hit. The rest of this footnote is about what it means for an instruction to be microcoded. It's quite long and not in the critical path of this post, so you might want to skip it.</p> <p>The distinction between a microcoded instruction and one that's implemented in hardware is, itself, somewhat arbitrary. CPUs have an instruction set they implement, which you can think of as a public API. Internally, they can execute a different instruction set, which you can think of as a private API.</p> <p>On modern Intel chips, instructions that turn into four (or fewer) uops (private API calls) are translated into uops directly by the decoder. Instructions that result in more uops (anywhere from five to hundreds or possibly thousands) are decoded via a microcode engine that reads uops out of a small ROM or RAM on the CPU. Why four and not five? That's a result of some tradeoffs, not some fundamental truth. The terminology for this isn't standardized, but the folks I know would say that an instruction is “microcoded” if its decode is handled by the microcode engine and that it's “implemented in hardware” if its decode is handled by the standard decoder. The microcode engine is sort of its own CPU, since it has to be able to handle things like reading and writing from temporary registers that aren't architecturally visible, reading and writing from internal RAM for instructions that need more than just a few registers of scratch space, conditional microcode branches that change which microcode the microcode engine fetches and decodes, etc.</p> <p>Implementation details vary (and tend to be secret). But whatever the implementation, you can think of the microcode engine as something that loads a RAM with microcode when the CPU starts up, which then fetches and decodes microcoded instructions out of that RAM. It's easy to modify what microcode gets executed by changing what gets loaded on boot via a microcode patch.</p> <p>For quicker turnaround while debugging, it's somewhere between plausible and likely that Intel also has a mechanism that lets them force non-microcoded instructions to execute out of the microcode RAM in order to allow them to be patched with a microcode patch. 
But even if that's not the case, compromising the microcode patch mechanism and modifying a single microcoded instruction should be sufficient to install a backdoor.</p> <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:D">For the most part, these aren't publicly documented, but you can get a high-level overview of what kind of debug triggers Intel was building into their chips a couple generators ago starting at page 128 of <a href="http://click.intel.com/the-tick-tock-beat-of-microprocessor-development-at-intel.html">Intel Technology Journal, Volume 4, Issue 3</a>. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:A">For the past couple years, there's been a debate over whether or not major corporations have been compromised and whether such a thing is even possible. During the cold war, government agencies on all sides were compromised at various levels for extended periods of time, despite having access to countermeasures not available to any corporations today (not hiring citizens of foreign countries, &quot;enhanced interrogation techniques&quot;, etc.). I'm not sure that we'll ever know if companies are being compromised, but it would certainly be easier to compromise a present-day corporation than it was to compromise government agencies during the cold war, and that was eminently doable. Compromising a company enough to get the key to the microcode patch is trivial compared to what was done during the cold war. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:E"><p>This is another really long footnote about minutia! In particular, it's about the manufacturing process. You might want to skip it! If you don't, don't say I didn't warn you.</p> <p>It turns out that editing chips before manufacturing is fully complete is relatively easy, by design. To explain why, we'll have to look at how chips are made.</p> <p><img src="images/cpu-backdoors/intel_22nm.png" alt="Cross section of Intel chip, 22nm process" width="420" height="440"></p> <p>When you look at a <a href="http://www.intel.com/content/dam/www/public/us/en/documents/presentation/silicon-technology-leadership-presentation.pdf">cross-section of a chip</a>, you see that silicon gates are at the bottom, forming logical primitives like <a href="http://www.nand2tetris.org/">nand gates</a>, with a series of metal layers above (labeled M1 through M8), forming wires that connect different gates. A cartoon model of the manufacturing process is that chips are built from the bottom up, one layer a time, where each layer is created by depositing some material and then etching part of it away using a mask, in a process that's analogous to <a href="http://en.wikipedia.org/wiki/Lithography">lithographic printing</a>. The non-cartoon version involves a lot of complexity -- <a href="https://www.youtube.com/watch?v=NGFhc8R_uO4">Todd Fernendez estimates that it takes about 500 steps to create the layers below “M1”</a>. Additionally, the level of precision needed is high enough that the light used to etch causes enough wear in the equipment that it wears out. You probably don't normally think about lenses wearing out due to light passing through them, but at the level of precision required for each of the hundreds of steps required to make a transistor, it's a serious problem. If that sounds surprising to you, you're not alone. 
An <a href="http://www.itrs.net/">ITRS roadmap</a> from the 90s predicted that by 2016, we'd be at almost 30GHz (higher is better) on a 9nm process (smaller is better), with chips consuming almost 300 watts. Instead, 5 GHz is considered pretty fast, and anyone who isn't Intel will be lucky to get high-yield production on a 14nm process by the start of 2016. Making chips is harder than anyone guessed it would be.</p> <p>A modern chip has enough layers that it takes about three months to make one, from start to finish. This makes bugs very bad news since a bug fix that requires a change to one of the bottom layers takes three months to manufacture. In order to reduce the turnaround time on bug fixes, it's typical to scatter unused logic gates around the silicon, to allow small bug fixes to be done with an edit to a few layers that are near the top. Since chips are made in a manufacturing line process, at any point in time, there are batches of partially complete chips. If you only need to edit one of the top metal layers, you can apply the edit to a partially finished chip, cutting the turnaround time down from months to weeks.</p> <p>Since chips are designed to allow easy edits, someone with access to the design before the chip is manufactured (such as the manufacturer) can make major changes with relatively small edits. I suspect that if you were to make this comment to anyone at a major CPU company, they'd tell you it's impossible to do this without them noticing because it would get caught in characterization or when they were trying to find speed paths or something similar. One would hope, but <a href="http://www.cl.cam.ac.uk/~sps32/sec_news.html#Assurance">actual hardware devices have shipped with backdoors</a>, and either no one noticed, or they were complicit.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Blog monetization blog-ads/ Sat, 24 Jan 2015 00:00:00 +0000 blog-ads/ <p>Does it make sense for me to run ads on my blog? I've been thinking about this lately, since Carbon Ads contacted me about putting an ad up. What are the pros and cons? This isn't a rhetorical question. I'm genuinely <a href="https://twitter.com/danluu">interested in what you think</a>.</p> <p></p> <h3 id="pros">Pros</h3> <h4 id="money">Money</h4> <p>Hey, who couldn't use more money? And it's basically free money. Well, except for the all of the downsides.</p> <h4 id="data">Data</h4> <p>There's lots of studies on the impact of ads on site usage and behavior. But as with any sort of benchmarking, it's not really clear how or if that generalizes to other sites if you don't have a deep understanding of the domain, and I have almost no understanding of the domain. If I run some ads and do some A/B testing I'll get to see what the effect is on my site, which would be neat.</p> <h3 id="cons">Cons</h3> <h4 id="money-1">Money</h4> <p>It's not enough money to make a living off of, and it's never going to be. When Carbon contacted me, they asked me how much traffic I got in the past 30 days. At the time, Google Analytics showed 118k sessions, 94k users, 143k page views. Cloudflare tends to show about 20% higher traffic since 20% of people block Google Analytics, but those 20% plus more probably block ads, so the &quot;real&quot; numbers aren't helpful here. I told them that, but I also told them that those numbers were pretty unusual and that I'd expect to average much less traffic.</p> <p>How much money is that worth? 
I don't know if the CPM (cost per thousand impressions) numbers they gave me are confidential, so I'll just use a current standard figure of $1 CPM. If my traffic continued at that rate, that would be $143/month, or $1,700/year. Ok, that's not too bad.</p> <p><img src="images/blog-ads/ga_traffic.png" alt="Distribution of traffic on this blog. About 500k hits total." width="640" height="142"></p> <p>But let's look at the traffic since I started this blog. I didn't add analytics until after a post of mine got passed around on HN and reddit, so this isn't all of my traffic, but it's close.</p> <p>For one thing, the 143k hits over a 30-day period seems like a fluke. I've never had a calendar month with that much traffic. I just happen to have a traffic distribution which turned up a bunch of traffic over a specific 30-day period.</p> <p>Also, if I stop blogging, as I did from April to October, my traffic level drops to pretty much zero. And even if I keep blogging, it's not really clear what my “natural” traffic level is. Is the level before I paused my blogging the normal level or the level after? Either way, $143/month seems like a good guess for an upper bound. I might exceed that, but I doubt it.</p> <p>For a hard upper bound, let's look at one of the most widely read programming blogs, Coding Horror. Jeff Atwood is nice enough to make his traffic stats available. Thanks Jeff!</p> <p><img src="images/blog-ads/ch_traffic.png" alt="Distribution of traffic on Coding Horror. 1.7M hits in a month at its peak." width="592" height="252"></p> <p>He got 1.7M hits in his best month, and 1.25M wouldn't be a bad month for him, even when he was blogging regularly. With today's CPM rates, that's $1.7k/month at his peak and $1.25k/month for a normal month.</p> <p>But Jeff Atwood blogs about general interest programming topics, like <a href="http://blog.codinghorror.com/the-future-of-markdown/">Markdown</a> and <a href="http://blog.codinghorror.com/why-ruby/">Ruby</a> and I blog about obscure stuff, like <a href="//danluu.com/clwb-pcommit/">why Intel might want to add new instructions to speed up non-volatile storage</a> with the occasional <a href="//danluu.com/empirical-pl/">literature review</a> for variety. There's no way I can get as much traffic as someone who blogs about more general interest topics; I'd be surprised if I could even get within a factor of 2, so $600/month seems like a hard and probably unreachable upper bound for sustainable income.</p> <p>That's not bad. After taxes, that would have approximately covered my rent when I lived in Austin, and could have covered rent + utilities and other expenses if I'd had a roommate. But the wildly optimistic success rate is that you barely cover rent when the programming job market is hot enough that mid-level positions at big companies pay out total compensation that's 8x-9x the median income in the U.S. That's not good.</p> <p>Worse yet, this is getting worse over time. CPM is down something like 5x since the 90s, and continues to decline. Meanwhile, the percentage of people using ad blockers continues to increase.</p> <p>Premium ads can get well over an order of magnitude higher CPM and <a href="http://daringfireball.net/feeds/sponsors/">sponsorships can fetch an ever better return</a>, so the picture might not be quite as bleak as I'm making it out to be. But to get premium ads you need to appeal to specific advertisers. What advertisers are interested in an audience that's mostly programmers with an interest in low-level shenanigans? 
I don't know, and I doubt it's worth the effort to find out unless I can get to Jeff Atwood levels of traffic, which I find unlikely.</p> <h5 id="a-tangent-on-alexa-rankings">A Tangent on Alexa Rankings</h5> <p>What's up with Alexa? Why do so many people use it as a gold standard? In theory, it's supposed to show how popular a site was over the past three months. According to Alexa, Coding Horror is ranked at 22k and I'm at 162k. My understanding is that traffic is more than linear in rank so you'd expect Coding Horror to have substantially more than 7x the traffic that I do. But if you compare Jeff's stats to mine over the past three months (Oct 21 - Jan 21), statcounter claims he's had 78k hits compared to my 298k hits. Even if you assume that traffic is merely linear in Alexa rank, that's a 28x difference in relative traffic between the direct measurement and Alexa's estimate.</p> <p>I'm not claiming that my blog is more popular in any meaningful sense -- if Jeff posted as often as I did in the past three months, I'm sure he'd have at least 10x more traffic than me. But given that Jeff now spends most of his time on non-blogging activities and that his traffic is at the level it's at when he rarely blogs, the Alexa ranks for our sites seem way off.</p> <p>Moreover, the Alexa sub-metrics are inconsistent and nonsensical. Take this graph on the relative proportion of users who use this site from home, school, or work.</p> <p><img src="images/blog-ads/alexa_below_average.png" alt="Relatively below average, at everything!" width="249" height="138"></p> <p>It's below average in every category, which should be impossible for a relative ranking like this. But even mathematical impossibility doesn't stop Alexa!</p> <h4 id="traffic">Traffic</h4> <p>Ads reduce traffic. How much depends both on the site and the ads. I might do a literature review some other time, but for now I'm just going to link to this single result by <a href="http://www.decisionsciencenews.com/2015/01/02/annoying-animated-ads-may-cost-worth-websites/">Daniel G. Goldstein, Siddharth Suri, R. Preston McAfee, Matthew Ekstrand-Abueg, and Fernando Diaz</a> that attempts to quantify the cost.</p> <p>My point isn't that some specific study applies to adding a single ad to my site, but that it's well known that adding ads reduces traffic and has some effect on long-term user behavior, which has some cost.</p> <p>It's relatively easy to quantify the cost if you're looking at something like the study above, which compares “annoying” ads to “good” ads to see what the cost of the “annoying” ads are. It's harder to quantify for a personal blog where the baseline benefit is non-monetary.</p> <p>What do I get out of this blog, anyway? The main benefits I can see are that I've met and regularly correspond with some great people I wouldn't have otherwise met, that I often get good feedback on my ideas, and that every once in a while someone pings me about a job that sounds interesting because they saw a relevant post of mine.</p> <p>I doubt I can effectively estimate the amount of traffic I'll lose, and even if I could, I doubt I could figure out the relationship between that and the value I get out of blogging. My gut says that the value is “a lot” and that the monetary payoff is probably “not a lot”, but it's not clear what that means at the margin.</p> <h4 id="incentives">Incentives</h4> <p>People are influenced by money, even when they don't notice it. I'm people. 
I might do something to get more revenue, even though the dollar amount is small and I wouldn't consciously spend a lot of effort of optimizing things to get an extra $5/month.</p> <p>What would that mean here? Maybe I'd write more blog posts? When I experimented with blurting out blog posts more frequently, with less editing, I got uniformly positive feedback, so maybe being incentivized to write more wouldn't be so bad. But I always worry about unconscious bias and I wonder what other effects running ads might have on me.</p> <h4 id="privacy">Privacy</h4> <p>Ad networks can track people through ads. My impression is that people are mostly concerned with really big companies that have enough information that they could deanonymize people if they were so inclined, like Google and Facebook, but some people are probably also concerned about smaller ad networks like Carbon. Just as an aside, I'm curious if companies that attempt to do lots of tracking, like Tapad and MediaMath actually have more data on people than better known companies like Yahoo and eBay. I doubt that kind of data is publicly available, though.</p> <h4 id="paypal">Paypal</h4> <p>This is specific to Carbon, but they pay out through PayPal, which is notorious for freezing funds for six months if you get enough money that you'd actually want the money, and for pseudo-randomly <a href="http://www.jacquesmattheij.com/PayPal+robbed+my+bank+account">draining your bank account due to clerical errors</a>. I've managed to avoid hooking my PayPal account up to my bank account so far, but I'll have to either do that or get money out through an intermediary if I end up making enough money that I want to withdraw it.</p> <h3 id="conclusion">Conclusion</h3> <p>Is running ads worth it? I don't know. If I had to guess, I'd say no. I'm going to try it anyway because I'm curious what the data looks like, and I'm not going to get to see any data if I don't try something, but it's not like that data will tell me whether or not it was worth it.</p> <p>At best, I'll be able to see a difference in click-through rates on my blog with and without ads. This blog mostly spreads through word of mouth, so what I really want to see is the difference in the rate at which the blog gets shared with other people, but I don't see a good way to do that. I could try globally enabling or disabling ads for months at a time, but the variance between months is so high that I don't know that I'd get good data out of that even if I did it for years.</p> <p><i>Thanks to Anja Boskovic for comments/corrections/discussion.</i></p> <h4 id="update">Update</h4> <p>After running an ads for a while, it looks like about 40% of my traffic uses an ad blocker (whereas about 17% of my traffic blocks Google Analytics). I'm not sure if I should be surprised that the number is so high or that it's so low. On the one hand, 40% is a lot! On the other hand, despite complaints that ad blockers slow down browsers, my experience has been that web pages load a lot faster when I'm blocking ads using the right ad blocker and I don't see any reason not to use an blocker. I'd expect that most of my traffic comes from programmers, who all know that ad blocking is possible.</p> <p>There's the argument that ad blocking is piracy and/or stealing, but I've never heard a convincing case made. 
If anything, I think that some of the people who make that argument step over the line, as when ars technica blocked people who used ad blockers, and then backed off and merely exhorted people to disable ad blocking for their site. I think most people would agree that directly exhorting people to click on ads and commit click fraud is unethical; asking people to disable ad blocking is a difference in degree, not in kind. People who use ad blockers are much less likely to click on ads, so having them disable ad blockers to generate impressions that are unlikely to convert strikes me as pretty similar to having people who aren't interested in the product generate clicks.</p> <p>Anyway, I ended up removing this ad after they failed to send a payment after the first payment. AdSense is rumored to wait until just before payment before cutting people off, to get as many impressions as possible for free, but AdSense at least notifies you about it. Carbon just stopped paying without saying anything, while still running the ad. I could probably ask someone at Carbon or BuySellAds about it, but considering how little the ad is worth, it's not really worth the hassle of doing that.</p> <h4 id="update-2">Update 2</h4> <p>It's been almost two years since I said that I'd never get enough traffic for blogging to be able to cover my living expenses. It turns out that's not true! My reasoning was that I mostly tend to blog about low-level technical topics, which can't possibly generate enough traffic to generate &quot;real&quot; ad revenue. That reason is still as valid as ever, but my blogging is now approximately half low-level technical stuff, and half general-interest topics for programmers.</p> <p><img src="images/blog-ads/2016-month.png" alt="Traffic for one month on this blog in 2016. Roughly 3.1M hits." width="1856" height="878"></p> <p>Here's a graph of my traffic for the past 30 days (as of October 25th, 2016). Since this is Cloudflare's graph of requests, this would wildly overestimate traffic for most sites, because each image and CSS file is one request. However, since the vast majority of my traffic goes to pages with no external CSS and no images, this is pretty close to my actual level of traffic. 15% of the requests are images, and 10% is RSS (which I won't count because the rate of RSS hits is hard to correlation to the rate of actual people reading). But that means that 75% of the traffic appears to be &quot;real&quot;, which puts the traffic into this site at roughly 2.3M hits per month. At a typical $1 ad CPM, that's $2.3k/month, which could cover my share of household expenses.</p> <p>Additionally, when I look at blogs that really try to monetize their traffic, they tend to monetize at a much better rate. For example, <a href="https://web.archive.org/web/20190220011804/https://slatestarcodex.com/advertise/">Slate Star Codex charges $1250 for 6 months of ads and appears to be running 8 ads</a>, for a total of $20k/yr. The author claims to get &quot;10,000 to 20,000 impressions per day&quot;, or roughly 450k hits per month. I get about 5x that much traffic. If we scale that linearly, that might be $100k/yr instead of $20k/yr. One thing that I find interesting is that the ads on Slate Star Codex don't get blocked by my ad blocker. It seems like that's because the author isn't part of some giant advertising program, and ad blockers don't go out of their way to block every set of single-site custom ads out there. 
I'm using Slate Star Codex as an example because I think it's not super ad optimized because I doubt I would optimize my ads much if I ran ads.</p> <p>This is getting to the point where it seems a bit unreasonable not to run ads (I doubt the non-direct value I get out of this blog can consistently exceed $100k/yr). I probably &quot;should&quot; run ads, but I don't think the revenue I get from something like AdSense or Carbon is really worth it, and it seems like a hassle to run my own ad program the way Slate Star Codex does. It seems totally irrational to leave $90k/yr on the table because &quot;it seems like a hassle&quot;, but here we are. I went back and added affiliate code to all of my Amazon links, but if I'm estimating Amazon's payouts correctly, that will amount to less than $100/month.</p> <p>I don't think it's necessarily more irrational than behavior I see from other people -- I regularly talk to people who leave $200k/yr or more on the table by working for startups instead of large companies, and that seems like a reasonable preference to me. They make &quot;enough&quot; money and like things the way they are. What's wrong with that? So why can't not running ads be a reasonable preference? It still feels pretty unreasonable to me, though! A few people have suggested crowdfunding, but the top earning programmers have at least an order of magnitude more exposure than I do and make an order of magnitude less than I could on ads (folks like Casey Muratori, ESR, and eevee are pulling in around $1000/month).</p> <h4 id="update-3">Update 3</h4> <p>I'm now trying <a href="https://www.patreon.com/danluu">donations via Patreon</a>. I suspect this won't work, but I'd be happy to be wrong!</p> What's new in CPUs since the 80s? new-cpu-features/ Sun, 11 Jan 2015 00:00:00 +0000 new-cpu-features/ <p>This is a response to the following question from David Albert:</p> <blockquote> <p>My mental model of CPUs is stuck in the 1980s: basically boxes that do arithmetic, logic, bit twiddling and shifting, and loading and storing things in memory. I'm vaguely aware of various newer developments like vector instructions (SIMD) and the idea that newer CPUs have support for virtualization (though I have no idea what that means in practice).</p> <p>What cool developments have I been missing? What can today's CPU do that last year's CPU couldn't? How about a CPU from two years ago, five years ago, or ten years ago? The things I'm most interested in are things that programmers have to manually take advantage of (or programming environments have to be redesigned to take advantage of) in order to use and as a result might not be using yet. I think this excludes things like Hyper-threading/SMT, but I'm not honestly sure. I'm also interested in things that CPUs can't do yet but will be able to do in the near future.</p> </blockquote> <p></p> <p>Everything below refers to x86 and linux, unless otherwise indicated. History has a tendency to repeat itself, and a lot of things that were new to x86 were old hat to supercomputing, mainframe, and workstation folks.</p> <h3 id="the-present">The Present</h3> <h4 id="miscellania">Miscellania</h4> <p>For one thing, chips have wider registers and can address more memory. In the 80s, you might have used an 8-bit CPU, but now you almost certainly have a <a href="http://courses.cs.washington.edu/courses/csep590/06au/projects/history-64-bit.pdf">64-bit CPU</a> in your machine. 
I'm not going to talk about this too much, since I assume you're familiar with programming a 64-bit machine. In addition to providing more address space, 64-bit mode provides more registers and more consistent floating point results (via the avoidance of pseudo-randomly getting 80-bit precision for 32 and 64 bit operations via x87 floating point). Other things that you're very likely to be using that were introduced to x86 since the early 80s include paging / virtual memory, pipelining, and floating point.</p> <h4 id="esoterica">Esoterica</h4> <p>I'm also going to avoid discussing things that are now irrelevant (like A20M) and things that will only affect your life if you're writing drivers, BIOS code, doing security audits, or other unusually low-level stuff (like APIC/x2APIC, SMM, NX, or <a href="https://eprint.iacr.org/2016/086">SGX</a>).</p> <h4 id="memory-caches">Memory / Caches</h4> <p>Of the remaining topics, the one that's most likely to have a real effect on day-to-day programming is how memory works. My first computer was a 286. On that machine, a memory access might take a few cycles. A few years back, I used a Pentium 4 system where a memory access took more than 400 cycles. Processors have sped up a lot more than memory. The solution to the problem of having relatively slow memory has been to add caching, which provides fast access to frequently used data, and prefetching, which preloads data into caches if the access pattern is predictable.</p> <p>A few cycles vs. 400+ cycles sounds really bad; that's well over 100x slower. But if I write a dumb loop that reads and operates on a large block of 64-bit (8-byte) values, the CPU is smart enough to prefetch the correct data before I need it, which lets me process at about <a href="//danluu.com/assembly-intrinsics/">22 GB/s</a> on my 3GHz processor. A calculation that can consume 8 bytes every cycle at 3GHz only works out to 24GB/s, so getting 22GB/s isn't so bad. We're losing something like 8% performance by having to go to main memory, not 100x.</p> <p>As a first-order approximation, using predictable memory access patterns and operating on chunks of data that are smaller than your CPU cache will get you most of the benefit of modern caches. If you want to squeeze out as much performance as possible, <a href="http://people.freebsd.org/~lstewart/articles/cpumemory.pdf">this document</a> is a good starting point. After digesting that 100 page PDF, you'll want to familiarize yourself with the microarchitecture and memory subsystem of the system you're optimizing for, and learn how to profile the performance of your application with something like <a href="https://code.google.com/p/likwid/">likwid</a>.</p> <h4 id="tlbs">TLBs</h4> <p>There are lots of little caches on the chip for all sorts of things, not just main memory. You don't need to know about the decoded instruction cache and other funny little caches unless you're really going all out on micro-optimizations. The big exception is the TLBs, which are caches for virtual memory lookups (done via a 4-level page table structure on x86). Even if the page tables were in the l1-data cache, that would be 4 cycles per lookup, or 16 cycles to do an entire virtual address lookup each time around. That's totally unacceptable for something that's required for all user-mode memory accesses, so there are small, fast, caches for virtual address lookups.</p> <p>Because the first level TLB cache has to be fast, it's severely limited in size (perhaps 64 entries on a modern chip). 
If you use 4k pages, that limits the amount of memory you can address without incurring a TLB miss. x86 also supports 2MB and 1GB pages; some applications will benefit a lot from using larger page sizes. It's something worth looking into if you've got a long-running application that uses a lot of memory.</p> <p>Also, first-level caches are usually limited by the page size times <a href="//danluu.com/3c-conflict/">the associativity of the cache</a>. If the cache is smaller than that, the bits used to index into the cache are the same regardless if whether you're looking at the virtual address or the physical address, so you don't have to do a virtual to physical translation before indexing into the cache. If the cache is larger than that, you have to first do a TLB lookup to index into the cache (which will cost at least one extra cycle), or build a virtually indexed cache (which is possible, but adds complexity and coupling to software). You can see this limit in modern chips. Haswell has an 8-way associative cache and 4kB pages. Its l1 data cache is <code>8 * 4kB = 32kB</code>.</p> <h4 id="out-of-order-execution-serialization">Out of Order Execution / Serialization</h4> <p>For a couple decades now, x86 chips have been able to speculatively execute and re-order execution (to avoid blocking on a single stalled resource). This sometimes results in <a href="https://docs.google.com/document/d/18gs0bkEwQ5cO8pMXT_MsOa8Xey4NEavXq-OvtdUXKck/pub">odd performance hiccups</a>. But x86 is pretty strict in requiring that, for a single CPU, externally visible state, like registers and memory, must be updated as if everything were executed in order. The implementation of this involves making sure that, for any pair of instructions with a dependency, those instructions execute in the correct order with respect to each other.</p> <p>That restriction that things look like they executed in order means that, for the most part, you can ignore the existence of OoO execution unless you're trying to eke out the best possible performance. The major exceptions are when you need to make sure something not only looks like it executed in order externally, but actually executed in order internally.</p> <p>An example of when you might care would be if you're trying to measure the execution time of a sequence of instructions using <code>rdtsc</code>. <code>rdtsc</code> reads a hidden internal counter and puts the result into <code>edx</code> and <code>eax</code>, externally visible registers.</p> <p>Say we do something like</p> <pre><code>foo rdtsc bar mov %eax, [%ebx] baz </code></pre> <p>where foo, bar, and baz don't touch <code>eax</code>, <code>edx</code>, or <code>[%ebx]</code>. The mov that follows the rdtsc will write the value of <code>eax</code> to some location in memory, and because <code>eax</code> is an externally visible register, the CPU will guarantee that the <code>mov</code> doesn't execute until after <code>rdtsc</code> has executed, so that everything looks like it happened in order.</p> <p>However, since there isn't an explicit dependency between the <code>rdtsc</code> and either <code>foo</code> or <code>bar</code>, the <code>rdtsc</code> could execute before <code>foo</code>, between <code>foo</code> and <code>bar</code>, or after <code>bar</code>. It could even be the case that <code>baz</code> executes before the <code>rdtsc</code>, as long as <code>baz</code> doesn't affect the move instruction in any way. 
There are some circumstances where that would be fine, but it's not fine if the <code>rdtsc</code> is there to measure the execution time of <code>foo</code>.</p> <p>To precisely order the <code>rdtsc</code> with respect to other instructions, we need to an instruction that serializes execution. Precise details on how exactly to do that <a href="http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf">are provided in this document by Intel</a>.</p> <h4 id="memory-concurrency">Memory / Concurrency</h4> <p>In addition to the ordering restrictions above, which imply that loads and stores to the same location can't be reordered with respect to each other, x86 loads and stores have some other restrictions. In particular, for a single CPU, stores are never reordered with other stores, and stores are never reordered with earlier loads, regardless of whether or not they're to the same location.</p> <p>However, loads can be reordered with earlier stores. For example, if you write</p> <pre><code>mov 1, [%esp] mov [%ebx], %eax </code></pre> <p>it can be executed as if you wrote</p> <pre><code>mov [%ebx], %eax mov 1, [%esp] </code></pre> <p>But the converse isn't true — if you write the latter, it can never be executed as if you wrote the former.</p> <p>You could force the first example to execute as written by inserting a serializing instruction. But that requires the CPU to serialize all instructions. But that's slow, since it effectively forces the CPU to wait until all instructions before the serializing instruction are done before executing anything after the serializing instruction. There's also an <code>mfence</code> instruction that only serializes loads and stores, if you only care about load/store ordering.</p> <p>I'm not going to discuss the other memory fences, lfence and sfence, <a href="https://stackoverflow.com/q/20316124/334816">but you can read more about them here</a>.</p> <p>We've looked at single core ordering, where loads and stores are mostly ordered, but there's also multi-core ordering. The above restrictions all apply; if core0 is observing core1, it will see that all of the single core rules apply to core1's loads and stores. However, if core0 and core1 interact, there's no guarantee that their interaction is ordered.</p> <p>For example, say that core0 and core 1 start with <code>eax</code> and <code>edx</code> set to 0, and core 0 executes</p> <pre><code>mov 1, [_foo] mov [_foo], %eax mov [_bar], %edx </code></pre> <p>while core1 executes</p> <pre><code>mov 1, [_bar] mov [_bar], %eax mov [_foo], %edx </code></pre> <p>For both cores, <code>eax</code> has to be <code>1</code> because of the within-core dependency between the first instruction and the second instruction. However, it's possible for <code>edx</code> to be <code>0</code> in both cores because line 3 of core0 can execute before core0 sees anything from core1, and visa versa.</p> <p>That covers memory barriers, which serialize memory accesses within a core. Since stores are required to be seen in a consistent order across cores, they can, they also have an effect on cross-core concurrency, but it's pretty difficult to reason about that kind of thing correctly. <a href="http://yarchive.net/comp/linux/locking.html">Linus has this to say on using memory barriers instead of locking</a>:</p> <blockquote> <p>The <em>real</em> cost of not locking also often ends up being the inevitable bugs. 
Doing clever things with memory barriers is almost always a bug waiting to happen. It's just <em>really</em> hard to wrap your head around all the things that can happen on ten different architectures with different memory ordering, and a single missing barrier. … The fact is, any time anybody makes up a new locking mechanism, THEY ALWAYS GET IT WRONG. Don't do it.</p> </blockquote> <p>And it turns out that on modern x86 CPUs, using locking to implement concurrency primitives is <a href="https://blogs.oracle.com/dave/resource/NHM-Pipeline-Blog-V2.txt">often cheaper than using memory barriers</a>, so let's look at locks.</p> <p>If we set <code>_foo</code> to 0 and have two threads that both execute <code>incl (_foo)</code> 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2. If it's not obvious why the theoretical minimum is 2 and not 10000, figuring that out is a good exercise. If it is obvious, my bonus exercise for you is, can any reasonable CPU implementation get that result, or is that some silly thing the spec allows that will never happen? There isn't enough information in this post to answer the bonus question, but I believe I've linked to enough information.</p> <p>We can try this with a simple code snippet</p> <pre><code>#include &lt;stdlib.h&gt; #include &lt;thread&gt; #define NUM_ITERS 10000 #define NUM_THREADS 2 int counter = 0; int *p_counter = &amp;counter; void asm_inc() { int *p_counter = &amp;counter; for (int i = 0; i &lt; NUM_ITERS; ++i) { __asm__(&quot;incl (%0) \n\t&quot; : : &quot;r&quot; (p_counter)); } } int main () { std::thread t[NUM_THREADS]; for (int i = 0; i &lt; NUM_THREADS; ++i) { t[i] = std::thread(asm_inc); } for (int i = 0; i &lt; NUM_THREADS; ++i) { t[i].join(); } printf(&quot;Counter value: %i\n&quot;, counter); return 0; } </code></pre> <p>Compiling the above with <code>clang++ -std=c++11 -pthread</code>, I get the following distribution of results on two of my machines:</p> <p><img src="images/new-cpu-features/write-contention.png" alt="Different distributions of non-determinism on Haswell and Sandy Bridge" width="640" height="480"></p> <p>Not only do the results vary between runs, the distribution of results is different on different machines. We never hit the theoretical minimum of 2, or for that matter, anything below 10000, but there's some chance of getting a final result anywhere between 10000 and 20000.</p> <p>Even though <code>incl</code> is a single instruction, it's not guaranteed to be atomic. Internally, <code>incl</code> is implemented as a load followed by an add followed by an store. It's possible for an increment on cpu0 to sneak in and execute between the load and the store on cpu1 and visa versa.</p> <p>The solution Intel has for this is the <code>lock</code> prefix, which can be added to a handful of instructions to make them atomic. If we take the above code and turn <code>incl</code> into <code>lock incl</code>, the resulting output is always 20000.</p> <p>So, that's how we make a single instruction atomic. To make a sequence atomic, we can use <code>xchg</code> or <code>cmpxchg</code>, which are always locked as compare-and-swap primitives. 
I won't go into detail about how that works, but see <a href="http://davidad.github.io/blog/2014/03/23/concurrency-primitives-in-intel-64-assembly/">this article by David Dalrymple</a> if you're curious..</p> <p>In addition to making a memory transaction atomic, locks are globally ordered with respect to each other, and loads and stores aren't re-ordered with respect to locks.</p> <p>For a rigorous model of memory ordering, <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf">see the x86 TSO doc</a>.</p> <p>All of this discussion has been how about how concurrency works in hardware. Although there are limitations on what x86 will re-order, compilers don't necessarily have those same limitations. In C or C++, you'll need to insert the appropriate primitives to make sure the compiler doesn't re-order anything. <a href="http://yarchive.net/comp/linux/memory_barriers.html">As Linus points out here</a>, if you have code like</p> <pre><code>local_cpu_lock = 1; // .. do something critical .. local_cpu_lock = 0; </code></pre> <p>the compiler has no idea that <code>local_cpu_lock = 0</code> can't be pushed into the middle of the critical section. Compiler barriers are distinct from CPU memory barriers. Since the x86 memory model is relatively strict, some compiler barriers are no-ops at the hardware level that tell the compiler not to re-order things. If you're using a language that's higher level than microcode, assembly, C, or C++, your compiler probably handles this for you without any kind of annotation.</p> <h4 id="memory-porting">Memory / Porting</h4> <p>If you're porting code to other architectures, it's important to note that x86 has one of the strongest memory models of any architecture you're likely to encounter nowadays. If you write code that just works without thinking it through and port it to architectures that have weaker guarantees (PPC, ARM, or Alpha), you'll almost certainly have bugs.</p> <p>Consider this example:</p> <pre><code>Initial ----- x = 1; y = 0; p = &amp;x; CPU1 CPU2 ---- ---- i = *p; y = 1; MB; p = &amp;y; </code></pre> <p><code>MB</code> is a memory barrier. On an Alpha 21264 system, this can result in <code>i = 0</code>.</p> <p>Kourosh Gharachorloo explains how:</p> <blockquote> <p>CPU2 does y=1 which causes an &quot;invalidate y&quot; to be sent to CPU1. This invalidate goes into the incoming &quot;probe queue&quot; of CPU1; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on CPU1. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in CPU1's cache before sending the acknowledgment). Therefore, CPU2 can go through its MB. And it proceeds to do the write to p. Now CPU1 proceeds to read p. The reply for read p is allowed to bypass the probe queue on CPU1 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, CPU1 can derefence p to read the old value of y that is sitting in its cache (the invalidate y in CPU1's probe queue is still sitting there).</p> <p>How does an MB on CPU1 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB which pulls in the invalidate to y for sure. 
And you can no longer see the old cached value for y.</p> <p>Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, CPU1 will likely have ample opportunity to service the messages (i.e., invalidate) in its probe queue before it receives the data reply for &quot;read p&quot;. Nonetheless, if you get into a situation where you have placed many things in CPU1's probe queue ahead of the invalidate to y, then it is possible that the reply to p comes back and bypasses this invalidate. It would be difficult for you to set up the scenario though and actually observe the anomaly.</p> </blockquote> <p>This is long enough without my talking about other architectures so I won't go into detail, but if you're wondering why anyone would create a spec that allows this kind of optimization, consider that before rising fab costs crushed DEC, their chips were so fast that they could <a href="https://www.usenix.org/legacy/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf">run industry standard x86 benchmarks of real workloads in emulation faster than x86 chips could run the same benchmarks natively</a>. For more explanation of why the most RISC-y architecture of the time made the decisions it did, see <a href="dick-sites-alpha-axp-architecture.pdf">this paper on the motivations behind the Alpha architecture</a>.</p> <p>BTW, this is a major reason I'm skeptical of the Mill architecture. Putting aside arguments about whether or not they'll live up to their performance claims, being technically excellent isn't, in and of itself, a business model.</p> <h4 id="memory-non-temporal-stores-write-combine-memory">Memory / Non-Temporal Stores / Write-Combine Memory</h4> <p>The set of restrictions outlined in the previous section apply to cacheable (i.e., “write-back” or WB) memory. That, itself, was new at one time. Before that, there was only uncacheable (UC) memory.</p> <p>One of the interesting things about UC memory is that all loads and stores are expected to go out to the bus. That's perfectly reasonable in a processor with no cache and little to no on-board buffering. A result of that is that devices that have access to memory can rely on all accesses to UC memory regions creating separate bus transactions, in order (because some devices will use a memory read or write as as trigger to do something). That worked great in 1982, but it's not so great if you have a video card that just wants to snarf down whatever the latest update is. If multiple writes happen to the same UC location (or different bytes of the same word), the CPU is required to issue a separate bus transaction for each write, even though a video card doesn't really care about seeing each intervening result.</p> <p>The solution to that was to create a memory type called write combine (WC). WC is a kind of eventually consistent UC. Writes have to eventually make it to memory, but they can be buffered internally. WC memory also has weaker ordering guarantees than UC.</p> <p>For the most part, you don't have to deal with this unless you're talking directly with devices. The one exception are “non-temporal” load and store operations. These make particular loads and stores act like they're to WC memory, even if the address is in a memory region that's marked WB.</p> <p>This is useful if you don't want to pollute your caches with something. 
This is often useful if you're doing <a href="http://blogs.fau.de/hager/archives/2103">some kind of streaming calculation</a> where you know you're not going to use a particular piece of data more than once.</p> <h4 id="memory-numa">Memory / NUMA</h4> <p>Non-uniform memory access, where memory latencies and bandwidth are different for different processors, is so common that we mostly don't talk about NUMA or ccNUMA anymore because they're so common that it's assumed to be the default.</p> <p>The takeaway here is that threads that share memory should be on the same socket, and a memory-mapped I/O heavy thread should make sure it's on the socket that's closest to the I/O device it's talking to.</p> <p>I've mostly avoided explaining the <em>why</em> behind things because that would make this post at least an order of magnitude longer than it's going to be. But I'll give a vastly oversimplified explanation of why we have NUMA systems, partially because it's a self-contained thing that's relatively easy to explain and partially to demonstrate how long the why is compared to the what.</p> <p>Once upon a time, there was just memory. Then CPUs got fast enough relative to memory that people wanted to add a cache. It's bad news if the cache is inconsistent with the backing store (memory), so the cache has to keep some information about what it's holding on to so it knows if/when it needs to write things to the backing store.</p> <p>That's not too bad, but once you get 2 cores with their own caches, it gets a little more complicated. To maintain the same programming model as the no-cache case, the caches have to be consistent with each other and with the backing store. Because existing load/store instructions have nothing in their API that allows them to say <code>sorry! this load failed because some other CPU is holding onto the address you want</code>, the simplest thing was to have every CPU send a message out onto the bus every time it wanted to load or store something. We've already got this memory bus that both CPUs are connected to, so we just require that other CPUs respond with the data (and invalidate the appropriate cache line) if they have a modified version of the data in their cache.</p> <p>That works ok. Most of the time, each CPU only touches data the other CPU doesn't care about, so there's some wasted bus traffic. But it's not too bad because once a CPU puts out a message saying <code>Hi! I'm going to take this address and modify the data</code>, it can assume it completely owns that address until some other CPU asks for it, which will probably won't happen. And instead of doing things on a single memory address, we can operate on cache lines that have, say, 64 bytes. So, the overall overhead is pretty low.</p> <p>It still works ok for 4 CPUs, although the overhead is a bit worse. But this thing where each CPU has to respond to every other CPU's fails to scale much beyond 4 CPUs, both because the bus gets saturated and because the caches will get saturated (the physical size/cost of a cache is <code>O(n^2)</code> in the number of simultaneous reads and write supported, and the speed is inversely correlated to the size).</p> <p>A “simple” solution to this problem is to have a single centralized directory that keeps track of all the information, instead of doing N-way peer-to-peer broadcast. 
Since we're packing 2-16 cores on a chip now anyway, it's pretty natural to have a single directory per chip (socket) that tracks the state of the caches for every core on a chip.</p> <p>This only solves the problem for each chip, and we need some way for the chips to talk to each other. Unfortunately, while we were scaling these systems up the bus speeds got fast enough that it's really difficult to drive a signal far enough to connect up a bunch of chips and memory all on one bus, even for small systems. The simplest solution to that is to have each socket own a region of memory, so every socket doesn't need to be connected to every part of memory. This also avoids the complexity of needed a higher level directory of directories, since it's clear which directory owns any particular piece of memory.</p> <p>The disadvantage of this is that if you're sitting in one socket and want some memory owned by another socket, you have a significant performance penalty. For simplicity, most “small” (&lt; 128 core) systems use ring-like busses, so the performance penalty isn't just the direct latency/bandwidth penalty you pay for walking through a bunch of extra hops to get to memory, it also uses up a finite resource (the ring-like bus) and slows down other cross-socket accesses.</p> <p>In theory, the OS handles this transparently, but it's <a href="https://www.usenix.org/conference/atc14/technical-sessions/presentation/gaud">often inefficient</a>.</p> <h4 id="context-switches-syscalls">Context Switches / Syscalls</h4> <p>Here, syscall refers to a linux system call, not the <code>SYSCALL</code> or <code>SYSENTER</code> x86 instructions.</p> <p>A side effect of all the caching that modern cores have is that <a href="http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html">context switches are expensive</a>, which causes syscalls to be expensive. Livio Soares and Michael Stumm discuss the cost in great detail <a href="http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/papers/flexsc-osdi10.pdf">in their paper</a>. I'm going to use a few of their figures, below. Here's a graph of how many instructions per clock (IPC) a Core i7 achieves on Xalan, a sub-benchmark from SPEC CPU.</p> <p><img src="images/new-cpu-features/flexsc_ipc_recovery.png" alt="Long tail of overhead from a syscall. 14,000 cycles." width="268" height="147"></p> <p>14,000 cycles after a syscall, code is still not quite running at full speed.</p> <p>Here's a table of the footprint of a few different syscalls, both the direct cost (in instructions and cycles), and the indirect cost (from the number of cache and TLB evictions).</p> <p><img src="images/new-cpu-features/flexsc_cost.png" alt="Cost of stat, pread, pwrite, open+close, mmap+munmap, and open+write+close" width="502" height="108"></p> <p>Some of these syscalls cause 40+ TLB evictions! For a chip with a 64-entry d-TLB, that nearly wipes out the TLB. The cache evictions aren't free, either.</p> <p>The high cost of syscalls is the reason people have switched to using batched versions of syscalls for high-performance code (e.g., <code>epoll</code>, or <code>recvmmsg</code>) and the reason that people who need <a href="//danluu.com/clwb-pcommit/">very high performance I/O</a> often use user space I/O stacks. 
More generally, the cost of context switches is why high-performance code is often thread-per-core (or even single threaded on a pinned thread) and not thread-per-logical-task.</p> <p>This high cost was also the driver behind <a href="http://www.linuxjournal.com/content/creating-vdso-colonels-other-chicken">vDSO, which turns some simple syscalls that don't require any kind of privilege escalation into simple user space library calls</a>.</p> <h4 id="simd">SIMD</h4> <p>Basically all modern x86 CPUs support SSE, 128-bit wide vector registers and instructions. Since it's common to want to do the same operation multiple times, Intel added instructions that will let you operate on a 128-bit chunk of data as 2 64-bit chunks, 4 32-bit chunks, 8 16-bit chunks, etc. ARM supports the same thing with a different name (NEON), and the instructions supported are pretty similar.</p> <p>It's pretty common to get a 2x-4x speedup from using SIMD instructions; it's definitely worth looking into if you've got a computationally heavy workload.</p> <p>Compilers are good enough at recognizing common patterns that can be vectorized that simple code, like the following, will automatically use vector instructions with modern compilers</p> <pre><code>for (int i = 0; i &lt; n; ++i) { sum += a[i]; } </code></pre> <p>But compilers will <a href="//danluu.com/assembly-intrinsics/">often produce non-optimal code</a> if you don't write the assembly by hand, especially for SIMD code, so you'll want to look at the disassembly and check for compiler optimization bugs if you really care about getting the best possible performance.</p> <h4 id="power-management">Power Management</h4> <p>There are a lot of fancy power management feature on modern CPUs that <a href="//danluu.com/datacenter-power/">optimize power usage</a> in different scenarios. The result of these is that “race to idle”, completing work as fast as possible and then letting the CPU go back to sleep is the most power efficient way to work.</p> <p>There's been a lot of work that's shown that specific microoptmizations can benefit power consumption, but <a href="http://arcade.cs.columbia.edu/energy-oopsla14.pdf">applying those microoptimizations on real workloads often results in smaller than expected benefits</a>.</p> <h4 id="gpu-gpgpu">GPU / GPGPU</h4> <p>I'm even less qualified to talk about this than I am about the rest of this stuff. Luckily, Cliff Burdick volunteered to write a section on GPUs, so here it is.</p> <p>Prior to the mid-2000's, Graphical Processing Units (GPUs) were restricted to an API that allowed only a very limited amount of control of the hardware. As the libraries became more flexible, programmers began using the processors for more general-purpose tasks, such as linear algebra routines. The parallel architecture of the GPU could work on large chunks of a matrix by launching hundreds of simultaneous threads. However, the code had to use traditional graphics APIs and was still limited in how much of the hardware it could control. Nvidia and ATI took notice and released frameworks that allowed the user to access more of the hardware with an API familiar with people outside of the graphics industry. The libraries gained popularity, and today GPUs are widely used for high-performance computing (HPC) alongside CPUs.</p> <p>Compared to CPUs, the hardware on GPUs have a few major differences, outlined below:</p> <h5 id="processors">Processors</h5> <p>At the top level, a GPU processor contains one or many streaming multiprocessors (SMs). 
Each streaming multiprocessor on a modern GPU typically contains over 100 floating point units, or what are typically referred to as cores in the GPU world. Each core is typically clocked around 800MHz, although, like CPUs, processors with higher clock rates but fewer cores are also available. GPU processors lack many features of their CPU counterparts, including large caches and branch prediction. Between the layers of cores, SMs, and the overall processor, communicating becomes increasingly slower. For this reason, problems that perform well on GPUs are typically highly-parallel, but have some amount of data that can be shared between a small number of threads. We'll get into why this is in the memory section below.</p> <h5 id="memory">Memory</h5> <p>Memory on modern GPU is broken up into 3 main categories: global memory, shared memory, and registers. Global memory is the GDDR memory that's advertised on the box of the GPU and is typically around 2-12GB in size, and has a throughput of 300-400GB/s. Global memory can be accessed by all threads across all SMs on the processor, and is also the slowest type of memory on the card. Shared memory is, as the name says, memory that's shared between all threads within the same SM. It is usually at least twice as fast as global memory, but is not accessible between threads on different SMs. Registers are much like registers on a CPU in that they are the fastest way to access data on a GPU, but they are local per thread and the data is not visible to any other running thread. Both shared memory and global memory have very strict rules on how they can be accessed, with severe performance penalties for not following them. To reach the throughputs mentioned above, memory accesses must be completely coalesced between threads within the same thread group. Similar to a CPU reading into a single cache line, GPUs have cache lines sized so that a single access can serve all threads in a group if aligned properly. However, in the worst case where all threads in a group access memory in a different cache line, a separate memory read will be required for each thread. This usually means that most of the data in the cache line is not used by the thread, and the usable throughput of the memory goes down. A similar rule applies to shared memory as well, with a couple exceptions that we won't cover here.</p> <h5 id="threading-model">Threading Model</h5> <p>GPU threads run in a SIMT (Single Instruction Multiple Thread) fashion, and each thread runs in a group with a pre-defined size in the hardware (typically 32). That last part has many implications; every thread in that group must be working on the same instruction at the same time. If any of the threads in a group need to take a divergent path (an if statement, for example) of code from the others, all threads not part of the branch suspend execution until the branch is complete. As a trivial example:</p> <pre><code>if (threadId &lt; 5) { // Do something } // Do More </code></pre> <p>In the code above, this branch would cause 27 of our 32 threads in the group to suspend execution until the branch is complete. You can imagine if many groups of threads all run this code, the overall performance will take a large hit while most of the cores sit idle. Only when an entire group of threads is stalled is the hardware allowed to swap in another group to run on those cores.</p> <h5 id="interfaces">Interfaces</h5> <p>Modern GPUs must have a CPU to copy data to and from CPU and GPU memory, and to launch and code on the GPU. 
At the highest throughput, a PCIe 3.0 bus with 16 lanes can achieves rates of about 13-14GB/s. This may sound high, but when compared to the memory speeds residing on the GPU itself, they're over an order of magnitude slower. In fact, as GPUs get more powerful, the PCIe bus is increasingly becoming a bottleneck. To see any of the performance benefits the GPU has over a CPU, the GPU must be loaded with a large amount of work so that the time the GPU takes to run the job is significantly higher than the time it takes to copy the data to and from.</p> <p>Newer GPUs have features to launch work dynamically in GPU code without returning to the CPU, but it's fairly limited in its use at this point.</p> <h5 id="gpu-conclusion">GPU Conclusion</h5> <p>Because of the major architectural differences between CPUs and GPUs, it's hard to imagine either one replacing the other completely. In fact, a GPU complements a CPU well for parallel work and allows the CPU to work independently on other tasks as the GPU is running. AMD is attempting to merge the two technologies with their &quot;Heterogeneous System Architecture&quot; (HSA), but taking existing CPU code and determining how to split it between the CPU and GPU portion of the processor will be a big challenge not only for the processor, but for compilers as well.</p> <h4 id="virtualization">Virtualization</h4> <p>Since you mentioned virtualization, I'll talk about it a bit, but Intel's implementation of virtualization instructions generally isn't something you need to think about unless you're writing very low-level code that directly deals with virtualization.</p> <p>Dealing with that stuff is pretty messy, as you can see <a href="https://github.com/vishmohan/vmlaunch">from this code</a>. Setting stuff up to use Intel's VT instructions to launch a VM guest is about 1000 lines of low-level code, even for the very simple case shown there.</p> <h4 id="virtual-memory">Virtual Memory</h4> <p>If you look at Vish's VT code, you'll notice that there's a decent chunk of code dedicated to page tables / virtual memory. That's another “new” feature that you don't have to worry about unless you're writing an OS or other low-level systems code. Using virtual memory is much simpler than using segmented memory, but that's not relevant nowadays so I'll just leave it at that.</p> <h4 id="smt-hyper-threading">SMT / Hyper-threading</h4> <p>Since you brought it up, I'll also mention SMT. As you said, this is mostly transparent for programmers. A typical speedup for enabling SMT on a single core is around 25%. That's good for overall throughput, but it means that each thread might only get 60% of its original performance. For applications where you care a lot about single-threaded performance, you might be better off disabling SMT. It depends a lot on the workload, though, and as with any other changes, you should run some benchmarks on your exact workload to see what works best.</p> <p>One side effect of all this complexity that's been added to chips (and software) is that performance is a lot less predictable than it used to be; the relative importance of benchmarking your exact workload on the specific hardware it's going to run on has gone up.</p> <p>Just for example, people often point to benchmarks from the <a href="http://benchmarksgame.alioth.debian.org/">Computer Languages Benchmarks Game</a> as evidence that one language is faster than another. 
I've tried reproducing the results myself, and on my mobile Haswell (as opposed to the server Kentsfield that's used in the results), I get results that are different by as much as 2x (in relative speed). Running the same benchmark on the same machine, Nathan Kurz recently pointed me to an example where <code>gcc -O3</code> is 25% slower than <code>gcc -O2</code>. <a href="http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf">Changing the linking order on C++ programs can cause a 15% performance change</a>. Benchmarking is a hard problem.</p> <h4 id="branches">Branches</h4> <p>Old school conventional wisdom is that branches are expensive, and should be avoided at all (or most) costs. On a Haswell, the branch misprediction penalty is 14 cycles. Branch mispredict rates depend on the workload. Using <code>perf stat</code> on a few different things (bzip2, top, mysqld, regenerating my blog), I get branch mispredict rates of between 0.5% and 4%. If we say that a correctly predicted branch costs 1 cycle, that's an average cost of between <code>.995 * 1 + .005 * 14 = 1.065 cycles</code> to <code>.96 * 1 + .04 * 14 = 1.52 cycles</code>. That's not so bad.</p> <p>This actually overstates the penalty since about 1995, since Intel added conditional move instructions that allow you to conditionally move data without a branch. This instruction was memorably <a href="http://yarchive.net/comp/linux/cmov.html">panned by Linus</a>, which has given it a bad reputation, but it's fairly common to <a href="https://github.com/logicchains/LPATHBench/issues/53#issuecomment-68160081">get significant speedups using cmov compared to branches</a></p> <p>A real-world example of the cost of extra branches are enabling integer overflow checks. When using <a href="//danluu.com/integer-overflow/">bzip2 to compress a particular file, that increases the number of instructions by about 30% (with all of the increase coming from extra branch instructions), which results in a 1% performance hit</a>.</p> <p>Unpredictable branches are bad, but most branches are predictable. Ignoring the cost of branches until your profiler tells you that you have a hot spot is pretty reasonable nowadays. CPUs have gotten a lot better at executing poorly optimized code over the past decade, and compilers are getting better at optimizing code, which makes optimizing branches a poor use of time unless you're trying to squeeze out the absolute best possible performance out of some code.</p> <p>If it turns out that's what you need to do, you're likely to be better off using <a href="http://en.wikipedia.org/wiki/Profile-guided_optimization">profile-guided optimization</a> than trying to screw with this stuff by hand.</p> <p>If you really must do this by hand, there are compiler directives you can use to say whether a particular branch is likely to be taken or not. Modern CPUs ignore branch hint instructions, but they can help the compiler lay out code better.</p> <h4 id="alignment">Alignment</h4> <p>Old school conventional wisdom is that you should pad out structs and make sure things are aligned. But on a Haswell chip, the mis-alignment for almost any single-threaded thing you can think of that doesn't cross a page boundary is zero. There are some cases where it can make a difference, but in general, this is another type of optimization that's mostly irrelevant because CPUs have gotten so much better at executing bad code. 
It's also mildly harmful in cases where it increases the memory footprint for no benefit.</p> <p>Also, <a href="//danluu.com/3c-conflict/">don't make things page aligned or otherwise aligned to large boundaries or you'll destroy the performance of your caches</a>.</p> <h4 id="self-modifying-code">Self-modifying code</h4> <p>Here's another optimization that doesn't really make sense anymore. Using self-modifying code to decrease code size or increase performance used to make sense, but because modern chips split their L1 instruction and data caches, modifying running code requires expensive communication between a chip's L1 caches.</p> <h3 id="the-future">The Future</h3> <p>Here are some possible changes, from least speculative to most speculative.</p> <h4 id="partitioning">Partitioning</h4> <p>It's now obvious that more and more compute is moving into large datacenters. Sometimes this involves running on VMs, sometimes it involves running in some kind of container, and sometimes it involves running bare metal, but in any case, individual machines are often multiplexed to run a wide variety of workloads. Ideally, you'd be able to schedule best-effort workloads to soak up stranded resources without affecting latency-sensitive workloads with an SLA. It turns out that you can actually do this with some <a href="//danluu.com/intel-cat/">relatively straightforward hardware changes</a>.</p> <p><img src="images/new-cpu-features/google_heracles.png" alt="90% overall machine utilization" width="338" height="209"></p> <p><a href="http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf">David Lo et al. were able to show that you can get about 90% machine utilization without impacting latency SLAs</a> if caches can be partitioned such that best-effort workloads don't impact latency-sensitive workloads. The solid red line is the load on a normal Google web search cluster, and the dashed green line is what you get with the appropriate optimizations. From bar-room conversations, my impression is that the solid red line is actually already better (higher) than most of Google's competitors are able to do. If you compare the optimized 90% utilization to typical server utilization of 10% to 90%, that's a massive difference in cost per unit of work compared to running a naive, unoptimized setup. With substantial hardware effort, Google was able to avoid interference, but additional isolation features could allow this to be done at higher efficiency with less effort.</p> <h4 id="transactional-memory-and-hardware-lock-elision">Transactional Memory and Hardware Lock Elision</h4> <p>IBM already has these features in their POWER chips. Intel made an attempt to add these to Haswell, but they're disabled because of a bug. In general, modern CPUs are quite complex <a href="//danluu.com/cpu-bugs/">and we should expect to see many more bugs than we used to</a>.</p> <p>Transactional memory support is what it sounds like: hardware support for transactions. This comes via three new instructions, <code>xbegin</code>, <code>xend</code>, and <code>xabort</code>.</p> <p><code>xbegin</code> starts a new transaction. A conflict (or an <code>xabort</code>) causes the architectural state of the processor (including memory) to get rolled back to the state it was in just prior to the <code>xbegin</code>. If you're using transactional memory via library or language support, this should be transparent to you.
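If you're curious what the raw interface looks like, here's a minimal sketch (my own example, untested on real TSX hardware) using the RTM intrinsics that GCC and other compilers expose for these instructions via <code>immintrin.h</code> (compiled with <code>-mrtm</code>); the standard idiom is to try a transaction and fall back to an ordinary lock if it aborts:</p> <pre><code>#include &lt;immintrin.h&gt;  // _xbegin / _xend / _xabort; compile with -mrtm
#include &lt;atomic&gt;

static std::atomic&lt;bool&gt; fallback_locked{false};
static long counter;

static void lock_fallback() {
    while (fallback_locked.exchange(true, std::memory_order_acquire)) {
        // spin until the lock is free
    }
}

static void unlock_fallback() {
    fallback_locked.store(false, std::memory_order_release);
}

void increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Standard lock-elision idiom: read the fallback lock inside the
        // transaction so that anyone who takes it aborts us.
        if (fallback_locked.load(std::memory_order_relaxed)) {
            _xabort(0xff);
        }
        counter++;
        _xend();  // commit; a conflict would have rolled us back instead
    } else {
        // Aborted (conflict, capacity, interrupt, lock held, ...): take the lock.
        lock_fallback();
        counter++;
        unlock_fallback();
    }
}
</code></pre> <p>The fallback path is what makes this usable in practice, since the hardware never guarantees that any given transaction will eventually commit.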
If you're implementing the library support, you'll have to figure out how to convert this hardware support, with its limited hardware buffer sizes, to something that will handle arbitrary transactions.</p> <p>I'm not going to discuss Hardware Lock Elision except to say that, under the hood, it's implemented with mechanisms very similar to those used to implement transactional memory, and that it's designed to speed up lock-based code. If you want to take advantage of HLE, see <a href="http://mcg.cs.tau.ac.il/papers/amir-levy-msc.pdf">this doc</a>.</p> <h4 id="fast-i-o">Fast I/O</h4> <p>I/O bandwidth is going up and I/O latencies are going down, both for storage and for networking. The problem is that I/O is normally done via syscalls. As we've seen, the relative overhead of syscalls has been going up. For both storage and networking, the answer is to move to <a href="http://research.microsoft.com/pubs/198366/Asplos2012_MonetaD.pdf">user mode I/O stacks</a> (putting everything in kernel mode would work, too, but that's a harder sell). On the storage side, that's mostly still a weirdo research thing, but HPC and HFT folks have been doing that in networking for a while. And by a while, I don't mean a few months. Here's <a href="http://web.stanford.edu/group/comparch/papers/huggahalli05.pdf">a paper from 2005 that talks about the networking stuff I'm going to discuss, as well as some stuff I'm not going to discuss (DCA)</a>.</p> <p>This is finally trickling into the non-supercomputing world. MS has been advertising Azure instances with InfiniBand networking and virtualized RDMA for over a year, Cloudflare <a href="http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/">has talked about using Solarflare NICs to get the same capability</a>, etc. Eventually, we're going to see SoCs with fast Ethernet onboard, and unless that's limited to Xeon-type devices, it's going to trickle down into all devices. Competition among ARM device makers will probably cause at least one of them to put fast Ethernet on their commodity SoCs, which may force Intel's hand.</p> <p>That RDMA bit is significant; it lets you bypass the CPU completely and have the NIC respond to remote requests. A couple months ago, I worked through the Stanford/Coursera Mining Massive Data Sets class. During one of the first lectures, they provide an example of a “typical” datacenter setup with 1Gb top-of-rack switches. That's not unreasonable for processing “massive” data if you're doing kernel TCP through non-RDMA NICs, since you can floor an entire core trying to push 1Gb/s through Linux's TCP stack. But with Azure, MS talks about getting 40Gb out of a single machine; that's one machine getting 40x the bandwidth of what you might expect out of an entire rack. They also mention sub-2us latencies, which are multiple orders of magnitude lower than what you can get out of kernel TCP. This isn't exactly a new idea. <a href="http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf">This paper from 2011</a> predicts everything that's happened on the network side so far, along with some things that are still a ways off.</p> <p><a href="https://www.youtube.com/watch?v=8Kyoj3bKepY&amp;feature=youtu.be&amp;t=20m8s">This MS talk discusses how you can take advantage of this kind of bandwidth and latency</a> for network storage. A concrete example that doesn't require clicking through to a link is Amazon's EBS. It lets you use an “elastic” disk of arbitrary size on any of your AWS nodes.
Since a spinning metal disk seek has higher latency than an RPC over kernel TCP, you can get effectively unlimited storage pretty much transparently. For example, say you can get <a href="http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf">100us (.1ms) latency out of your network</a>, and your disk seek time is 8ms. That makes a remote disk access 8.1ms instead of 8ms, which isn't that much overhead. That doesn't work so well with SSDs, though, since you can get 20 us (.02ms) <a href="http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme/3">out of an SSD</a>. But RDMA latency is low enough that a transparent EBS-like layer is possible for SSDs.</p> <p>So that's networked I/O. The performance benefit might be even bigger on the disk side, if/when next generation storage technologies that are faster than flash start getting deployed. The performance delta is so large that <a href="//danluu.com/clwb-pcommit/">Intel is adding new instructions to keep up with next generation low-latency storage technology</a>. Depending on who you ask, that stuff has been a few years away for a decade or two; this is more iffy than the networking stuff. But even with flash, people are showing off devices that can get down into the single microsecond range for latency, which is a substantial improvement.</p> <h4 id="hardware-acceleration">Hardware Acceleration</h4> <p>Like fast networked I/O, this is already here in some niches. <a href="http://www.deshawresearch.com/publications.html">DESRES has been doing ASICs to get 100x-1000x speedup in computational chemistry for years</a>. <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=212001">Microsoft has talked about speeding up search with FPGAs</a>. People have been <a href="http://web.eecs.umich.edu/~twenisch/papers/isca13.pdf">looking into accelerating memcached and similar systems for a while</a>, <a href="http://csl.stanford.edu/~christos/publications/2014.hwkvs.nvmw.slides.pdf">researchers from Toshiba and Stanford demonstrated a real implementation a while back</a>, and I recently saw a pre-print out of Berkeley on the same thing. There are multiple companies making Bitcoin mining ASICs. That's also true for <a href="http://www.neuflow.org/">other</a> <a href="http://www.artificiallearning.com/products/">application</a> <a href="https://github.com/cambridgehackers/">areas</a>.</p> <p>It seems like we should see more of this as it gets harder to get power/performance gains out of CPUs. You might consider this a dodge of your question, if you think of programming as being a software-oriented endeavor, but another way to look at it is that what it means to program something will change. In the future, it might mean designing hardware like an FPGA or ASIC in combination with writing software.</p> <h5 id="update">Update</h5> <p>Now that it's 2016, one year after this post was originally published, we can see that companies are investing in hardware accelerators. In addition to its previous work on FPGA-accelerated search, Microsoft <a href="https://www.youtube.com/watch?v=RffHFIhg5Sc&amp;feature=youtu.be&amp;t=23m30s">has announced that it's using FPGAs to accelerate networking</a>. Google has been closed-mouthed about infrastructure, as is typical for them, but if you look at the initial release of Tensorflow, you can see snippets of code that clearly reference FPGAs, such as:</p> <pre><code>enum class PlatformKind {
  kInvalid,
  kCuda,
  kOpenCL,
  kOpenCLAltera,  // Altera FPGA OpenCL platform.
                  // See documentation: go/fpgaopencl
                  // (StreamExecutor integration)
  kHost,
  kMock,
  kSize,
};
</code></pre> <p>and</p> <pre><code>string PlatformKindString(PlatformKind kind) {
  switch (kind) {
    case PlatformKind::kCuda:
      return &quot;CUDA&quot;;
    case PlatformKind::kOpenCL:
      return &quot;OpenCL&quot;;
    case PlatformKind::kOpenCLAltera:
      return &quot;OpenCL+Altera&quot;;
    case PlatformKind::kHost:
      return &quot;Host&quot;;
    case PlatformKind::kMock:
      return &quot;Mock&quot;;
    default:
      return port::StrCat(&quot;InvalidPlatformKind(&quot;, static_cast&lt;int&gt;(kind), &quot;)&quot;);
  }
}
</code></pre> <p>As of this writing, Google doesn't return any results for <code>+google +kOpenCLAltera</code>, so it doesn't appear that this has been widely observed. If you're not familiar with Altera OpenCL and you work at Google, you can try the internal go link suggested in the comment, <code>go/fpgaopencl</code>. If, like me, you don't work at Google, well, <a href="https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html">there's Altera's docs here</a>. The basic idea is that you can take OpenCL code, the same kind of thing you might run on a GPU, and run it on an FPGA instead, and from the comment, it seems like Google has some kind of setup that lets you stream data in and out of nodes with FPGAs.</p> <p>That FPGA-specific code was removed in ddd4aaf5286de24ba70402ee0ec8b836d3aed8c7, which has a commit message that starts with “TensorFlow: upstream changes to git.” and then has a list of internal Google commits that are being upstreamed, along with a description of each internal commit. Curiously, there's nothing about removing FPGA support even though that seems like it's a major enough thing that you'd expect it to be described, unless it was purposely redacted. Amazon has also been quite secretive about their infrastructure plans, but you can make reasonable guesses there by looking at the hardware people they've been vacuuming up. A couple other companies are also betting pretty heavily on hardware accelerators, but since I learned about that through private conversations (as opposed to accidentally published public source code or other public information), I'll leave you to guess which companies.</p> <h4 id="dark-silicon-socs">Dark Silicon / SoCs</h4> <p>One funny side effect of the way transistor scaling has turned out is that we can pack a ton of transistors on a chip, but they generate so much heat that the average transistor can't switch most of the time if you don't want your chip to melt.</p> <p>A result of this is that it makes more sense to include dedicated hardware that isn't used a lot of the time. For one thing, this means we get all sorts of specialized instructions like the <a href="http://www.felixcloutier.com/x86/PCMPESTRI.html">PCMP</a> and <a href="http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf">ADX</a> instructions. But it also means that we're getting chips with entire devices integrated that would have previously lived off-chip. That includes things like GPUs and (for mobile devices) radios.</p> <p>In combination with the hardware acceleration trend, it also means that it makes more sense for companies to design their own chips, or at least parts of their own chips. Apple has gotten a lot of mileage out of acquiring PA Semi: first by adding little custom accelerators to bog-standard ARM architectures, and then by adding custom accelerators to their own custom architecture.
Due to a combination of the right custom hardware plus well-thought-out benchmarking and system design, the iPhone 4 is slightly more responsive than my flagship Android phone, which is multiple years newer and has a much faster processor as well as more RAM.</p> <p>Amazon has picked up a decent chunk of the old Calxeda team and is hiring enough to create a good-sized hardware design team. Facebook has picked up a small handful of ARM SoC folks and is partnering with Qualcomm on something-or-other. <a href="http://www.realworldtech.com/forum/?threadid=146066&amp;curpostid=146227">Linus is on record as saying we're going to see more dedicated hardware all over the place</a>. And so on and so forth.</p> <h3 id="conclusion">Conclusion</h3> <p>x86 chips have picked up a lot of new features and whiz-bang gadgets. For the most part, you don't have to know what they are to take advantage of them. As a first-order approximation, making your code predictable and keeping memory locality in mind works pretty well. The really low-level stuff is usually hidden by libraries or drivers, and compilers will try to take care of the rest of it. The exceptions are if you're writing really low-level code, in which case <a href="https://github.com/vishmohan/vmlaunch/blob/master/vmlaunch_simple.c">the world has gotten a lot messier</a>, or if you're trying to get the absolute best possible performance out of your code, in which case <a href="https://docs.google.com/document/d/18gs0bkEwQ5cO8pMXT_MsOa8Xey4NEavXq-OvtdUXKck/pub">the world has gotten a lot weirder</a>.</p> <p>Also, things will happen in the future. But most predictions are wrong, so who knows?</p> <h3 id="resources">Resources</h3> <p><a href="https://www.youtube.com/watch?v=hgcNM-6wr34">This is a talk by Matt Godbolt</a> that covers a lot of the implementation details that I don't get into. To go down one more level of detail, see <a href="http://www.amazon.com/gp/product/1478607831/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1478607831&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=HXBNFFJ3CIXMWUZP">Modern Processor Design, by Shen and Lipasti</a>. Despite the date listed on Amazon (2013), the book is pretty old, but it's still the best book I've found on processor internals. It describes, in good detail, what you need to implement to make a P6-era high-performance CPU. It also derives theoretical performance limits given different sets of assumptions and talks about a lot of different engineering tradeoffs, with explanations of the reasoning behind a lot of them.</p> <p>For one level deeper of &quot;why&quot;, you'll probably need to look at a VLSI text, which will explain how devices and interconnect scale and how that affects circuit design, which in turn affects architecture. I really like <a href="http://www.amazon.com/gp/product/B008VIXPI2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=B008VIXPI2&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=YOJVAWVH5XTIF6LN">Weste &amp; Harris</a> because they have clear explanations and good exercises with solutions that you can find online, but if you're not going to work the problems pretty much any VLSI text will do. To go one level deeper still into the &quot;why&quot; of things, you'll want a solid state devices text and something that explains how transmission lines and interconnect work.
For devices, I really like <a href="http://www.amazon.com/gp/product/0201543931/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0201543931&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=MXV5K7IJXWXJD446">Pierret's</a> <a href="http://www.amazon.com/gp/product/013061792X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=013061792X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=N7YLCANVDPI3M35R">books</a>. I got introduced to the E-mag stuff through <a href="http://www.amazon.com/gp/product/8126515252/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=8126515252&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=XZUQXDRDJVABGITB">Ramo, Whinnery &amp; Van Duzer</a>, but <a href="http://www.amazon.com/gp/product/3319078054/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=3319078054&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=GTQSYCVTMIFZKSPZ">Ida</a> is a better intro text.</p> <p>For specifics about current generation CPUs and specific optimization techniques, <a href="http://www.agner.org/optimize/">see Agner Fog's site</a>. For something on <a href="//danluu.com/perf-tracing/">optimization tools from the future, see this post</a>. <a href="http://www.akkadia.org/drepper/cpumemory.pdf">What Every Programmer Should Know About Memory</a> is also good background knowledge. Those docs cover a lot of important material, but if you're writing in a higher-level language there <a href="http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html">are a lot of other things you need to keep in mind</a>. For more on Intel CPU history, Xiao-Feng Li <a href="https://people.apache.org/~xli/presentations/history_Intel_CPU.pdf">has a nice overview</a>.</p> <p>For something a bit off the wall, see <a href="//danluu.com/cpu-backdoors/">this post on the possibility of <b>CPU backdoors</b></a>. For something less off the wall, see <a href="//danluu.com/cpu-bugs/">this post on how the complexity we have in modern CPUs enables <b>all sorts of exciting bugs</b></a>.</p> <p>For more benchmarks on locking, see <a href="http://shipilev.net/blog/2014/on-the-fence-with-dependencies/">this post by Aleksey Shipilev</a>, <a href="http://www.pvk.ca/Blog/2015/01/13/lock-free-mutual-exclusion/">this post by Paul Khuong</a>, as well as their archives.</p> <p>For general benchmarking, last year's <a href="https://www.youtube.com/watch?v=XmImGiVuJno">Strange Loop benchmarking talk by Aysylu Greenberg</a> is a nice intro to common gotchas. For something more advanced but more specific, <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">Gil Tene's talk on latency</a> is great.</p> <p>For historical computing that predates everything I've mentioned by quite some time, see <a href="http://www.amazon.com/gp/product/0262523930/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0262523930&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=NB4TSSXVLME77X4B">IBM's Early Computers</a> and <a href="http://ygdes.com/CDC/DesignOfAComputer_CDC6600.pdf">Design of a Computer</a>, which describes the design of the CDC 6600.
<a href="http://www.amazon.com/gp/product/1558605398/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1558605398&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=MCCNUSAAP4B6NZAP">Readings in Computer Architecture</a> is also good for seeing where a lot of these ideas originally came from.</p> <p>Sorry, this list is pretty incomplete. Suggestions welcome!</p> <p><small></p> <h4 id="tiny-disclaimer">Tiny Disclaimer</h4> <p>I have no doubt that I'm leaving stuff out. <a href="https://twitter.com/danluu/">Let me know</a> if I'm leaving out anything you think is important and I'll update this. I've tried to keep things as simple as possible while still capturing the flavor of what's going on, but I'm sure that there are some cases where I'm oversimplifying, and some things that I just completely forgot to mention. And of course basically every generalization I've made is wrong if you're being <em>really</em> precise. Even just picking at my first couple sentences, A20M isn't always and everywhere irrelevant (I've probably spent about 0.2% of my career dealing with it), x86-64 isn't strictly superior to x86 (on one workload I had to deal with, the performance benefit from the extra registers was more than canceled out by the cost of the longer instructions; it's pretty rare that the instruction stream and icache misses are the long pole for a workload, but it happens), etc. The biggest offender is probably in my NUMA explanation, since it is actually possible for P6 busses to respond with a defer or retry to a request. It's reasonable to avoid using a similar mechanism to enforce coherency but I couldn't think of a reasonable explanation of why that didn't involve multiple levels of explanations. I'm really not kidding when I say that pretty much every generalization falls apart if you dig deep enough. Every abstraction I'm talking about is leaky. I've tried to include links to docs that go at least one level deeper, but I'm probably missing some areas.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Leah Hanson and Nathan Kurz for comments that results in major edits, and to Nicholas Radcliffe, Stefan Kanthak, Garret Reid, Matt Godbolt, Nikos Patsiouras, Aleksey Shipilev, and Oscar R Moll for comments that resulted in minor edits, and to David Albert for allowing me to quote him and also for some interesting follow-up questions when we talked about this a while back. Also, thanks for Cliff Burdick for writing the section on GPUs and for Hari Angepat for spotting the Google kOpenCLAltera code in TensorFlow. </small></p> <hr /> A review of the Julia language julialang/ Sun, 28 Dec 2014 00:00:00 +0000 julialang/ <p>Here's a language that gives near-C performance that feels like Python or Ruby with optional type annotations (that you can feed to <a href="https://github.com/tonyhffong/Lint.jl">one of two</a> <a href="https://github.com/astrieanna/TypeCheck.jl">static analysis tools</a>) that has good support for macros plus decent-ish support for FP, plus a lot more. What's not to like? I'm mostly not going to talk about how great Julia is, though, because you can find plenty of blog posts that do that all over the internet.</p> <p>The last time I used Julia (around Oct. 2014), I ran into two new (to me) bugs involving bogus exceptions when processing Unicode strings. To work around those, I used a try/catch, but of course that runs into a non-deterministic bug I've found with try/catch. 
I also hit a bug where a function returned a completely wrong result if you passed it an argument of the wrong type instead of throwing a &quot;no method&quot; error. I spent half an hour writing a throwaway script and ran into four bugs in the core language.</p> <p>The second to last time I used Julia, I ran into too many bugs to list; the worst of them made generating plots take 30 seconds per plot, which caused me to switch to R/ggplot2 for plotting. First, there was <a href="https://github.com/dcjones/Gadfly.jl/issues/462">this bug where plotting dates didn't work</a>. When I worked around that I ran into a regression that caused plotting to <a href="https://github.com/JuliaLang/julia/issues/8631">break large parts of the core language</a>, so that data manipulation had to be done before plotting. That would have been fine if I knew exactly what I wanted, but for exploratory data analysis I want to plot some data, do something with the data, and then plot it again. Doing that required restarting the REPL for each new plot. That would have been fine, except that it takes 22 seconds to load Gadfly on my 1.7GHz Haswell (timed by using <code>time</code> on a file that loads Gadfly and does no work), plus another 10-ish seconds to load the other packages I was using, turning my plotting workflow into: restart REPL, wait 30 seconds, make a change, make a plot, look at a plot, repeat.</p> <p>It's not unusual to run into bugs when using a young language, but Julia has more than its share of bugs for something at its level of maturity. If you look at the test process, that's basically inevitable.</p> <p>As far as I can tell, FactCheck is the most commonly used thing resembling a modern test framework, and it's barely used. Until quite recently, it was unmaintained and broken, but even now the vast majority of tests are written using <code>@test</code>, which is basically an assert. It's theoretically possible to write good tests by having a file full of test code and asserts. But in practice, anyone who's doing that isn't serious about testing and isn't going to write good tests.</p> <p>Not only are existing tests not very good, most things aren't tested at all. You might point out that the coverage stats for a lot of packages aren't so bad, but last time I looked, there was a bug in the coverage tool that caused it to only aggregate coverage statistics for functions with non-zero coverage. That is to say, code in untested functions doesn't count towards the coverage stats! That, plus the weak notion of test coverage that's used (line coverage<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup>), makes the coverage stats unhelpful for determining if packages are well tested.</p> <p>The lack of testing doesn't just mean that you run into regression bugs. Features just disappear at random, too. When the REPL got rewritten, a lot of existing shortcut keys and other features stopped working. As far as I can tell, that wasn't because anyone wanted it to work differently. It was because there's no way to re-write something that isn't tested without losing functionality.</p> <p>Something that goes hand-in-hand with the level of testing on most Julia packages (and the language itself) is the lack of a good story for error handling. Although you can easily use <code>Nullable</code> (the Julia equivalent of <code>Some/None</code>) or error codes in Julia, the most common idiom is to use exceptions.
And if you use things in <code>Base</code>, like arrays or <code>/</code>, you're stuck with exceptions. I'm <a href="http://blogs.msdn.com/b/oldnewthing/archive/2005/01/14/352949.aspx">not a fan</a>, but that's fine -- plenty of reliable software uses exceptions for error handling.</p> <p>The problem is that because the niche Julia occupies doesn't care<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">2</a></sup> about error handling, it's extremely difficult to write a robust Julia program. When you're writing smaller scripts, you often want to “fail-fast” to make debugging easier, but for some programs, you want the program to do something reasonable, keep running, and maybe log the error. It's hard to write a robust program, even for this weak definition of robust. There are problems at multiple levels. For the sake of space, I'll just list two.</p> <p>If I'm writing something I'd like to be robust, I really want function documentation to include all exceptions the function might throw. Not only do the Julia docs not have that, it's common to call some function and get a random exception that has to do with an implementation detail and nothing to do with the API. Everything I've written that actually has to be reliable has been exception free, so maybe that's normal when people use exceptions? Seems pretty weird to me, though.</p> <p>Another problem is that catching exceptions doesn't work (sometimes, at random). I ran into one bug where using exceptions caused code to be incorrectly optimized out. You might say that's not fair because it was <a href="https://github.com/danluu/Fuzz.jl">caught using a fuzzer</a>, and fuzzers are supposed to find bugs, but the fuzzer wasn't fuzzing exceptions or even expressions. The implementation of the fuzzer just happens to involve eval'ing function calls, in a loop, with a <code>try/catch</code> to handle exceptions. Turns out, if you do that, the function might not get called. This isn't a case of using a fuzzer to generate billions of tests, one of which failed. This was a case of trying one thing, and that one thing failed. That bug is now fixed, but there's still a nasty bug that causes exceptions to sometimes fail to be caught by <code>catch</code>, which is pretty bad news if you're putting something in a <code>try/catch</code> block because you don't want an exception to trickle up to the top level and kill your program.</p> <p>When I grepped through <code>Base</code> to find instances of actually catching an exception and doing something based on the particular exception, I could only find a single one. Now, it's me scanning grep output in less, so I might have missed some instances, but it isn't common, and grepping through common packages finds a similar ratio of error handling code to other code. Julia folks don't care about error handling, so it's buggy and incomplete. I once asked about this and was told that it didn't matter that exceptions didn't work because you shouldn't use exceptions anyway -- you should use Erlang-style error handling where you kill the entire process on an error and build transactionally robust systems that can survive having random processes killed. Putting aside the difficulty of that in a language that doesn't have Erlang's support for that kind of thing, you can easily spin up a million processes in Erlang. In Julia, if you load just one or two commonly used packages, firing up a single new instance of Julia can easily take half a minute or a minute.
Spinning up a million independent instances at 30 seconds apiece would take roughly a year (closer to two years if startup takes a minute).</p> <p>Since we're broadly on the topic of APIs, error conditions aren't the only place where the <code>Base</code> API leaves something to be desired. Conventions are inconsistent in many ways, from function naming to the order of arguments. Some methods on collections take the collection as the first argument and some don't (e.g., replace takes the string first and the regex second, whereas match takes the regex first and the string second).</p> <p>More generally, <code>Base</code> APIs outside of the niche Julia targets often don't make sense. There are too many examples to list them all, but consider this one: the UDP interface throws an exception on a partial packet. This is really strange and also unhelpful. <a href="https://github.com/JuliaLang/julia/pull/5697">Multiple people pointed that out on this issue</a> but the devs decided to throw the exception anyway. The Julia implementers have great intuition when it comes to linear algebra and other areas they're familiar with. But they're only human and their intuition isn't so great in areas they're not familiar with. The problem is that they go with their intuition anyway, even in the face of comments about how that might not be the best idea.</p> <p>Another thing that's an issue for me is that I'm not in the audience the package manager was designed for. It's backed by git in a clever way that lets people do all sorts of things I never do. The result of all that is that it needs to do <code>git status</code> on each package when I run <code>Pkg.status()</code>, which makes it horribly slow; most other <code>Pkg</code> operations I care about are also slow for a similar reason.</p> <p>That might be ok if it had the feature I most wanted, which is the ability to specify exact versions of packages and have multiple, conflicting versions of packages installed<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">3</a></sup>. Because of all the regressions in the core language libraries and in packages, I often need to use an old version of some package to make some function actually work, which can require old versions of its dependencies. There's no non-hacky way to do this.</p> <p>Since I'm talking about issues where I care a lot more than the core devs, there's also benchmarking. The website shows off some impressive-sounding speedup numbers over other languages. But they're all benchmarks that are pretty far from real workloads. Even if you have a strong background in workload characterization and systems architecture (computer architecture, not software architecture), it's difficult to generalize from microbenchmark numbers to performance on anything resembling a real workload. From what I've heard, performance optimization of Julia is done from a larger set of similar benchmarks, which has problems for all of the same reasons. Julia is actually pretty fast, but this sort of ad hoc benchmarking basically guarantees that performance is being left on the table. Moreover, the benchmarks are written in a way that stacks the deck against other languages. People from other language communities often get rebuffed when they submit PRs to rewrite the benchmarks in their languages idiomatically.
The Julia website claims that &quot;all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom&quot;, and that making adjustments that are idiomatic for specific languages would be unfair. However, if you look at the Julia code, you'll notice that they're written to avoid doing any of a number of things that would crater performance. If you follow the mailing list, you'll see that there are quite a few intuitive ways to write Julia code that has very bad performance. The Julia benchmarks avoid those pitfalls, but the code for other languages isn't written with anywhere near that care; in fact, it's just the opposite.</p> <p>I've just listed a bunch of issues with Julia. I believe the canonical response for complaints about an open source project is, why don't you fix the bugs yourself, you entitled brat? Well, I tried that. For one thing, <a href="//danluu.com/everything-is-broken/#julia">there are so many bugs that I often don't file bugs, let alone fix them, because it's too much of an interruption</a>. But the bigger issue is the barriers to new contributors. I spent a few person-days fixing bugs (mostly debugging, not writing code) and that was almost enough to get me into the top 40 on GitHub's list of contributors. My point isn't that I contributed a lot. It's that I didn't, and that still put me right below the top 40.</p> <p>There's lots of friction that keeps people from contributing to Julia. The build is often broken or has failing tests. When I <a href="//danluu.com/broken-builds/">polled Travis CI stats</a> for languages on GitHub, Julia was basically tied for last in uptime. This isn't just a statistical curiosity: the first time I tried to fix something, the build was non-deterministically broken for the better part of a week because someone checked bad code directly into master without review. I spent maybe a week fixing a few things and then took a break. The next time I came back to fix something, tests were failing for a day because of another bad check-in and I gave up on the idea of fixing bugs. That tests fail so often is even worse than it sounds when you take into account the poor test coverage. And even when the build is &quot;working&quot;, it uses <a href="http://aegis.sourceforge.net/auug97.pdf">recursive makefiles</a>, and often fails with a message telling you that you need to run <code>make clean</code> and build again, which takes half an hour. When you do so, it often fails with a message telling you that you need to <code>make clean all</code> and build again, which takes an hour. And then there's some chance that will fail and you'll have to manually clean out <code>deps</code> and build again, which takes even longer. And that's the good case! The bad case is when the build fails non-deterministically. These are well-known problems that occur when using recursive make, described in <a href="http://aegis.sourceforge.net/auug97.pdf">Recursive Make Considered Harmful</a> circa 1997.</p> <p>And that's not even the biggest barrier to contributing to core Julia. The biggest barrier is that the vast majority of the core code is written with no markers of intent (comments, meaningful variable names, asserts, meaningful function names, explanations of short variable or function names, design docs, etc.). There's a tax on debugging and fixing bugs deep in core Julia because of all this.
I happen to know one of the Julia core contributors (presently listed as the #2 contributor by GitHub's ranking), and when I asked him about some of the more obtuse functions I was digging around in, he couldn't figure them out either. His suggestion was to ask the mailing list, but for the really obscure code in the core codebase, there are perhaps one to three people who actually understand the code, and if they're too busy to respond, you're out of luck.</p> <p>I don't mind spending my spare time working for free to fix other people's bugs. In fact, I do quite a bit of that and it turns out I often enjoy it. But I'm too old and crotchety to spend my leisure time deciphering code that even the core developers can't figure out because it's too obscure.</p> <p>None of this is to say that Julia is bad, but the concerns of the core team are pretty different from my concerns. This is the point in a complain-y blog post where you're supposed to suggest an alternative or make a call to action, but I don't know that either makes sense here. The purely technical problems, like slow load times or the package manager, are being fixed or will be fixed, so there's not much to say there. As for process problems, like not writing tests, not writing internal documentation, and checking unreviewed and sometimes breaking changes directly into master, well, that's “easy”<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup> to fix by adding a code review process that forces people to write tests and documentation for code, but that's not free.</p> <p>A small team of <a href="http://lwn.net/2000/0824/a/esr-sharing.php3">highly talented</a> developers who can basically hold all of the code in their collective heads can make great progress while eschewing anything that isn't just straight coding, at the cost of making it more difficult for other people to contribute. Is that worth it? It's hard to say. If you have to slow down <a href="https://github.com/JeffBezanson">Jeff</a>, <a href="https://github.com/Keno">Keno</a>, and the other super productive core contributors and all you get out of it is a couple of bums like me, that's probably not worth it. If you get a thousand people like me, that's probably worth it. The reality is in the ambiguous region in the middle, where it might or might not be worth it. The calculation is complicated by the fact that most of the benefit comes in the long run, whereas the costs are disproportionately paid in the short run. I once had <a href="http://morrow.ece.wisc.edu/">an engineering professor</a> who claimed that the answer to every engineering question is &quot;it depends&quot;. What should Julia do? It depends.</p> <h3 id="2022-update">2022 Update</h3> <p>This post originally mentioned how friendly the Julia community is, but I removed that since it didn't seem accurate in light of the responses. Many people were highly supportive, such as this Julia core developer:</p> <p><img src="images/julialang/julia_community.png"></p> <p>However, a number of people had some pretty nasty responses, and I don't think it's accurate to call a community friendly when the response is mostly positive but comes with a significant fraction of nasty responses; it doesn't take a lot of nastiness to make a group seem unfriendly.
Also, sentiment about this post has gotten more negative over time as <a href="https://yosefk.com/blog/people-can-read-their-managers-mind.html">communities tend to take their direction from the top</a> and a couple of the Julia co-creators have consistently been quite negative about this post.</p> <p>Now, onto the extent to which these issues have been fixed. The initial response from the co-founders was that the issues aren't really real and the post is badly mistaken. Over time, as some of the issues had some work done on them, the response changed to being that this post is out of date and the issues were all fixed, e.g., <a href="https://news.ycombinator.com/item?id=11073642">here's a response from one of the co-creators of Julia in 2016</a>:</p> <blockquote> <p>The main valid complaints in Dan's post were:</p> <ol> <li><p>Insufficient testing &amp; coverage. Code coverage is now at 84% of base Julia, from somewhere around 50% at the time he wrote this post. While you can always have more tests (and that is happening), I certainly don't think that this is a major complaint at this point.</p></li> <li><p>Package issues. Julia now has package precompilation so package loading is pretty fast. The package manager itself was rewritten to use libgit2, which has made it much faster, especially on Windows where shelling out is painfully slow.</p></li> <li><p>Travis uptime. This is much better. There was a specific mystery issue going on when Dan wrote that post. That issue has been fixed. We also do Windows CI on AppVeyor these days.</p></li> <li><p>Documentation of Julia internals. Given the quite comprehensive developer docs that now exist, it's hard to consider this unaddressed: <a href="http://julia.readthedocs.org/en/latest/devdocs/julia/">http://julia.readthedocs.org/en/latest/devdocs/julia/</a></p></li> </ol> <p>So the legitimate issues raised in that blog post are fixed.</p> </blockquote> <p>The top response to that is:</p> <blockquote> <blockquote> <p>The main <i>valid</i> complaints [...] the <i>legitimate</i> issues raised [...]</p> </blockquote> <p>This is a really passive-aggressive weaselly phrasing. I’d recommend reconsidering this type of tone in public discussion responses.</p> <p>Instead of suggesting that the other complaints were invalid or illegitimate, you could just not mention them at all, or at least use nicer language in brushing them aside. E.g. “... 
the main <i>actionable</i> complaints...” or “the main <i>technical</i> complaints ...”</p> </blockquote> <p>Putting aside issues of tone, I would say that the main issue from the post, the core team's attitude towards correctness, is both a legitimate issue and one that's unfixed, as we'll see when we look at how the specific issues mentioned as fixed are also unfixed.</p> <p>On correctness: if the correctness issues were fixed, we wouldn't continue to see showstopping bugs in Julia. But I have a couple of friends who continued to use Julia for years until they got fed up with correctness issues, and they sent me quite a few serious bugs that they personally ran into well after the 2016 comment about correctness being fixed, such as <a href="https://github.com/JuliaStats/Distributions.jl/issues/1241">getting an incorrect result when sampling from a distribution</a>, <a href="https://github.com/JuliaStats/StatsBase.jl/issues/642">sampling from an array produces incorrect results</a>, <a href="https://github.com/JuliaLang/julia/issues/39183">the product function, i.e., multiplication, produces incorrect results</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/119">quantile produces incorrect results</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/22">mean produces incorrect results</a>, <a href="https://github.com/JuliaLang/julia/issues/39379">incorrect array indexing</a>, <a href="https://github.com/JuliaLang/julia/issues/49422">divide produces incorrect results</a>, <a href="https://github.com/JuliaLang/julia/issues/49422">converting from float to int produces incorrect results</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/144">quantile produces incorrect results (again)</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/140">mean produces incorrect results (again)</a>, etc.</p> <p>There has been a continued flow of very serious bugs in Julia, and numerous other people have noted that they've run into serious bugs, such as <a href="https://kidger.site/thoughts/jax-vs-julia/">here</a>:</p> <blockquote> <p>I remember all too un-fondly a time in which one of my Julia models was failing to train. I spent multiple months on-and-off trying to get it working, trying every trick I could think of.</p> <p>Eventually – eventually! – I found the error: <b>Julia/Flux/Zygote was returning incorrect gradients</b>. After having spent so much energy wrestling with points 1 and 2 above, this was the point where I simply gave up. Two hours of development work later, I had the model successfully training… in PyTorch.</p> </blockquote> <p>And <a href="https://discourse.julialang.org/t/state-of-machine-learning-in-julia/74385/13">here</a></p> <blockquote> <p>I have been bit by incorrect gradient bugs in Zygote/ReverseDiff.jl. This cost me weeks of my life and has thoroughly shaken my confidence in the entire Julia AD landscape. [...] In all my years of working with PyTorch/TF/JAX I have not once encountered an incorrect gradient bug.</p> </blockquote> <p>And <a href="https://discourse.julialang.org/t/state-of-machine-learning-in-julia/74385/21">here</a></p> <blockquote> <p>Since I started working with Julia, I’ve had two bugs with Zygote which have slowed my work by several months. On a positive note, this has forced me to plunge into the code and learn a lot about the libraries I’m using.
But I’m finding myself in a situation where this is becoming too much, and I need to spend a lot of time debugging code instead of doing climate research.</p> </blockquote> <p>Despite this continued flow of bugs, public responses from the co-creators of Julia as well as a number of core community members generally claim, as they did for this post, that the issues will be fixed very soon (e.g., see <a href="https://news.ycombinator.com/item?id=31263516">the comments here by some core devs on a recent post, saying that all of the issues are being addressed and will be fixed soon</a>, or this <a href="https://news.ycombinator.com/item?id=22138071">2020 comment about how there were serious correctness issues in 2016 but things are now good</a>, etc.).</p> <p>Instead of taking the correctness issues or other issues seriously, the developers make statements like <a href="images/julialang/julia-testing-seriously.jpg">the following comments from a co-creator of Julia, passed to me by a friend of mine as my friend ran into yet another showstopping bug</a>:</p> <blockquote> <p>takes that Julia doesn't take testing seriously... I don't get it. the amount of time and energy we spend on testing the bejeezus out of everything. I literally don't know any other open source project as thoroughly end-to-end tested.</p> </blockquote> <p>The general claim is that, not only has Julia fixed its correctness issues, it's as good as it gets for correctness.</p> <p>On the package issues, the claim was that package load times were fixed by 2016. But this continues to be a major complaint of the people I know who use Julia, e.g., Jamie Brandon switched away from using Julia in 2022 because it took two minutes for his CSV parsing pipeline to run, where most of the time was package loading. Another example is that, in 2020, <a href="https://news.ycombinator.com/item?id=24746057">on a benchmark where the Julia developers bragged that Julia is very fast at the curious workload of loading the same CSV over and over again (in a loop, not by running a script repeatedly) compared to R</a>, some people noted that this was unrealistic due to Julia's very long package load times, saying that it takes 2 seconds to open the CSV package and then 104 seconds to load a plotting library. In 2022, in response to comments that package loading is painfully slow, a Julia developer responds to each issue saying each one will be fixed; on package loading, they say</p> <blockquote> <p>We're getting close to native code caching, and more: <a href="https://discourse.julialang.org/t/precompile-why/78770/8">https://discourse.julialang.org/t/precompile-why/78770/8</a>. As you'll also read, the difficulty is due to important tradeoffs Julia made with composability and aggressive specialization...but it's not fundamental and can be surmounted. Yes there's been some pain, but in the end hopefully we'll have something approximating the best of both worlds.</p> </blockquote> <p>It's curious that these problems could exist in 2020 and 2022 after a co-creator of Julia claimed, in 2016, that the package load time problems were fixed. But this is the general pattern of Julia PR that we see. For any particular criticism, the response is one of: the criticism is illegitimate, it will be fixed soon, or, when the criticism is more than a year old, it was already fixed. But we can see by looking at responses over time that the issues that are &quot;already fixed&quot; or &quot;will be fixed soon&quot; are, in fact, not fixed many years after claims that they were fixed.
It's true that there is progress on the issues, but it wasn't really fair to say that package load time issues were fixed and &quot;package loading is pretty fast&quot; when it takes nearly two minutes to load a CSV and use a standard plotting library (an equivalent to ggplot2) to generate a plot in Julia. And likewise for correctness issues when there's still a steady stream of issues in core libraries, Julia itself, and libraries that are named as part of the magic that makes Julia great (e.g., <a href="https://news.ycombinator.com/item?id=29684214">autodiff is frequently named as a huge advantage of Julia when it comes to features</a>, but then <a href="https://news.ycombinator.com/item?id=31268796">when it comes to bugs, those bugs don't count because they're not in Julia itself</a>; that last comment, of course, has a comment from a Julia developer noting that all of the issues will be addressed soon).</p> <p>There's a sleight of hand here where the reflexive response from a number of the co-creators as well as core developers of Julia is to brush off any particular issue with a comment that sounds plausible if read on HN or Twitter by <a href="https://twitter.com/danluu/status/1524949021252411392">someone who doesn't know people who've used Julia</a>. This makes for good PR since, with an emerging language like Julia, most potential users won't have real connections who've used it seriously and the reflexive comments sound plausible if you don't look into them.</p> <p>I use the word reflexive here because it seems that some co-creators of Julia respond to any criticism with a rebuttal, such as <a href="https://twitter.com/danluu/status/1526256307736412160">here</a>, where a core developer responds to a post about showstopping bugs by saying that having bugs is actually good, and <a href="https://twitter.com/Viral_B_Shah/status/1315425480548462592">here</a>, where in response to my noting that some people had commented that they were tired of misleading benchmarking practices by Julia developers, a co-creator of Julia drops in to say &quot;I would like to let it be known for the record that I do not agree with your statements about Julia in this thread.&quot; But my statements in the thread were merely that there existed comments like <a href="https://news.ycombinator.com/item?id=24748582">https://news.ycombinator.com/item?id=24748582</a>. It's quite nonsensical to state, for the record, a disagreement that those kinds of comments exist because they clearly do exist.</p> <p>Another example of a reflexive response is <a href="https://news.ycombinator.com/item?id=31401814">this 2022 thread</a>, where someone who tried Julia but stopped using it for serious work after running into one too many bugs that took weeks to debug suggests that the Julia ecosystem needs a rewrite because the attitude and culture in the community results in a large number of correctness issues. A core Julia developer &quot;rebuts&quot; the comment by saying that things are re-written all the time and gives examples of things that were re-written for performance reasons. Performance re-writes are, famously, a great way to introduce bugs, making the &quot;rebuttal&quot; actually a kind of anti-rebuttal.
But, as is typical for many core Julia developers, the person saw that there was an issue (not enough re-writes) and reflexively responded with a denial: that there are enough re-writes.</p> <p>These reflexive responses are pretty obviously bogus if you spend a bit of time reading them and looking at the historical context, but this kind of &quot;deny deny deny&quot; response is generally highly effective PR and has been effective for Julia, so it's understandable that it's done. For example, <a href="https://news.ycombinator.com/item?id=22138071">on this 2020 comment that belies the 2016 comment about correctness being fixed by saying that there were serious issues in 2016 but things are &quot;now&quot; good in 2020</a>, someone responds &quot;Thank you, this is very heartening.&quot;, since it relieves them of their concern that there are still issues. Of course, you can see basically the same discussion play out in 2022, but people reading the discussion in 2022 generally won't go back to see that this same discussion happened in 2020, 2016, 2013, etc.</p> <p>On the build uptime, the claim is that the issue causing uptime issues was fixed, but my comment there was on the attitude of brushing off the issue for an extended period of time with &quot;works on my machine&quot;. As we can see from the examples above, the meta-issue of brushing off issues continued.</p> <p>On the last issue that was claimed to be legitimate, which was also claimed to be fixed, documentation, this is still a common complaint from the community, e.g., <a href="https://news.ycombinator.com/item?id=17730094">here in 2018</a>, 2 years after it was claimed that documentation was fixed in 2016, <a href="https://news.ycombinator.com/item?id=20589167">here in 2019</a>, <a href="https://news.ycombinator.com/item?id=31398705">here in 2022</a>, etc. In a much lengthier complaint, one person notes</p> <blockquote> <p>The biggest issue, and one they seem unwilling to really address, is that actually using the type system to do anything cool requires you to rely entirely on documentation which may or may not exist (or be up-to-date).</p> </blockquote> <p>And another echoes this sentiment with</p> <blockquote> <p>This is truly an important issue.</p> </blockquote> <p>Of course, <a href="https://news.ycombinator.com/item?id=17741797">there's a response saying this will be fixed soon</a>, as is generally the case. And yet, you can still find people complaining about the documentation.</p> <p>If you go back and read discussions on Julia correctness issues, three more common defenses are that everything has bugs, bugs are quickly fixed, and testing is actually great because X is well tested. You can see examples of &quot;everything has bugs&quot; <a href="https://groups.google.com/g/julia-users/c/GyH8nhExY9I/m/0BznqtD5VJgJ">here in 2014</a> as well as <a href="https://news.ycombinator.com/item?id=31397499">here in 2022</a> (and in between as well, of course), as if all non-zero bug rates are the same, even though a number of developers have noted that they stopped using Julia for work and switched to other ecosystems because, while everything has bugs, all non-zero numbers are, of course, not the same.
Bugs getting fixed quickly is sometimes not true (e.g., many of the bugs linked in this post have been open for quite a while and are still open) and is also <a href="https://twitter.com/danluu/status/1270987200545435648">a classic defense that's used to distract from the issue of practices that directly lead to the creation of an unusually large number of new bugs</a>. As noted in a number of links, above, it can take weeks or months to debug correctness issues since many of the correctness issues are of the form &quot;silently return incorrect results&quot; and, as noted above, I ran into a bug where exceptions non-deterministically weren't caught when they should have been. It may be true that, in some cases, these sorts of bugs are quickly fixed when found, but those issues still cost users a lot of time to track down. We saw an example of &quot;testing is actually great because X is well tested&quot; <a href="https://news.ycombinator.com/item?id=31401814">above</a>. If you'd like a more recent example, <a href="https://news.ycombinator.com/item?id=31397499">here's one from 2022</a> where, in response to someone saying that they ran into more correctness bugs in Julia than in any other ecosystem they've used in their decades of programming, a core Julia dev responds by saying that a number of things are very well tested in Julia, such as libuv, as if testing some components well is a talisman that can be wielded against bugs in other components. This is obviously absurd, in that it's like saying that a building with an open door can't be insecure because it also has very sturdy walls, but it's a common defense used by core Julia developers. And, of course, there's also just straight-up FUD about writing about Julia. For example, in 2022, on Yuri Vishnevsky's post on Julia bugs, a co-creator of Julia said <a href="https://news.ycombinator.com/item?id=31881164">&quot;Yuri's criticism was not that Julia has correctness bugs as a language, but that certain libraries when composed with common operations had bugs (many of which are now addressed).&quot;</a>. This is, of course, completely untrue. In conversations with Yuri, he noted to me that he specifically included examples of core language and core library bugs because those happened so frequently, and it was frustrating that core Julia people pretended those didn't exist and that their FUD seemed to work since people would often respond as if those comments were true. As mentioned above, this kind of flat denial of simple matters of fact is highly effective, so it's understandable that people employ it but, personally, it's not to my taste.</p> <p>To be clear, I don't inherently have a problem with software being buggy. As I've mentioned, <a href="https://twitter.com/danluu/status/1487232684363370496">I think move fast and break things can be a good value</a> because it clearly states that velocity is more valued than correctness. Comments from the creators of Julia as well as core developers broadcast that Julia is not just highly reliable and correct, but actually world class (&quot;the amount of time and energy we spend on testing the bejeezus out of everything. I literally don't know any other open source project as thoroughly end-to-end tested.&quot;, etc.).
But, by revealed preference, we can see that Julia's values are &quot;move fast and break things&quot;.</p> <h3 id="appendix-blog-posts-on-julia">Appendix: blog posts on Julia</h3> <ul> <li>2014: <a href="julialang/">this post</a></li> <li>2016: <a href="https://www.zverovich.net/2016/05/13/giving-up-on-julia.html">Victor Zverovich</a> <ul> <li>Julia brags about high performance in unrepresentative microbenchmarks but often has poor performance in practice</li> <li>Complex codebase leading to many bugs</li> </ul></li> <li>2022: <a href="https://weissmann.pm/julialang/">Volker Weissman</a> <ul> <li>Poor documentation</li> <li>Unclear / confusing error messages</li> <li>Benchmarks claim good performance but benchmarks are of unrealistic workloads and performance is often poor in practice</li> </ul></li> <li>2022: <a href="https://kidger.site/thoughts/jax-vs-julia/">Patrick Kidger</a> comparison of Julia to JAX and PyTorch <ul> <li>Poor documentation</li> <li>Correctness issues in widely relied on, important, libraries</li> <li>Inscrutable error messages</li> <li>Poor code quality, leading to bugs and other issues</li> </ul></li> <li>2022: <a href="https://yuri.is/not-julia/">Yuri Vishnevsky</a> <ul> <li>Many very serious correctness bugs in both the language runtime and core libraries that are heavily relied on</li> <li>Culture / attitude has persistently caused a large number of bugs, &quot;Julia and its packages have the highest rate of serious correctness bugs of any programming system I’ve used, and I started programming with Visual Basic 6 in the mid-2000s&quot; <ul> <li>Stream of serious bugs is in stark contrast to comments from core Julia developers and Julia co-creators saying that Julia is very solid and has great correctness properties</li> </ul></li> </ul></li> </ul> <p><small> Thanks (or anti-thanks) to Leah Hanson for pestering me to write this for the past few months. It's not the kind of thing I'd normally write, but the concerns here got repeatedly brushed off when I brought them up in private. For example, when I brought up testing, I was told that Julia is better tested than most projects. While that's true in some technical sense (the median project on GitHub probably has zero tests, so any non-zero number of tests is above average), I didn't find that to be a meaningful rebuttal (as opposed to a reply that Julia is still expected to be mostly untested because it's in an alpha state). After getting a similar response on a wide array of topics I stopped using Julia. Normally that would be that, but Leah really wanted these concerns to stop getting ignored, so I wrote this up.</p> <p>Also, thanks to Leah Hanson, Julia Evans, Joe Wilder, Eddie V, David Andrzejewski, and Yuri Vishnevsky for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:L">What I mean here is that you can have lots of bugs pop up despite having 100% line coverage. It's not that line coverage is bad, but that it's not sufficient, not even close. And because it's not sufficient, it's a pretty bad sign when you not only don't have 100% line coverage, you don't even have 100% function coverage. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:R">I'm going to use the word care a few times, and when I do I mean something specific. When I say care, I mean that in the colloquial <a href="http://en.wikipedia.org/wiki/Revealed_preference">revealed preference</a> sense of the word. 
There's another sense of the word, in which everyone cares about testing and error handling, the same way every politician cares about family values. But that kind of caring isn't linked to what I care about, which involves concrete actions. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:T">It's technically possible to have multiple versions installed, but the process is a total hack. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:E">By &quot;easy&quot;, I mean extremely hard. Technical fixes can be easy, but process and cultural fixes are almost always hard. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Integer overflow checking cost integer-overflow/ Wed, 17 Dec 2014 00:00:00 +0000 integer-overflow/ <p>How much overhead should we expect from enabling integer overflow checks? Using a compiler flag or built-in intrinsics, we should be able to do the check with a conditional branch that branches based on the overflow flag that <code>add</code> and <code>sub</code> set. Code that looks like</p> <pre><code>add %esi, %edi
</code></pre> <p>should turn into something like</p> <pre><code>add %esi, %edi
jo  &lt;handle_overflow&gt;
</code></pre> <p>Assuming that branch is always correctly predicted (which should be the case for most code), the costs of the branch are the cost of executing that correctly predicted not-taken branch, the pollution the branch causes in the branch history table, and the cost of decoding the branch (on x86, <code>jo</code> and <code>jno</code> don't fuse with <code>add</code> or <code>sub</code>, which means that on the fast path, the branch will take up one of the 4 opcodes that can come from the decoded instruction cache per cycle). That's probably less than a 2x penalty per <code>add</code> or <code>sub</code> on front-end limited code in the worst case (which might happen in a tightly optimized loop, but should be rare in general), plus some nebulous penalty from branch history pollution which is really difficult to measure in microbenchmarks. Overall, we can use 2x as a pessimistic guess for the total penalty.</p> <p>2x sounds like a lot, but how much time do applications spend adding and subtracting? If we look at the most commonly used benchmark of “workstation” integer workloads, <a href="http://www.spec.org/cpu2006/">SPECint</a>, the composition is maybe 40% load/store ops, 10% branches, and 50% other operations. Of the 50% “other” operations, maybe 30% of those are integer add/sub ops. If we guesstimate that load/store ops are 10x as expensive as add/sub ops, and other ops are as expensive as add/sub, a 2x penalty on add/sub should result in a <code>(40*10+10+50+15) / (40*10+10+50) = 3%</code> penalty. That the penalty for a branch is 2x, that add/sub ops are only 10x faster than load/store ops, and that add/sub ops aren't faster than other &quot;other&quot; ops are all pessimistic assumptions, so this estimate should be on the high end for most workloads.</p> <p>John Regehr, who's done serious analysis on integer overflow checks, estimates that the penalty should be <a href="http://blog.regehr.org/archives/1154">about 5%</a>, which is in the same ballpark as our napkin sketch estimate.</p> <p>A SPEC license costs $800, so let's benchmark bzip2 (which is a component of SPECint) instead of paying $800 for SPECint. Compiling bzip2 with <code>clang -O3</code> vs.
<code>clang -O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow</code> (which prints out a warning on overflow) vs. <code>-fsanitize-undefined-trap-on-error</code> with undefined overflow checks (which causes a crash on an undefined overflow), we get the following results on compressing and decompressing 1GB of code and binaries that happened to be lying around on my machine.</p> <table> <thead> <tr> <th>options</th> <th>zip (s)</th> <th>unzip (s)</th> <th>zip (ratio)</th> <th>unzip (ratio)</th> </tr> </thead> <tbody> <tr> <td>normal</td> <td>93</td> <td>45</td> <td>1.0</td> <td>1.0</td> </tr> <tr> <td>fsan</td> <td>119</td> <td>49</td> <td>1.28</td> <td>1.09</td> </tr> <tr> <td>fsan ud</td> <td>94</td> <td>45</td> <td>1.01</td> <td>1.00</td> </tr> </tbody> </table> <p>In the table, ratio is the relative ratio of the run times, not the compression ratio. The difference between <code>fsan ud, unzip</code> and <code>normal, unzip</code> isn't actually 0, but it rounds to 0 if we measure in whole seconds. If we enable good error messages, decompression doesn't slow down all that much (45s v. 49s), but compression is a lot slower (93s v. 119s). The penalty for integer overflow checking is 28% for compression and 9% for decompression if we print out nice diagnostics, but almost nothing if we don't. How is that possible? Bzip2 normally has a couple of unsigned integer overflows. If I patch the code to remove those, so that the diagnostic printing code path is never executed, the checks still cause a large performance hit.</p> <p>Let's check out the penalty when we just do some adds with something like</p> <pre><code>for (int i = 0; i &lt; n; ++i) {
  sum += a[i];
}
</code></pre> <p>On my machine (a 3.4 GHz Sandy Bridge), this turns out to be about 6x slower with <code>-fsanitize=signed-integer-overflow,unsigned-integer-overflow</code>. Looking at the disassembly, the normal version uses SSE adds, whereas the fsanitize version uses normal adds. Ok, 6x sounds plausible for unchecked SSE adds v. checked adds.</p> <p>But if I try different permutations of the same loop that don't allow the compiler to emit SSE instructions for the unchecked version, I still get a 4x-6x performance penalty for versions compiled with fsanitize. Since there are a lot of different optimizations in play, including loop unrolling, let's take a look at a simple function that does a single add to get a better idea of what's going on.</p> <p>Here's the disassembly for a function that adds two ints, first compiled with <code>-O3</code> and then compiled with <code>-O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow</code>.</p> <pre><code>0000000000400530 &lt;single_add&gt;:
  400530: 01 f7     add    %esi,%edi
  400532: 89 f8     mov    %edi,%eax
  400534: c3        retq
</code></pre> <p>The compiler does a reasonable job on the <code>-O3</code> version. Per the standard AMD64 calling convention, the arguments are passed in via the <code>esi</code> and <code>edi</code> registers, and passed out via the <code>eax</code> register.
There's some overhead over an inlined <code>add</code> instruction because we have to move the result to <code>eax</code> and then return from the function call, but considering that it's a function call, it's a totally reasonable implementation.</p> <pre><code>000000000041df90 &lt;single_add&gt;:
  41df90: 53                   push   %rbx
  41df91: 89 fb                mov    %edi,%ebx
  41df93: 01 f3                add    %esi,%ebx
  41df95: 70 04                jo     41df9b &lt;single_add+0xb&gt;
  41df97: 89 d8                mov    %ebx,%eax
  41df99: 5b                   pop    %rbx
  41df9a: c3                   retq
  41df9b: 89 f8                mov    %edi,%eax
  41df9d: 89 f1                mov    %esi,%ecx
  41df9f: bf a0 89 62 00       mov    $0x6289a0,%edi
  41dfa4: 48 89 c6             mov    %rax,%rsi
  41dfa7: 48 89 ca             mov    %rcx,%rdx
  41dfaa: e8 91 13 00 00       callq  41f340 &lt;__ubsan_handle_add_overflow&gt;
  41dfaf: eb e6                jmp    41df97 &lt;single_add+0x7&gt;
</code></pre> <p>The compiler does not do a reasonable job on the <code>-O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow</code> version. Optimization wizard Nathan Kurz had this to say about clang's output:</p> <blockquote> <p>That's awful (although not atypical) compiler generated code. For some reason the compiler decided that it wanted to use %ebx as the destination of the add. Once it did this, it has to do the rest. The question would be why it didn't use a scratch register, why it felt it needed to do the move at all, and what can be done to prevent it from doing so in the future. As you probably know, %ebx is a 'callee save' register, meaning that it must have the same value when the function returns --- thus the push and pop. Had the compiler just done the add without the additional mov, leaving the input in %edi/%esi as it was passed (and as done in the non-checked version), this wouldn't be necessary. I'd guess that it's a residue of some earlier optimization pass, but somehow the ghost of %ebx remained.</p> </blockquote> <p>However, adding <code>-fsanitize-undefined-trap-on-error</code> changes this to</p> <pre><code>0000000000400530 &lt;single_add&gt;:
  400530: 01 f7     add    %esi,%edi
  400532: 70 03     jo     400537 &lt;single_add+0x7&gt;
  400534: 89 f8     mov    %edi,%eax
  400536: c3        retq
  400537: 0f 0b     ud2
</code></pre> <p>Although this is a tiny, contrived example, we can see a variety of mis-optimizations in other code compiled with options that allow fsanitize to print out diagnostics.</p> <p>While a better C compiler could do better, in theory, gcc 4.8.2 doesn't do better than clang 3.4 here. For one thing, gcc's <code>-ftrapv</code> only checks signed overflow. Worse yet, it doesn't work, <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35412">and this bug on ftrapv has been open since 2008</a>. Despite doing fewer checks and not doing them correctly, gcc's <code>-ftrapv</code> slows things down about as much as clang's <code>-fsanitize=signed-integer-overflow,unsigned-integer-overflow</code> on bzip2, and substantially more than <code>-fsanitize=signed-integer-overflow</code>.</p> <p>Summing up, integer overflow checks ought to cost a few percent on typical integer-heavy workloads, and they do, as long as you don't want nice error messages. The current mechanism that produces nice error messages somehow causes optimizations to get screwed up in a lot of cases<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">1</a></sup>.</p> <h3 id="update">Update</h3> <p>On clang 3.8.0 and after, and gcc 5 and after, register allocation seems to work as expected (although you may need to pass <code>-fno-sanitize-recover</code>).
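</p> <p>As an aside (this isn't something from the benchmarks above), if you only want to check a handful of operations rather than instrumenting the whole program, gcc 5 and later and recent versions of clang also provide generic checked-arithmetic builtins. A minimal sketch of what that looks like:</p> <pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

// Checked addition using the generic __builtin_add_overflow builtin
// (gcc 5+ and recent clang). The builtin does the add, stores the
// wrapped result in *result, and returns true if the add overflowed,
// so the fast path can compile down to an add followed by a
// not-taken branch, much like the add / jo sequence discussed above.
int checked_add(int a, int b) {
  int result;
  if (__builtin_add_overflow(a, b, &amp;result)) {
    fprintf(stderr, &quot;overflow adding %d and %d\n&quot;, a, b);
    abort();
  }
  return result;
}
</code></pre> <p>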
I haven't gone back and re-run my benchmarks across different versions of clang and gcc, but I'd like to do that when I get some time.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><small> Thanks to Nathan Kurz for comments on this topic, including, but not limited to, the quote that's attributed to him, and to Stan Schwertly, Nick Bergson-Shilcock, Scott Feeney, Marek Majkowski, Adrian and Juan Carlos Borras for typo corrections and suggestions for clarification. Also, huge thanks to Richard Smith, who pointed out the <code>-fsanitize-undefined-trap-on-error</code> option to me. This post was updated with results for that option after Richard's comment. Also, thanks to Filipe Cabecinhas for noticing that clang fixed this behavior in clang 3.8 (released approximately 1.5 years after this post).</p> <p>John Regehr has some <a href="https://plus.google.com/105487075784331805819/posts/dyKKLrW8jXb">more comments here</a> on why clang's implementation of integer overflow checking isn't fast (yet).</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:H"><p>People often <a href="http://yosefk.com/blog/the-high-level-cpu-challenge.html">call for hardware support</a> for integer overflow checking above and beyond the existing overflow flag. That would add expense and complexity to every chip made to get, at most, a few percent extra performance in the best case, on optimized code. That might be worth it -- there are lots of features Intel adds that only speed up a subset of applications by a few percent.</p> <p>This is often described as a chicken and egg problem; people would use overflow checks if checks weren't so slow, and hardware support is necessary to make the checks fast. But there's already hardware support to get good-enough performance for the vast majority of applications. It's just not taken advantage of because people don't actually care about this problem.</p> <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> Malloc tutorial malloc-tutorial/ Thu, 04 Dec 2014 00:00:00 +0000 malloc-tutorial/ <p>Let's write a <a href="http://man7.org/linux/man-pages/man3/malloc.3.html">malloc</a> and see how it works with existing programs!</p> <p>This is basically an expanded explanation of what I did after reading <a href="http://www.inf.udec.cl/~leo/Malloc_tutorial.pdf">this tutorial</a> by Marwan Burelle and then sitting down and trying to write my own implementation, so the steps are going to be fairly similar. The main implementation differences are that my version is simpler and more vulnerable to memory fragmentation. 
In terms of exposition, my style is a lot more casual.</p> <p>This tutorial is going to assume that you know what pointers are, and that you know enough C to know that <code>*ptr</code> dereferences a pointer, <code>ptr-&gt;foo</code> means <code>(*ptr).foo</code>, that malloc is used to <a href="http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/">dynamically allocate space</a>, and that you're familiar with the concept of a linked list. For a basic intro to C, <a href="http://www.amazon.com/gp/product/0673999866/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0673999866&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=5C3DNUKAQELP2KUL">Pointers on C</a> is one of my favorite books. If you want to look at all of this code at once, it's available <a href="https://github.com/danluu/malloc-tutorial/blob/master/malloc.c">here</a>.</p> <p>Preliminaries aside, malloc's function signature is</p> <pre><code>void *malloc(size_t size);
</code></pre> <p>It takes as input a number of bytes and returns a pointer to a block of memory of that size.</p> <p>There are a number of ways we can implement this. We're going to arbitrarily choose to use <a href="http://man7.org/linux/man-pages/man2/sbrk.2.html">sbrk</a>. The OS reserves stack and heap space for processes and sbrk lets us manipulate the heap. <code>sbrk(0)</code> returns a pointer to the current top of the heap. <code>sbrk(foo)</code> increments the heap size by <code>foo</code> and returns a pointer to the previous top of the heap.</p> <p><img src="images/malloc-tutorial/heap.png" alt="Diagram of linux memory layout, courtesy of Gustavo Duarte." width="194" height="301"></p> <p>If we want to implement a really simple malloc, we can do something like</p> <pre><code>#include &lt;assert.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;

void *malloc(size_t size) {
  void *p = sbrk(0);
  void *request = sbrk(size);
  if (request == (void*) -1) {
    return NULL; // sbrk failed.
  } else {
    assert(p == request); // Not thread safe.
    return p;
  }
}
</code></pre> <p>When a program asks malloc for space, malloc asks sbrk to increment the heap size and returns a pointer to the start of the new region on the heap. This is missing a technicality, that <code>malloc(0)</code> should either return <code>NULL</code> or another pointer that can be passed to free without causing havoc, but it basically works.</p> <p>But speaking of free, how does free work? Free's prototype is</p> <pre><code>void free(void *ptr);
</code></pre> <p>When free is passed a pointer that was previously returned from malloc, it's supposed to free the space. But given a pointer to something allocated by our malloc, we have no idea what size block is associated with it. Where do we store that? If we had a working malloc, we could malloc some space and store it there, but we're going to run into trouble if we need to call malloc to reserve space each time we call malloc to reserve space.</p> <p>A common trick to work around this is to store meta-information about a memory region in some space that we squirrel away just below the pointer that we return. Say the top of the heap is currently at <code>0x1000</code> and we ask for <code>0x400</code> bytes. Our current malloc will request <code>0x400</code> bytes from <code>sbrk</code> and return a pointer to <code>0x1000</code>.
If we instead save, say, <code>0x10</code> bytes to store information about the block, our malloc would request <code>0x410</code> bytes from <code>sbrk</code> and return a pointer to <code>0x1010</code>, hiding our <code>0x10</code> byte block of meta-information from the code that's calling malloc.</p> <p>That lets us free a block, but then what? The heap region we get from the OS has to be contiguous, so we can't return a block of memory in the middle to the OS. Even if we were willing to copy everything above the newly freed region down to fill the hole, so we could return space at the end, there's no way to notify all of the code with pointers to the heap that those pointers need to be adjusted.</p> <p>Instead, we can mark that the block has been freed without returning it to the OS, so that future calls to malloc can re-use the block. But to do that we'll need to be able to access the meta information for each block. There are a lot of possible solutions to that. We'll arbitrarily choose to use a singly linked list for simplicity.</p> <p>So, for each block, we'll want to have something like</p> <pre><code>struct block_meta {
  size_t size;
  struct block_meta *next;
  int free;
  int magic; // For debugging only. TODO: remove this in non-debug mode.
};

#define META_SIZE sizeof(struct block_meta)
</code></pre> <p>We need to know the size of the block, whether or not it's free, and what the next block is. There's a magic number here for debugging purposes, but it's not really necessary; we'll set it to arbitrary values, which will let us easily see which code modified the struct last.</p> <p>We'll also need a head for our linked list:</p> <pre><code>void *global_base = NULL;
</code></pre> <p>For our malloc, we'll want to re-use free space if possible, allocating space when we can't re-use existing space. Given that we have this linked list structure, checking if we have a free block and returning it is straightforward. When we get a request of some size, we iterate through our linked list to see if there's a free block that's large enough.</p> <pre><code>struct block_meta *find_free_block(struct block_meta **last, size_t size) {
  struct block_meta *current = global_base;
  while (current &amp;&amp; !(current-&gt;free &amp;&amp; current-&gt;size &gt;= size)) {
    *last = current;
    current = current-&gt;next;
  }
  return current;
}
</code></pre> <p>If we don't find a free block, we'll have to request space from the OS using sbrk and add our new block to the end of the linked list.</p> <pre><code>struct block_meta *request_space(struct block_meta* last, size_t size) {
  struct block_meta *block;
  block = sbrk(0);
  void *request = sbrk(size + META_SIZE);
  assert((void*)block == request); // Not thread safe.
  if (request == (void*) -1) {
    return NULL; // sbrk failed.
  }

  if (last) { // NULL on first request.
    last-&gt;next = block;
  }
  block-&gt;size = size;
  block-&gt;next = NULL;
  block-&gt;free = 0;
  block-&gt;magic = 0x12345678;
  return block;
}
</code></pre> <p>As with our original implementation, we request space using <code>sbrk</code>. But we add a bit of extra space to store our struct, and then set the fields of the struct appropriately.</p> <p>Now that we have helper functions to check if we have existing free space and to request space, our malloc is simple. If our global base pointer is <code>NULL</code>, we need to request space and set the base pointer to our new block. If it's not <code>NULL</code>, we check to see if we can re-use any existing space.
If we can, then we do; if we can't, then we request space and use the new space.</p> <pre><code>void *malloc(size_t size) {
  struct block_meta *block;
  // TODO: align size?

  if (size &lt;= 0) {
    return NULL;
  }

  if (!global_base) { // First call.
    block = request_space(NULL, size);
    if (!block) {
      return NULL;
    }
    global_base = block;
  } else {
    struct block_meta *last = global_base;
    block = find_free_block(&amp;last, size);
    if (!block) { // Failed to find free block.
      block = request_space(last, size);
      if (!block) {
        return NULL;
      }
    } else { // Found free block
      // TODO: consider splitting block here.
      block-&gt;free = 0;
      block-&gt;magic = 0x77777777;
    }
  }

  return(block+1);
}
</code></pre> <p>For anyone who isn't familiar with C, we return block+1 because we want to return a pointer to the region after block_meta. Since block is a pointer of type <code>struct block_meta</code>, <code>+1</code> increments the address by one <code>sizeof(struct block_meta)</code>.</p> <p>If we just wanted a malloc without a free, we could have used our original, much simpler malloc. So let's write free! The main thing free needs to do is set <code>-&gt;free</code>.</p> <p>Because we'll need to get the address of our struct in multiple places in our code, let's define this function.</p> <pre><code>struct block_meta *get_block_ptr(void *ptr) {
  return (struct block_meta*)ptr - 1;
}
</code></pre> <p>Now that we have that, here's free:</p> <pre><code>void free(void *ptr) {
  if (!ptr) {
    return;
  }

  // TODO: consider merging blocks once splitting blocks is implemented.
  struct block_meta* block_ptr = get_block_ptr(ptr);
  assert(block_ptr-&gt;free == 0);
  assert(block_ptr-&gt;magic == 0x77777777 || block_ptr-&gt;magic == 0x12345678);
  block_ptr-&gt;free = 1;
  block_ptr-&gt;magic = 0x55555555;
}
</code></pre> <p>In addition to setting <code>-&gt;free</code>, we check for a NULL ptr, since it's valid to call free with NULL. Since free shouldn't be called on arbitrary addresses or on blocks that are already freed, we can assert that those things never happen.</p> <p>You never really need to assert anything, but it often makes debugging a lot easier. In fact, when I wrote this code, I had a bug that would have resulted in silent data corruption if these asserts weren't there. Instead, the code failed at the assert, which made it trivial to debug.</p> <p>Now that we've got malloc and free, we can write programs using our custom memory allocator! But before we can drop our allocator into existing code, we'll need to implement a couple more common functions, realloc and calloc. Calloc is just malloc that initializes the memory to 0, so let's look at realloc first. Realloc is supposed to adjust the size of a block of memory that we've gotten from malloc, calloc, or realloc.</p> <p>Realloc's function prototype is</p> <pre><code>void *realloc(void *ptr, size_t size)
</code></pre> <p>If we pass realloc a NULL pointer, it's supposed to act just like malloc. If we pass it a previously malloced pointer, it should free up space if the size is smaller than the previous size, and allocate more space and copy the existing data over if the size is larger than the previous size.</p> <p>Everything will still work if we don't resize when the size is decreased and we don't free anything, but we absolutely have to allocate more space if the size is increased, so let's start with that.</p> <pre><code>void *realloc(void *ptr, size_t size) {
  if (!ptr) {
    // NULL ptr. realloc should act like malloc.
    return malloc(size);
  }

  struct block_meta* block_ptr = get_block_ptr(ptr);
  if (block_ptr-&gt;size &gt;= size) {
    // We have enough space. Could free some once we implement split.
    return ptr;
  }

  // Need to really realloc. Malloc new space and free old space.
  // Then copy old data to new space.
  void *new_ptr;
  new_ptr = malloc(size);
  if (!new_ptr) {
    return NULL; // TODO: set errno on failure.
  }
  memcpy(new_ptr, ptr, block_ptr-&gt;size);
  free(ptr);
  return new_ptr;
}
</code></pre> <p>And now for calloc, which just clears the memory before returning a pointer.</p> <pre><code>void *calloc(size_t nelem, size_t elsize) {
  size_t size = nelem * elsize; // TODO: check for overflow.
  void *ptr = malloc(size);
  memset(ptr, 0, size);
  return ptr;
}
</code></pre> <p>Note that this doesn't check for overflow in <code>nelem * elsize</code>, which is actually required by the spec. All of the code here is just enough to get something that kinda sorta works.</p> <p>Now that we have something that kinda works, we can use our malloc with existing programs (and we don't even need to recompile the programs)!</p> <p>First, we need to compile our code. On linux, something like</p> <pre><code>clang -O0 -g -W -Wall -Wextra -shared -fPIC malloc.c -o malloc.so
</code></pre> <p>should work.</p> <p><code>-g</code> adds debug symbols, so we can look at our code with <code>gdb</code> or <code>lldb</code>. <code>-O0</code> will help with debugging, by preventing individual variables from getting optimized out. <code>-W -Wall -Wextra</code> adds extra warnings. <code>-shared -fPIC</code> will let us dynamically link our code, which is what lets us <a href="http://jvns.ca/blog/2014/11/27/ld-preload-is-super-fun-and-easy/">use our code with existing binaries</a>!</p> <p>On macs, we'll want something like</p> <pre><code>clang -O0 -g -W -Wall -Wextra -dynamiclib malloc.c -o malloc.dylib
</code></pre> <p>Note that <code>sbrk</code> is deprecated on recent versions of OS X. Apple uses an unorthodox definition of deprecated -- some deprecated syscalls are badly broken. I didn't really test this on a Mac, so it's possible that this will cause weird failures or just not work on a mac.</p> <p>Now, to get a binary to use our malloc on linux, we'll need to set the <code>LD_PRELOAD</code> environment variable. If you're using bash, you can do that with</p> <pre><code>export LD_PRELOAD=/absolute/path/here/malloc.so
</code></pre> <p>If you've got a mac, you'll want</p> <pre><code>export DYLD_INSERT_LIBRARIES=/absolute/path/here/malloc.dylib
</code></pre> <p>If everything works, you can run some arbitrary binary and it will run as normal (except that it will be a bit slower).</p> <pre><code>$ ls
Makefile malloc.c malloc.so README.md test test-0 test-1 test-2 test-3 test-4
</code></pre> <p>If there's a bug, you might get something like</p> <pre><code>$ ls
Segmentation fault (core dumped)
</code></pre> <h4 id="debugging">Debugging</h4> <p>Let's talk about debugging! If you're familiar with using a debugger to set breakpoints, inspect memory, and step through code, you can skip this section and go straight to <a href="#exercises">the exercises</a>.</p> <p>This section assumes you can figure out how to install gdb on your system. If you're on a mac, you may want to just use lldb and translate the commands appropriately. Since I don't know what bugs you might run into, I'm going to introduce a couple of bugs and show how I'd track them down.</p> <p>First, we need to figure out how to run gdb without having it segfault.
If ls segfaults, and we try to run <code>gdb ls</code>, gdb is almost certainly going to segfault, too. We could write a wrapper to do this, but gdb also lets us do this directly. If we start gdb and then run <code>set environment LD_PRELOAD=./malloc.so</code> before running the program, <code>LD_PRELOAD</code> will work as normal.</p> <pre><code>$ gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bd7dbd in free (ptr=0x0) at malloc.c:113
113       assert(block_ptr-&gt;free == 0);
</code></pre> <p>As expected, we get a segfault. We can look around with <code>list</code> to see the code near the segfault.</p> <pre><code>(gdb) list
108     }
109
110     void free(void *ptr) {
111       // TODO: consider merging blocks once splitting blocks is implemented.
112       struct block_meta* block_ptr = get_block_ptr(ptr);
113       assert(block_ptr-&gt;free == 0);
114       assert(block_ptr-&gt;magic == 0x77777777 || block_ptr-&gt;magic == 0x12345678);
115       block_ptr-&gt;free = 1;
116       block_ptr-&gt;magic = 0x55555555;
117     }
</code></pre> <p>And then we can use <code>p</code> (for print) to see what's going on with the variables here:</p> <pre><code>(gdb) p ptr
$6 = (void *) 0x0
(gdb) p block_ptr
$7 = (struct block_meta *) 0xffffffffffffffe8
</code></pre> <p><code>ptr</code> is <code>0</code>, i.e., <code>NULL</code>, which is the cause of the problem: we forgot to check for NULL.</p> <p>Now that we've figured that out, let's try a slightly harder bug. Let's say that we decided to replace our struct with</p> <pre><code>struct block_meta {
  size_t size;
  struct block_meta *next;
  int free;
  int magic; // For debugging only. TODO: remove this in non-debug mode.
  char data[1];
};
</code></pre> <p>and then return <code>block-&gt;data</code> instead of <code>block+1</code> from malloc, with no other changes. This seems pretty similar to what we're already doing -- we just define a member that points to the end of the struct, and return a pointer to that.</p> <p>But here's what happens if we try to use our new malloc:</p> <pre><code>$ /bin/ls
Segmentation fault (core dumped)

gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
_IO_vfprintf_internal (s=s@entry=0x7fffff7ff5f0, format=format@entry=0x7ffff7567370 &quot;%s%s%s:%u: %s%sAssertion `%s' failed.\n%n&quot;, ap=ap@entry=0x7fffff7ff718) at vfprintf.c:1332
1332    vfprintf.c: No such file or directory.
1327    in vfprintf.c
</code></pre> <p>This isn't as nice as our last error -- we can see that one of our asserts failed, but gdb drops us into some print function that's being called when the assert fails. But that print function uses our buggy malloc and blows up!</p> <p>One thing we could do from here would be to inspect <code>ap</code> to see what <code>assert</code> was trying to print:</p> <pre><code>(gdb) p *ap
$4 = {gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffff7ff7f0, reg_save_area = 0x7fffff7ff730}
</code></pre> <p>That would work fine; we could poke around until we figure out what's supposed to get printed and figure out the fail that way. Some other solutions would be to write our own custom assert or to use the right hooks to prevent <code>assert</code> from using our malloc.</p> <p>But in this case, we know there are only a few asserts in our code. The one in malloc checking that we don't try to use this in a multithreaded program and the two in free checking that we're not freeing something we shouldn't.
Let's look at free first, by setting a breakpoint.</p> <pre><code>$ gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) break free
Breakpoint 1 at 0x400530
(gdb) run /bin/ls
Breakpoint 1, free (ptr=0x61c270) at malloc.c:112
112     if (!ptr) {
</code></pre> <p><code>block_ptr</code> isn't set yet, but if we use <code>s</code> a few times to step forward to after it's set, we can see what the value is:</p> <pre><code>(gdb) s
(gdb) s
(gdb) s
free (ptr=0x61c270) at malloc.c:118
118     assert(block_ptr-&gt;free == 0);
(gdb) p/x *block_ptr
$11 = {size = 0, next = 0x78, free = 0, magic = 0, data = &quot;&quot;}
</code></pre> <p>I'm using <code>p/x</code> instead of <code>p</code> so we can see it in hex. The <code>magic</code> field is 0, which should be impossible for a valid struct that we're trying to free. Maybe <code>get_block_ptr</code> is returning a bad offset? We have <code>ptr</code> available to us, so we can just inspect different offsets. Since it's a <code>void *</code>, we'll have to cast it so that gdb knows how to interpret the results.</p> <pre><code>(gdb) p sizeof(struct block_meta)
$12 = 32
(gdb) p/x *(struct block_meta*)(ptr-32)
$13 = {size = 0x0, next = 0x78, free = 0x0, magic = 0x0, data = {0x0}}
(gdb) p/x *(struct block_meta*)(ptr-28)
$14 = {size = 0x7800000000, next = 0x0, free = 0x0, magic = 0x0, data = {0x78}}
(gdb) p/x *(struct block_meta*)(ptr-24)
$15 = {size = 0x78, next = 0x0, free = 0x0, magic = 0x12345678, data = {0x6e}}
</code></pre> <p>If we back off a bit from the address we're using, we can see that the correct offset is 24 and not 32. What's happening here is that structs get padded, so that <code>sizeof(struct block_meta)</code> is 32, even though the last valid member is at <code>24</code>. If we want to cut out that extra space, we need to fix <code>get_block_ptr</code>.</p>
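<p>One way to do that (a sketch I'm adding here, not code from the original tutorial) is to compute the offset of <code>data</code> explicitly with <code>offsetof</code> instead of relying on <code>sizeof</code>, which includes the trailing padding:</p> <pre><code>#include &lt;stddef.h&gt; // for offsetof

// Hypothetical fix for the data[] variant of the struct: back up by the
// offset of the data member (24 here) rather than by
// sizeof(struct block_meta), which is padded out to 32.
struct block_meta *get_block_ptr(void *ptr) {
  return (struct block_meta *)((char *)ptr - offsetof(struct block_meta, data));
}
</code></pre>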
<p>That's it for debugging!</p> <h4 id="exercises">Exercises</h4> <p>Personally, this sort of thing never sticks with me unless I work through some exercises, so I'll leave a couple exercises here for anyone who's interested.</p> <ol> <li><p>malloc is supposed to return a pointer “which is suitably aligned for any built-in type”. Does our malloc do that? If so, why? If not, fix the alignment. Note that “any built-in type” is basically up to 8 bytes for C because SSE/AVX types aren't built-in types.</p></li> <li><p>Our malloc is really wasteful if we try to re-use an existing block and we don't need all of the space. Implement a function that will split up blocks so that they use the minimum amount of space necessary.</p></li> <li><p>After doing <code>2</code>, if we call malloc and free lots of times with random sizes, we'll end up with a bunch of small blocks that can only be re-used when we ask for small amounts of space. Implement a mechanism to merge adjacent free blocks together so that any consecutive free blocks will get merged into a single block.</p></li> <li><p>Find bugs in the existing code! I haven't tested this much, so I'm sure there are bugs, even if this basically kinda sorta works.</p></li> </ol> <h4 id="resources">Resources</h4> <p>As noted above, there's Marwan Burelle's <a href="http://www.inf.udec.cl/~leo/Malloc_tutorial.pdf">tutorial</a>.</p> <p>For more on how Linux deals with memory management, see <a href="http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/">this post</a> by Gustavo Duarte.</p> <p>For more on how real-world malloc implementations work, <a href="http://g.oswego.edu/dl/html/malloc.html">dlmalloc</a> and <a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html">tcmalloc</a> are both great reading. I haven't read the code for <a href="http://www.canonware.com/jemalloc/">jemalloc</a>, and I've heard that it's a bit more difficult to understand, but it's also the most widely used high-performance malloc implementation around.</p> <p>For help debugging, <a href="https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer">Address Sanitizer</a> is amazing. If you want to write a thread-safe version, <a href="https://code.google.com/p/data-race-test/wiki/ThreadSanitizer">Thread Sanitizer</a> is also a great tool.</p> <p>There's a <a href="http://mgarciaisaia.github.io/tutorial-c/blog/2014/12/26/un-tutorial-rapido-para-implementar-y-debuggear-malloc/">Spanish translation of this post here</a> thanks to Matias Garcia Isaia.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Gustavo Duarte for letting me use one of his images to illustrate sbrk, and to Ian Whitlock, Danielle Sucher, Nathan Kurz, &quot;tedu&quot;, @chozu@fedi.absturztau.be, and David Farrel for comments/corrections/discussion. Please <a href="https://twitter.com/danluu">let me know</a> if you find other bugs in this post (whether they're in the writing or the code).</p> Markets, discrimination, and "lowering the bar" tech-discrimination/ Mon, 01 Dec 2014 00:00:00 +0000 tech-discrimination/ <p>Public discussions of discrimination in tech often result in someone claiming that discrimination is impossible because of market forces. Here's a quote from Marc Andreessen that sums up a common view<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>.</p> <blockquote> <p>Let's launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that's the case. ... No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They're like lying on the beach gasping because they can't get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.</p> </blockquote> <p>Marc Andreessen's point is that the market is too competitive for discrimination to exist. But VC funded startups aren't the first companies in the world to face a competitive hiring market. Consider the market for PhD economists from, say, 1958 to 1987. Alan Greenspan had this to say about how that market looked to his firm, Townsend-Greenspan.</p> <blockquote> <p>Townsend-Greenspan was unusual for an economics firm in that the men worked for the women (we had about twenty-five employees in all). My hiring of women economists was not motivated by women's liberation. It just made great business sense. I valued men and women equally, and found that because other employers did not, good women economists were less expensive than men. Hiring women . . .
gave Townsend-Greenspan higher-quality work for the same money . . .</p> </blockquote> <p>Not only did competition not end discrimination, there was enough discrimination that the act of not discriminating provided a significant competitive advantage for Townsend-Greenspan. And this is in finance, which is known for being cutthroat. And not just any part of finance, but one where it's PhD economists hiring other PhD economists. This is one of the industries where the people doing the hiring are the most likely to be familiar with both the theoretical models and the empirical research showing that discrimination opens up market opportunities by suppressing wages of some groups. But even that wasn't enough to equalize wages between men and women when Greenspan took over Townsend-Greenspan in 1958, and it still wasn't enough when Greenspan left to become chairman of the Fed in 1987. That's the thing about discrimination. When it's part of a deep-seated belief, it's hard for people to tell that they're discriminating.</p> <p>And yet, in discussions on tech hiring, people often claim that, since markets and hiring are perfectly competitive or efficient, companies must already be hiring the best people presented to them. A corollary of this is that anti-discrimination or diversity oriented policies necessarily mean &quot;lowering the bar&quot; since these would mean diverging from existing optimal hiring practices. And conversely, even when &quot;market forces&quot; aren't involved in the discussion, claiming that increasing hiring diversity necessarily means &quot;lowering the bar&quot; relies on an assumption of a kind of optimality in hiring. I think that an examination of tech hiring practices makes it pretty clear that <a href="programmer-moneyball/">practices are</a> <a href="algorithms-interviews/">far from optimal</a>, but rather than address this claim based on practices (which has been done in the linked posts), I'd like to look at the meta-claim that market forces make discrimination impossible. People make vague claims about market efficiency and economics, like <a href="https://twitter.com/rogerdickey/status/1268327765905772546">this influential serial founder who concludes his remarks on hiring with &quot;Capitalism is real and markets are efficient.&quot;</a><sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>. People seem to love handwave-y citations of &quot;the market&quot; or &quot;economists&quot;.</p> <p>But if we actually read what economists have to say on how hiring markets work, they do not, in general, claim that markets are perfectly efficient or that discrimination does not occur in markets that might colloquially be called highly competitive. Since we're talking about discrimination, a good place to start might be <a href="http://web.archive.org/web/20150401024833/http://economics.mit.edu/files/553">Becker's seminal work on discrimination</a>. What Becker says is that markets impose a cost on discrimination, and that under certain market conditions, what Becker calls &quot;taste-based&quot;<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">3</a></sup> discrimination occurring on average doesn't mean there's discrimination at the margin. This is quite a specific statement and, if you read other papers in the literature on discrimination, they also make similarly specific statements.
What you don't see is anything like the handwave-y claims in tech discussions, that &quot;market forces&quot; or &quot;competition&quot; is incompatible with discrimination or non-optimal hiring. Quite frankly, I've never had a discussion with someone who says things like &quot;Capitalism is real and markets are efficient&quot; where it appears that they have even a passing familiarity with Becker's seminal work in the field of the economics of discrimination or, for that matter, any other major work on the topic.</p> <p>In discussions among the broader tech community, I have never seen anyone make a case that the tech industry (or any industry) meets the conditions under which taste-based discrimination on average doesn't imply marginal taste-based discrimination. Nor have I ever seen people make the case that we only have taste-based discrimination or that we also meet the conditions for not having other forms of discrimination. When people cite &quot;efficient markets&quot; with respect to hiring or other parts of tech, it's generally vague handwaving that sounds like an appeal to authority, but the authority is what someone might call a teenage libertarian's idea of how markets might behave.</p> <p>Since people often don't find abstract reasoning of the kind you see in Becker's work convincing, let's look at a few concrete examples. You can see discrimination in a lot of fields. A problem is that it's hard to separate out the effect of discrimination from confounding variables because it's hard to get good data on employee performance v. compensation over time. Luckily, there's one set of fields where that data is available: sports. And before we go into the examples, it's worth noting that we should, directionally, expect much less discrimination in sports than in tech. Not only is there much better data available on employee performance, it's easier to predict future employee performance from past performance, the impact of employee performance on &quot;company&quot; performance is greater and easier to quantify, and the market is more competitive. Relative to tech, these forces both increase the cost of discrimination and make the cost more visible.</p> <p>In baseball, <a href="https://ideas.repec.org/a/ucp/jpolec/v82y1974i4p873-81.html">Gwartney and Haworth (1974)</a> found that teams that discriminated less against non-white players in the decade following de-segregation performed better. Studies of later decades using “classical” productivity metrics mostly found that salaries equalize. However, <a href="http://www.hardballtimes.com/searching-for-racial-earnings-differentials-in-major-league-baseball/">Swartz (2014)</a>, using newer and more accurate metrics for productivity, found that Latino players are significantly underpaid for their productivity level. Compensation isn't the only way to discriminate -- <a href="http://sf.oxfordjournals.org/content/67/2/524.short">Jibou (1988)</a> found that black players had higher exit rates from baseball after controlling for age and performance. This should sound familiar to anyone who's wondered about <a href="http://leanangry.tumblr.com/post/125716699460/your-pipeline-argument-is-bullshit">exit rates in tech fields</a>.</p> <p>This slow effect of the market isn't limited to baseball; it actually seems to be worse in other sports.
A review article by <a href="http://www.jstor.org/stable/2524152">Kahn (1991)</a> notes that in basketball, the most recent studies (up to the date of the review) found an 11%-25% salary penalty for black players as well as a higher exit rate. Kahn also noted multiple studies showing discrimination against French-Canadians in hockey, which is believed to be due to stereotypes about how French-Canadian men are less masculine than other men<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup>.</p> <p>In tech, some people are concerned that increasing diversity will &quot;lower the bar&quot;, but in sports, which has a more competitive hiring market than tech, we saw the opposite: increasing diversity raised the level instead of lowering it because it means hiring people on their qualifications instead of on what they look like. I don't disagree with people who say that it would be absurd for tech companies to leave money on the table by not hiring qualified minorities. But this is exactly what we saw in the sports we looked at, where that's even more absurd due to the relative ease of quantifying performance. And yet, for decades, teams left huge amounts of money on the table by favoring white players (and, in the case of hockey, non-French Canadian players) who were, quite simply, less qualified than their peers. <a href="https://twitter.com/danluu/status/1270181411534761984">The world is an absurd place</a>.</p> <p>In fields where there's enough data to see if there might be discrimination, we often find discrimination. Even in fields that are among the most competitive fields in existence, like major professional sports. Studies on discrimination aren't limited to empirical studies and data mining. There have been experiments showing discrimination at every level, from initial resume screening to phone screening to job offers to salary negotiation to workplace promotions. And those studies are mostly in fields where there's something resembling gender parity. In fields where discrimination is weak enough that there's gender parity or racial parity in entrance rates, we can see steadily decreasing levels of discrimination over the last two generations. Discrimination hasn't been eliminated, but it's much reduced.</p> <p><img src="images/tech-discrimination/cs_enrollment.png" alt="Graph of enrollment by gender in med school, law school, the sciences, and CS. Graph courtesy of NPR." width="568" height="418"></p> <p>And then we have computer science. The disparity in entrance rates is about what it was for medicine, law, and the physical sciences in the 70s. As it happens, the excuses for the gender disparity are the exact same excuses that were trotted out in the 70s to justify why women didn't want to go into or couldn't handle technical fields like medicine, economics, finance, and biology.</p> <p>One argument that's commonly made is that women are inherently less interested in the &quot;harder&quot; sciences, so you'd expect more women to go into biology or medicine than programming. There are two major reasons I don't find that line of reasoning to be convincing. First, proportionally more women go into fields like math and chemical engineering than go into programming.
I think it's pointless to rank math and the sciences by how &quot;hard science&quot; they are, but if you ask people to rank these things, most people will put math above programming and if they know what's involved in a chemical engineering degree, I think they'll also put chemical engineering above programming and yet those fields have proportionally more women than programming. Second, if you look at other countries, they have wildly different proportions of people who study computer science for reasons that seem to mostly be cultural. Given that we do see all of this variation, I don't see any reason to think that the U.S. reflects the &quot;true&quot; rate that women want to study programming and that countries where (proportionally) many more women want to study programming have rates that are distorted from the &quot;true&quot; rate by cultural biases.</p> <p>Putting aside theoretical arguments, I wonder how it is that I've had such a different lived experience than Andreessen. His reasoning must sound reasonable in his head and stories of discrimination from women and minorities must not ring true. But to me, it's just the opposite.</p> <p>Just the other day, I was talking to John (this and all other names were chosen randomly in order to maintain anonymity), a friend of mine who's a solid programmer. It took him two years to find a job, which is shocking in today's job market for someone my age, but sadly normal for someone like him, who's twice my age.</p> <p>You might wonder if it's something about John besides his age, but when a Google coworker and I mock interviewed him he did fine. I did the standard interview training at Google and I interviewed for Google, and when I compare him to that bar, I'd say that his getting hired at Google would pretty much be a coin flip. Yes on a good day; no on a bad day. And when he interviewed at Google, he didn't get an offer, but he passed the phone screen and after the on-site they strongly suggested that he apply again in a year, which is a good sign. But most places wouldn't even talk to John.</p> <p>And even at Google, which makes a lot of hay about removing bias from their processes, the processes often fail to do so. When I referred Mary to Google, she got rejected in the recruiter phone screen as not being technical enough and I saw William face increasing levels of ire from a manager because of a medical problem, which eventually caused him to quit.</p> <p>Of course, in online discussions, people will call into question the technical competency of people like Mary. Well, Mary is one of the most impressive engineers I've ever met in any field. People mean different things when they say that, so let me provide a frame of reference: the other folks who fall into that category for me include an <a href="http://en.wikipedia.org/wiki/IBM_Fellow">IBM Fellow</a>, the person that IBM Fellow called the best engineer at IBM, a Math Olympiad medalist who's now a professor at CMU, a distinguished engineer at Sun, and a few other similar folks.</p> <p>So anyway, Mary gets on the phone with a Google recruiter. The recruiter makes some comments about how Mary has a degree in math and not CS, and might not be technical enough, and questions Mary's programming experience: was it “algorithms” or “just coding”? It goes downhill from there.</p> <p>Google has plenty of engineers without a CS degree, people with degrees in history, music, and the arts, and lots of engineers without any degree at all, not even a high school diploma. 
But somehow a math degree plus my internal referral mentioning that this was one of the best engineers I've ever seen resulted in the decision that Mary wasn't technical enough.</p> <p>You might say that, like the example with John, this is some kind of a fluke. Maybe. But from what I've seen, if Mary were a man and not a woman, the odds of a fluke would have been lower.</p> <p>This dynamic isn't just limited to hiring. I notice it every time I read the comments on one of Anna's blog posts. As often as not, someone will question Anna's technical chops. It's not even that they find a &quot;well, actually&quot; in the current post (although that sometimes happens); it's usually that they dig up some post from six months ago which, according to them, wasn't technical enough.</p> <p>I'm no more technical than Anna, but I have literally never had that happen to me. I've seen it happen to men, but only those who are extremely high profile (among the top N most well-known tech bloggers, like Steve Yegge or Jeff Atwood), or who are pushing an agenda that's often condescended to (like dynamic languages). But it regularly happens to moderately well-known female bloggers like Anna.</p> <p>Differential treatment of women and minorities isn't limited to hiring and blogging. I've lost track of the number of times a woman has offhandedly mentioned to me that some guy assumed she was a recruiter, a front-end dev, a wife, a girlfriend, or a UX consultant. It happens everywhere. At conferences. At parties full of devs. At work. Everywhere. Not only has that never happened to me, the opposite regularly happens to me -- if I'm hanging out with physics or math grad students, people assume I'm a fellow grad student.</p> <p>When people bring up the market in discussions like these, they make it sound like it's a force of nature. It's not. It's just a word that describes the collective actions of people under some circumstances. Mary's situation didn't automatically get fixed because it's a free market. Mary's rejection by the recruiter got undone when I complained to my engineering director, who put me in touch with an HR director who patiently listened to the story and overturned the decision<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">5</a></sup>. The market is just humans. 
It's humans all the way down.</p> <p>We can fix this, if we stop assuming the market will fix it for us.</p> <h3 id="appendix-a-few-related-items">Appendix: a few related items</h3> <ul> <li><a href="//danluu.com/gender-gap/">People literally read the opposite result into studies they look at</a></li> <li>At <a href="http://fse22.gatech.edu/">FSE</a> this year, a speaker noted that if you show a bunch of programmers random data, they will interpret that data as supporting their prior beliefs about best practices</li> <li><a href="https://ed.stanford.edu/sites/default/files/uhlmann_et_2005.pdf">The more biased people are, the more objective they think they are</a></li> </ul> <p>Also, note that although this post was originally published in 2014, it was updated in 2020 with links to some more recent comments and a bit of re-organization.</p> <p><small> Thanks to Leah Hanson, Kelley Eskridge, Lindsey Kuper, Nathan Kurz, Scott Feeney, Katerina Barone-Adesi, Yuri Vishnevsky, @teles_dev, &quot;Negative12DollarBill&quot;, and Patrick Roberts for feedback on this post, and to Julia Evans for encouraging me to post this when I was on the fence about writing this up publicly.</p> <p>Note that all names in this post are aliases, taken from a list of common names in the U.S. as of 1880. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">If you're curious what his “No. 1” was, it was that there can't be discrimination because just look at all the diversity we have. Chinese. Indians. Vietnamese. And so on. The argument is that it's not possible that we're discriminating against some groups because we're not discriminating against other groups. In particular, it's not possible that we're discriminating against groups that don't fit the stereotypical engineer mold because we're not discriminating against groups that do fit the stereotypical engineer mold. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:A">See also this comment by Benedict Evans &quot;refuting&quot; a comment that SV companies may have sub-optimal hiring practices for employees by saying &quot;<a href="https://twitter.com/benedictevans/status/1274124694913011715">I don’t have to tell you that there is a ferocious war for talent in the valley.</a>&quot;. That particular comment isn't one about diversity or discrimination, but the general idea that the SV job market somehow enforces a kind of optimality is pervasive among SV thought leaders. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:T"><p>&quot;taste-based&quot; discrimination is discrimination based on preferences that are unrelated to any actual productivity differences between groups that might exist. Of course, it's common for people to claim that <a href="https://lwn.net/Articles/824148/">they've never seen racism or sexism in some context</a>, often with the implication and sometimes with an explicit claim that any differences we see are due to population level differences. If that were the case, we'd want to look at the literature on &quot;statistical&quot; discrimination. However, statistical discrimination doesn't seem like it should be relevant to this discussion.
A contrived example of a case where statistical discrimination would be relevant is if we had to hire basketball players solely off of their height and weight with no ability to observe their play, either directly or statistically.</p> <p>In that case, teams would want to exclusively hire tall basketball players, since, if all you have to go on is height, height is a better proxy for basketball productivity than nothing. However, if we consider the non-contrived example of actual basketball productivity and compare the actual productivity of NBA basketball players vs. their height, there is (with the exception of outliers who are very unusually short for basketball players) no correlation between height and performance. The reason is that, if we can measure performance directly, we can simply hire based on performance, which takes height out of the performance equation. The exception to this is for very short players, who have to overcome biases (taste-based discrimination) that cause people to overlook them.</p> <p>While measures of programming productivity are quite poor, the actual statistical correlation between race and gender and productivity among the entire population is zero as best anyone can tell, making statistical discrimination irrelevant.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:E">The evidence here isn't totally unequivocal. In the review, Kahn notes that for some areas, there are early studies finding no pay gap, but those were done with small samples of players. Also, Kahn notes that (at the time) there wasn't enough evidence in football to say much either way. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:B">In the interest of full disclosure, this <a href="http://fromthearchives.blogspot.com/2006/06/always-bid.html">didn't change the end result</a>, since Mary didn't want to have anything to do with Google after the first interview. Given that the first interview went how it did, making that Mary's decision and not Google's was probably the best likely result, though, and from the comments I heard from the HR director, it sounded like there might be a lower probability of the same thing happening again in the future. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> TF-IDF linux commits linux-devs-say/ Mon, 24 Nov 2014 00:00:00 +0000 linux-devs-say/ <p>I was curious what different people worked on in Linux, so I tried grabbing data from the current git repository to see if I could pull that out of commit message data. This doesn't include history from before they switched to git, so it only goes back to 2005, but that's still a decent chunk of history.</p> <p>Here's a list of the most commonly used words (in commit messages), by the top four most frequent committers, with users ordered by number of commits.</p> <p></p> <table> <thead> <tr> <th>User</th> <th>1</th> <th>2</th> <th>3</th> <th>4</th> <th>5</th> </tr> </thead> <tbody> <tr> <td>viro</td> <td>to</td> <td>in</td> <td>of</td> <td>and</td> <td>the</td> </tr> <tr> <td>tiwai</td> <td>alsa</td> <td>the</td> <td>-</td> <td>for</td> <td>to</td> </tr> <tr> <td>broonie</td> <td>the</td> <td>to</td> <td>asoc</td> <td>for</td> <td>a</td> </tr> <tr> <td>davem</td> <td>the</td> <td>to</td> <td>in</td> <td>and</td> <td>sparc64</td> </tr> </tbody> </table> <p><br> Alright, so their most frequently used words are <code>to</code>, <code>alsa</code>, <code>the</code>, and <code>the</code>.
Turns out, Takashi Iwai (tiwai) often works on audio (alsa), and by going down the list we can see that David Miller's (davem) fifth most frequently used term is <code>sparc64</code>, which is a pretty good indicator that he does a lot of sparc work. But the table is mostly noise. Of course people use <code>to</code>, <code>in</code>, and other common words all the time! Putting that into a table provides zero information.</p> <p>There are a number of standard techniques for dealing with this. One is to explicitly filter out &quot;stop words&quot;, common words that we don't care about. Unfortunately, that doesn't work well with this dataset without manual intervention. Standard stop-word lists are going to miss things like <code>Signed-off-by</code> and <code>cc</code>, which are pretty uninteresting. We can generate a custom list of stop words using some threshold for common words in commit messages, but any threshold high enough to catch all of the noise is also going to catch commonly used but interesting terms like <code>null</code> and <code>driver</code>.</p> <p>Luckily, it only takes about a minute to do by hand. After doing that, the result is that many of the top words are the same for different committers. I won't reproduce the table of top words by committer because it's just many of the same words repeated many times. Instead, here's the table of the top words (ranked by number of commit messages that use the word, not raw count), with stop words removed, which has the same data without the extra noise of being broken up by committer.</p> <table> <thead> <tr> <th>Word</th> <th>Count</th> </tr> </thead> <tbody> <tr> <td>driver</td> <td>49442</td> </tr> <tr> <td>support</td> <td>43540</td> </tr> <tr> <td>function</td> <td>43116</td> </tr> <tr> <td>device</td> <td>32915</td> </tr> <tr> <td>arm</td> <td>28548</td> </tr> <tr> <td>error</td> <td>28297</td> </tr> <tr> <td>kernel</td> <td>23132</td> </tr> <tr> <td>struct</td> <td>18667</td> </tr> <tr> <td>warning</td> <td>17053</td> </tr> <tr> <td>memory</td> <td>16753</td> </tr> <tr> <td>update</td> <td>16088</td> </tr> <tr> <td>bit</td> <td>15793</td> </tr> <tr> <td>usb</td> <td>14906</td> </tr> <tr> <td>bug</td> <td>14873</td> </tr> <tr> <td>register</td> <td>14547</td> </tr> <tr> <td>avoid</td> <td>14302</td> </tr> <tr> <td>pointer</td> <td>13440</td> </tr> <tr> <td>problem</td> <td>13201</td> </tr> <tr> <td>x86</td> <td>12717</td> </tr> <tr> <td>address</td> <td>12095</td> </tr> <tr> <td>null</td> <td>11555</td> </tr> <tr> <td>cpu</td> <td>11545</td> </tr> <tr> <td>core</td> <td>11038</td> </tr> <tr> <td>user</td> <td>11038</td> </tr> <tr> <td>media</td> <td>10857</td> </tr> <tr> <td>build</td> <td>10830</td> </tr> <tr> <td>missing</td> <td>10508</td> </tr> <tr> <td>path</td> <td>10334</td> </tr> <tr> <td>hardware</td> <td>10316</td> </tr> </tbody> </table> <p><br> Ok, so there's been a lot of work on <code>arm</code>, lots of stuff related to <code>memory</code>, <code>null</code>, <code>pointer</code>, etc. But if we want to see what individuals work on, we'll need something else.</p> <p>That something else could be penalizing more common words without eliminating them entirely. A standard metric to normalize by is the inverse document frequency (IDF), <code>log(# of messages / # of messages with word)</code>.
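<p>Concretely, weighting each committer's word counts by this penalty only takes a few lines. Here's a minimal sketch (in Python rather than the Julia used for the original analysis), assuming the commit messages have already been pulled out of <code>git log</code> into (author, message) pairs, which isn't shown, and skipping the hand-picked stop words:</p> <pre><code>import math
from collections import Counter, defaultdict

def tf_idf_by_author(messages):
    # messages: list of (author, commit_message) pairs.
    doc_freq = Counter()                 # how many messages contain each word
    term_count = defaultdict(Counter)    # raw word counts per author
    for author, msg in messages:
        words = msg.lower().split()
        doc_freq.update(set(words))      # count each message at most once per word
        term_count[author].update(words)
    n_messages = len(messages)
    # Weight each count by log(# of messages / # of messages with word) and
    # sort each author's words from most to least distinctive.
    scores = {}
    for author, counts in term_count.items():
        weighted = [(w, c * math.log(n_messages / doc_freq[w])) for w, c in counts.items()]
        scores[author] = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return scores

# e.g. the top five words for one committer: tf_idf_by_author(messages)['tiwai'][:5]
</code></pre> <p>Note that the document frequency here is computed over individual commit messages, matching the formula above, rather than over each committer's messages concatenated into one big document; either choice works, but they penalize common words differently.</p>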
So instead of ordering by term count or term frequency, let's try ordering by <code>(term frequency) * log(# of messages / # of messages with word)</code>, which is commonly called TF-IDF<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">1</a></sup>. This gives us words that one person used that aren't commonly used by other people.</p> <p>Here's a list of the top 40 linux committers and their most commonly used words, according to TF-IDF.</p> <table> <thead> <tr> <th>User</th> <th>1</th> <th>2</th> <th>3</th> <th>4</th> <th>5</th> </tr> </thead> <tbody> <tr> <td>viro</td> <td>switch</td> <td>annotations</td> <td>patch</td> <td>of</td> <td>endianness</td> </tr> <tr> <td>tiwai</td> <td>alsa</td> <td>hda</td> <td>codec</td> <td>codecs</td> <td>hda-codec</td> </tr> <tr> <td>broonie</td> <td>asoc</td> <td>regmap</td> <td>mfd</td> <td>regulator</td> <td>wm8994</td> </tr> <tr> <td>davem</td> <td>sparc64</td> <td>sparc</td> <td>we</td> <td>kill</td> <td>fix</td> </tr> <tr> <td>gregkh</td> <td>cc</td> <td>staging</td> <td>usb</td> <td>remove</td> <td>hank</td> </tr> <tr> <td>mchehab</td> <td>v4l/dvb</td> <td>media</td> <td>at</td> <td>were</td> <td>em28xx</td> </tr> <tr> <td>tglx</td> <td>x86</td> <td>genirq</td> <td>irq</td> <td>prepare</td> <td>shared</td> </tr> <tr> <td>hsweeten</td> <td>comedi</td> <td>staging</td> <td>tidy</td> <td>remove</td> <td>subdevice</td> </tr> <tr> <td>mingo</td> <td>x86</td> <td>sched</td> <td>zijlstra</td> <td>melo</td> <td>peter</td> </tr> <tr> <td>joe</td> <td>unnecessary</td> <td>checkpatch</td> <td>convert</td> <td>pr_<level></td> <td>use</td> </tr> <tr> <td>tj</td> <td>cgroup</td> <td>doesnt</td> <td>which</td> <td>it</td> <td>workqueue</td> </tr> <tr> <td>lethal</td> <td>sh</td> <td>up</td> <td>off</td> <td>sh64</td> <td>kill</td> </tr> <tr> <td>axel.lin</td> <td>regulator</td> <td>asoc</td> <td>convert</td> <td>thus</td> <td>use</td> </tr> <tr> <td>hch</td> <td>xfs</td> <td>sgi-pv</td> <td>sgi-modid</td> <td>remove</td> <td>we</td> </tr> <tr> <td>sachin.kamat</td> <td>redundant</td> <td>remove</td> <td>simpler</td> <td>null</td> <td>of_match_ptr</td> </tr> <tr> <td>bzolnier</td> <td>ide</td> <td>shtylyov</td> <td>sergei</td> <td>acked-by</td> <td>caused</td> </tr> <tr> <td>alan</td> <td>tty</td> <td>gma500</td> <td>we</td> <td>up</td> <td>et131x</td> </tr> <tr> <td>ralf</td> <td>mips</td> <td>fix</td> <td>build</td> <td>ip27</td> <td>of</td> </tr> <tr> <td>johannes.berg</td> <td>mac80211</td> <td>iwlwifi</td> <td>it</td> <td>cfg80211</td> <td>iwlagn</td> </tr> <tr> <td>trond.myklebust</td> <td>nfs</td> <td>nfsv4</td> <td>sunrpc</td> <td>nfsv41</td> <td>ensure</td> </tr> <tr> <td>shemminger</td> <td>sky2</td> <td>net_device_ops</td> <td>skge</td> <td>convert</td> <td>bridge</td> </tr> <tr> <td>bunk</td> <td>static</td> <td>needlessly</td> <td>global</td> <td>patch</td> <td>make</td> </tr> <tr> <td>hartleys</td> <td>comedi</td> <td>staging</td> <td>remove</td> <td>subdevice</td> <td>driver</td> </tr> <tr> <td>jg1.han</td> <td>simpler</td> <td>device_release</td> <td>unnecessary</td> <td>clears</td> <td>thus</td> </tr> <tr> <td>akpm</td> <td>cc</td> <td>warning</td> <td>fix</td> <td>function</td> <td>patch</td> </tr> <tr> <td>rmk+kernel</td> <td>arm</td> <td>acked-by</td> <td>rather</td> <td>tested-by</td> <td>we</td> </tr> <tr> <td>daniel.vetter</td> <td>drm/i915</td> <td>reviewed-by</td> <td>v2</td> <td>wilson</td> <td>vetter</td> </tr> <tr> <td>bskeggs</td> <td>drm/nouveau</td> <td>drm/nv50</td> <td>drm/nvd0/disp</td> <td>on</td> 
<td>chipsets</td> </tr> <tr> <td>acme</td> <td>galbraith</td> <td>perf</td> <td>weisbecker</td> <td>eranian</td> <td>stephane</td> </tr> <tr> <td>khali</td> <td>hwmon</td> <td>i2c</td> <td>driver</td> <td>drivers</td> <td>so</td> </tr> <tr> <td>torvalds</td> <td>linux</td> <td>commit</td> <td>just</td> <td>revert</td> <td>cc</td> </tr> <tr> <td>chris</td> <td>drm/i915</td> <td>we</td> <td>gpu</td> <td>bugzilla</td> <td>whilst</td> </tr> <tr> <td>neilb</td> <td>md</td> <td>array</td> <td>so</td> <td>that</td> <td>we</td> </tr> <tr> <td>lars</td> <td>asoc</td> <td>driver</td> <td>iio</td> <td>dapm</td> <td>of</td> </tr> <tr> <td>kaber</td> <td>netfilter</td> <td>conntrack</td> <td>net_sched</td> <td>nf_conntrack</td> <td>fix</td> </tr> <tr> <td>dhowells</td> <td>keys</td> <td>rather</td> <td>key</td> <td>that</td> <td>uapi</td> </tr> <tr> <td>heiko.carstens</td> <td>s390</td> <td>since</td> <td>call</td> <td>of</td> <td>fix</td> </tr> <tr> <td>ebiederm</td> <td>namespace</td> <td>userns</td> <td>hallyn</td> <td>serge</td> <td>sysctl</td> </tr> <tr> <td>hverkuil</td> <td>v4l/dvb</td> <td>ivtv</td> <td>media</td> <td>v4l2</td> <td>convert</td> </tr> </tbody> </table> <p><br> That's more like it. Some common words still appear -- this would really be improved with manual stop words to remove things like <code>cc</code> and <code>of</code>. But for the most part, we can see who works on what. Takashi Iwai (tiwai) spends a lot of time in hda land and working on codecs, David S. Miller (davem) has spent a lot of time on sparc64, Ralf Baechle (ralf) does a lot of work with mips, etc. And then again, maybe it's interesting that some, but not all, people <code>cc</code> other folks so much that it shows up in their top 5 list even after getting penalized by <code>IDF</code>.</p> <p>We can also use this to see the distribution of what people talk about in their commit messages vs. how often they commit.</p> <p><img src="images/linux-devs-say/null-percentile.png" alt="Who cares about null? These people!"></p> <p>This graph has people on the x-axis and relative word usage (ranked by TF-IDF) on the y-axis. On the x-axis, the most frequent committers are on the left and the least frequent are on the right. On the y-axis, points are higher up if that committer used the word <code>null</code> more frequently, and lower if the person used the word <code>null</code> less frequently.</p> <p><img src="images/linux-devs-say/posix-percentile.png" alt="Who cares about POSIX? Almost nobody!"></p> <p>Relatively, almost no one works on <code>POSIX</code> compliance. You can actually count the individual people who mentioned <code>POSIX</code> in commit messages.</p> <p>This is the point of the blog post where you might expect some kind of summary, or at least a vague point. Sorry. No such luck. I just did this because TF-IDF is one of a zillion concepts presented in the <a href="https://www.coursera.org/course/mmds">Mining Massive Data Sets</a> course running now, and I knew it wouldn't really stick unless I wrote some code.</p> <p>If you really must have a conclusion, TF-IDF is sometimes useful and incredibly easy to apply. You should use it when you should use it (when you want to see what words distinguish different documents/people from each other) and you shouldn't use it when you shouldn't use it (when you want to see what's common to documents/people). The end.</p> <p><small> I'm experimenting with blogging more by spending less time per post and just spewing stuff out in a 30-90 minute sitting.
Please <a href="https://twitter.com/danluu">let me know</a> if something is unclear or just plain wrong. Seriously. I went way over time on this one, but that's mostly because argh data and tables and bugs in Julia, not because of proofreading. I'm sure there are bugs!</p> <p>Thanks to Leah Hanson for finding a bunch of writing bugs in this post and to Zack Maril for a conversation on how to maybe display change over time in the future. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:H">I actually don't understand why it's standard to take the log here. Sometimes you want to take the log so you can work with smaller numbers, or so that you can convert a bunch of multiplies into a bunch of adds, but neither of those is true here. Please <a href="https://twitter.com/danluu">let me know</a> if this is obvious to you. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> One week of bugs everything-is-broken/ Tue, 18 Nov 2014 00:00:00 +0000 everything-is-broken/ <p>If I had to guess, I'd say I probably work around hundreds of bugs in an average week, and thousands in a bad week. It's not unusual for me to run into a hundred new bugs in a single week. But I often get skepticism when I mention that I run into multiple new (to me) bugs per day, and that this is inevitable if we don't change how we write tests. Well, here's a log of one week of bugs, limited to bugs that were new to me that week. After a brief description of the bugs, I'll talk about what we can do to improve the situation. The obvious answer is to spend more effort on testing, but everyone already knows we should do that and no one does it. That doesn't mean it's hopeless, though.</p> <h3 id="one-week-of-bugs">One week of bugs</h3> <h4 id="ubuntu">Ubuntu</h4> <p>When logging into my machine, I got a screen saying that I entered my password incorrectly. After a five second delay, it logged me in anyway. This is probably at least two bugs, perhaps more.</p> <h4 id="github">GitHub</h4> <p>GitHub switched from Pygments to whatever they use for Atom, <a href="http://www.greghendershott.com/2014/11/github-dropped-pygments.html">breaking syntax highlighting for most languages</a>. The HN comments on this indicate that it's not just something that affects obscure languages; Java, PHP, C, and C++ all have noticeable breakage.</p> <p><a href="https://github.com/github/linguist/issues/1717">In a GitHub issue</a>, a GitHub developer says</p> <p><small></p> <blockquote> <p>You're of course free to fork the Racket bundle and improve it as you see fit. I'm afraid nobody at GitHub works with Racket so we can't judge what proper highlighting looks like. But we'll of course pull your changes thanks to the magic of O P E N S O U R C E.</p> </blockquote> <p></small></p> <p>A bit ironic after the recent keynote talk by another GitHub employee titled <a href="http://zachholman.com/talk/move-fast-break-nothing/">“move fast and break nothing”</a>. Not to mention that it's unlikely to work. The last time I submitted a PR to linguist, it only got merged after I wrote <a href="//danluu.com/discourage-oss/">a blog post pointing out that they had 100s of open PRs, some of which were a year old</a>, which got them to merge a bunch of PRs after the post hit reddit. As far as I can tell, &quot;the magic of O P E N S O U R C E&quot; is code for the magic of hitting the front page of reddit/HN or having lots of twitter followers.</p> <p>Also, icons were broken for a while.
Was that this past week?</p> <h4 id="linkedin">LinkedIn</h4> <p>After replying to someone's “InMail”, I checked on it a couple days later, and their original message was still listed as unread (with no reply). Did it actually send my reply? I had no idea, until the other person responded.</p> <h4 id="inbox">Inbox</h4> <p>The Inbox app (not to be confused with Inbox App) notifies you that you have a new message before it actually downloads the message. It takes an arbitrary amount of time before the app itself gets the message, and refreshing in the app doesn't cause the message to download.</p> <p>The other problem with notifications is that they sometimes don't show up when you get a message. About half the time I get a notification from the gmail app, I also get a notification from the Inbox app. The other half of the time, the notification is dropped.</p> <p>Overall, I get a notification for a message that I can read maybe 1/3 of the time.</p> <h4 id="google-analytics">Google Analytics</h4> <p>Some locations near the U.S. (like Mexico City and Toronto) aren't considered worthy of getting their own country. The location map shows these cities sitting in the blue ocean that's outside of the U.S.</p> <h4 id="octopress">Octopress</h4> <p>Footnotes don't work correctly on the main page if you allow posts on the main page (instead of the index) and use the syntax to put something below the fold. Instead of linking to the footnote, you get a reference to anchor text that goes nowhere. This is in addition to the other footnote bug I already knew about.</p> <p>Tags are only downcased in some contexts but not others, which means that any tags with capitalized letters (sometimes) don't work correctly. I don't even use tags, but I noticed this on someone else's blog.</p> <p>My Atom feed <a href="https://twitter.com/chmaynard/status/534540677078855680">doesn't work correctly</a>.</p> <p>If you consider performance bugs to be problems, I noticed so many of those this past week that they <a href="//danluu.com/octopress-speedup/">have their own blog post</a>.</p> <h4 id="running-with-rifles-game">Running with Rifles (Game)</h4> <p>Weapons that are supposed to stun injure you instead. I didn't even realize that was a bug until someone mentioned that would be fixed in the next version.</p> <p>It's possible to stab people through walls.</p> <p>If you're holding a key when the level changes, your character keeps doing that action continuously during the next level, even after you've released the key.</p> <p>Your character's position will randomly get out of sync from the server. When that happens, the only reliable fix I've found is to randomly shoot for a while. Apparently shooting causes the client to do something like send a snapshot of your position to the server? Not sure why that doesn't just happen regularly.</p> <p>Vehicles can randomly spawn on top of you, killing you.</p> <p>You can randomly spawn under a vehicle, killing you.</p> <p>AI teammates don't consider walls or buildings when throwing grenades, which often causes them to kill themselves.</p> <p>Grenades will sometimes damage the last vehicle you were in even when you're nowhere near the vehicle.</p> <p>AI vehicles can get permanently stuck on pretty much any obstacle.</p> <p>This is the first video game I've played in about 15 years. I tend to think of games as being pretty reliable, but that's probably because games were much simpler 15 years ago. 
MS Paint doesn't have many bugs, either.</p> <p>Update: The sync issue above is caused by memory leaks. I originally thought that the game just had very poor online play code, but it turns out it's actually ok for the first 6 hours or so after a server restart. There are scripts around to restart the servers periodically, but they sometimes have bugs which cause them to stop running. When that happens on the official servers, the game basically becomes unplayable online.</p> <h4 id="julia">Julia</h4> <p>Unicode sequence causes match/ismatch to blow up with a bounds error.</p> <p>Unicode sequence causes using a string as a hash index to blow up with a bounds error.</p> <p>Exception randomly not caught by catch. This sucks because putting things in a try/catch was the workaround for the two bugs above. I've seen other variants of this before; it's possible this shouldn't count as a new bug because it might be the same root cause as some bug I've already seen.</p> <p>Function (I forget which) returns completely wrong results when given bad length arguments. You can even give it length arguments of the wrong type, and it will still “work” instead of throwing an exception or returning an error.</p> <p>If API design bugs count, methods that operate on iterables sometimes take the iterable as the first argument and sometimes don't. There are way too many of these to list. To take one example, <code>match</code> takes a regex first and a string second, whereas <code>search</code> takes a string first and a regex second. This week, I got bit by something similar on a numerical function.</p> <p>And of course I'm still running into the <a href="https://github.com/JuliaLang/julia/issues/8631">1+ month old bug that breaks convert</a>, which is pervasive enough that anything that causes it to happen renders Julia unusable.</p> <p>Here's one which might be an OS X bug? I had some bad code that caused an infinite loop in some Julia code. Nothing actually happened in the <code>while</code> loop, so it would just run forever. Oops. The bug is that this somehow caused my system to run out of memory and become unresponsive. Activity monitor showed that the kernel was taking an ever increasing amount of memory, which went away when I killed the Julia process.</p> <p>I won't list bugs in packages because there are too many. Even in core Julia, I've run into so many Julia bugs that I don't file bugs any more. It's just too much of an interruption. When I have some time, I should spend a day filing all the bugs I can remember, but I think it would literally take a whole day to write up a decent, reproducible bug report for each bug.</p> <p>See <a href="//danluu.com/julialang/">this post</a> for more on why I run into so many Julia bugs.</p> <h4 id="google-hangouts">Google Hangouts</h4> <p>On starting a hangout: &quot;This video call isn't available right now. Try again in a few minutes.&quot;.</p> <p>Same person appears twice in contacts list. Both copies have the same email listed, and double clicking on either brings me to the same chat window.</p> <h4 id="uw-health">UW Health</h4> <p>The latch mechanism isn't quite flush to the door on about 10% of lockers, so your locker won't actually be latched unless you push hard against the door while moving the latch to the closed position.</p> <p>There's no visual (or other) indication that the latch failed to latch.
As far as I can tell, the only way to check is to tug on the handle to see if the door opens after you've tried to latch it.</p> <h4 id="coursera-mining-massive-data-sets">Coursera, Mining Massive Data Sets</h4> <p>Selecting the correct quiz answer gives you 0 points. The workaround (independently discovered by multiple people on the forums) is to keep submitting until the correct answer gives you 1 point. This is a week after a quiz had incorrect answer options which resulted in there being no correct answers.</p> <h4 id="facebook">Facebook</h4> <p>If you do something “wrong” with the mouse while scrolling down on someone's wall, the blue bar at the top can somehow transform into a giant block the size of your cover photo that doesn't go away as you scroll down.</p> <p>Clicking on the activity sidebar on the right pops something that's under other UI elements, making it impossible to read or interact with.</p> <h4 id="pandora">Pandora</h4> <p>A particular station keeps playing electronic music, even though I hit thumbs down every time an electronic song comes on. The seed song was a song from a Disney musical.</p> <h4 id="dropbox-zulip">Dropbox/Zulip</h4> <p>An old issue is that you can't disable notifications from <code>@all</code> mentions. Since literally none of them have been relevant to me for as long as I can remember, and <code>@all</code> notifications outnumber other notifications, it means that the majority of notifications I get are spam.</p> <p>The new thing is that I tried muting the streams that regularly spam me, but the notification blows through the mute. My fix for that is that I've disabled all notifications, but now I don't get a notification if someone DMs me or uses <code>@danluu</code>.</p> <h4 id="chrome">Chrome</h4> <p>The Rust guide is unreadable with my version of chrome (no plug-ins).</p> <p><img src="images/everything-is-broken/rust_guide.png" alt="Unreadable quoted blocks" width="773" height="356"></p> <h4 id="google-docs">Google Docs</h4> <p>I tried co-writing a doc with <a href="http://rose.github.io/">Rose Ames</a>. Worked fine for me, but everything displayed as gibberish for her, so we switched to hackpad.</p> <p>I didn't notice this until after I tried hackpad, but Docs is really slow. Hackpad feels amazingly responsive, but it's really just that Docs is laggy. It's the same feeling I had after I tried fastmail. Gmail doesn't seem slow until you use something that isn't slow.</p> <h4 id="hackpad">Hackpad</h4> <p>Hours after the doc was created, it says “ROSE AMES CREATED THIS 1 MINUTE AGO.”</p> <p>The right hand side list, which shows who's in the room, has a stack of N people even though there are only 2 people.</p> <h4 id="rust">Rust</h4> <p>After all that, Rose and I worked through the Rust guide. I won't list the issues here because they're so long that our hackpad doc that's full of bugs is at least twice as long as this blog post. And this isn't a knock against the Rust docs, the docs are actually much better than for almost any other language.</p> <h3 id="wat">WAT</h3> <p><blockquote class="twitter-tweet" lang="en"><p>I&#39;m in a super good mood. Everything is still broken, but now it&#39;s funny instead of making me mad.</p>&mdash; Gary Bernhardt (@garybernhardt) <a href="https://twitter.com/garybernhardt/status/296033898822004738">January 28, 2013</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script></p> <p>What's going on here? 
If you include the bugs I'm not listing because the software is so buggy that listing all of the bugs would triple the length of this post, that's about 80 bugs in one week. And that's only counting bugs I hadn't seen before. How come there are so many bugs in everything?</p> <p>A common response to this sort of comment is that it's open source, you ungrateful sod, why don't you fix the bugs yourself? I do fix some bugs, but there literally aren't enough hours in a week for me to debug and fix every bug I run into. There's a tragedy of the commons effect here. If there are only a few bugs, developers are likely to fix the bugs they run across. But if there are so many bugs that making a dent is hopeless, a lot of people won't bother.</p> <p>I'm going to take a look at Julia because I'm already familiar with it, but I expect that it's no better or worse tested than most of these other projects (except for Chrome, which is relatively well tested). As a rough proxy for how much test effort has gone into it, it has 18k lines of test code. But that's compared to about 108k lines of code in <code>src</code> plus <code>Base</code>.</p> <p>At every place I've worked, a 2k LOC prototype that exists just so you can get preliminary performance numbers and maybe play with the API is expected to have at least that much in tests because otherwise how do you know that it's not so broken that your performance estimates aren't off by an order of magnitude? Since complexity doesn't scale linearly in LOC, folks expect a lot more test code as the prototype gets bigger.</p> <p>At 18k LOC in tests for 108k LOC of code, users are going to find bugs. A lot of bugs.</p> <p>Here's where I'm supposed to write an appeal to take testing more seriously and <a href="//danluu.com/empirical-pl/#fn2">put real effort into it</a>. But we all know that's not going to work. It would take another 90k LOC of tests to get Julia to be as well tested as a poorly tested prototype -- matching the 1:1 test-to-code ratio above means roughly 108k LOC of tests against the existing 18k (and that's falsely assuming complexity scales linearly with size). That's two person-years of work, not even including time to debug and fix bugs (which probably brings it closer to four or five years). Who's going to do that? No one. Writing tests is like writing documentation. Everyone already knows you should do it. Telling people they should do it adds zero information<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup>.</p> <p>Given that people aren't going to put any effort into testing, what's the best way to do it?</p> <p>Property-based testing. Generative testing. Random testing. Concolic testing (which was done long before the term was coined). Static analysis. <a href="//danluu.com/everything-is-broken/">Fuzzing</a>. <a href="//danluu.com/bugalytics/">Statistical bug finding</a>. There are lots of options. Some of them are actually the same thing because the terminology we use is inconsistent and buggy. I'm going to arbitrarily pick one to talk about, but they're all worth looking into.</p> <p>People are often intimidated by these, though. I've seen a lot of talks on these and they often make it sound like this stuff is really hard. <a href="http://embed.cs.utah.edu/csmith/">Csmith</a> is 40k LOC. <a href="https://code.google.com/p/american-fuzzy-lop/">American Fuzzy Lop</a>'s compile-time instrumentation is smart enough to <a href="http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">generate valid JPEGs</a>.
<a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2989">Sixth Sense</a> has the same kind of intelligence as American Fuzzy Lop in terms of exploration, and in addition, uses symbolic execution to exhaustively explore large portions of the state space; it will formally verify that your asserts hold if it's able to collapse the state space enough to exhaustively search it, otherwise it merely tries to get the best possible test coverage by covering different paths and states. In addition, it will use symbolic equivalence checking to check different versions of your code against each other.</p> <p>That's all really impressive, but you don't need a formal methods PhD to do this stuff. You can write a fuzzer that will shake out a lot of bugs in an hour<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>. Seriously. I'm a bit embarrassed to link to this, but <a href="https://github.com/danluu/Fuzz.jl">this fuzzer</a> was written in about an hour and found 20-30 bugs<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">3</a></sup>, including incorrect code generation, and crashes on basic operations like multiplication and exponentiation. My guess is that it would take another 2-3 hours to shake out another 20-30 bugs (with support for more types), and maybe another day of work to get another 20-30 (with very basic support for random expressions). I don't mention this because it's good. It's not. It's totally heinous. But that's the point. You can throw together an absurd hack in an hour and it will turn out to be pretty useful.</p> <p>Compared to writing unit tests by hand: even if I knew what the bugs were in advance, I'd be hard pressed to code fast enough to generate 30 bugs in an hour. 30 bugs in a day? Sure, but not if I don't already know what the bugs are in advance. This isn't to say that unit testing isn't valuable, but if you're going to spend a few hours writing tests, a few hours writing a fuzzer is going to go a longer way than a few hours writing unit tests. You might be able to hit 100 words a minute by typing, but your CPU can easily execute 200 billion instructions a minute. It's no contest.</p> <p>What does it really take to write a fuzzer? Well, you need to generate random inputs for a program. In <a href="https://github.com/danluu/Fuzz.jl">this case</a>, we're generating random function calls in some namespace. Simple. The only reason it took an hour was because I don't really get Julia's reflection capabilities well enough to easily generate random types, which resulted in my writing the type generation stuff by hand.</p> <p>This applies to a lot of different types of programs. Have a GUI? It's pretty easy to prod random UI elements. Read files or things off the network? Generating (or mutating) random data is straightforward. This is something anyone can do.</p> <p>But this isn't a silver bullet. Lackadaisical testing means that <a href="//danluu.com/cpu-bugs/">your users will find bugs</a>. 
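<p>Going back to what it takes to write that kind of fuzzer: here's roughly the shape of it. This is a minimal sketch in Python rather than Julia (the fuzzer linked above is in Julia), and the argument generators, the iteration count, and the set of &quot;boring&quot; exceptions are placeholders you'd tune for whatever you're actually testing:</p> <pre><code>import random
import traceback

def random_arg():
    # Placeholder value generators; swap in whatever types make sense
    # for the code under test (big ints, floats, lists, odd unicode).
    generators = [
        lambda: random.randint(-2**63, 2**63),
        lambda: random.random(),
        lambda: [random.randint(0, 10) for _ in range(random.randint(0, 5))],
        lambda: chr(random.randint(0, 0x10FFFF)),
    ]
    return random.choice(generators)()

def fuzz(module, iterations=10000):
    # Call random public functions from the module with random arguments
    # and report anything that blows up with an unexpected exception type.
    funcs = [getattr(module, name) for name in dir(module)
             if callable(getattr(module, name)) and not name.startswith('_')]
    for _ in range(iterations):
        f = random.choice(funcs)
        args = [random_arg() for _ in range(random.randint(0, 3))]
        try:
            f(*args)
        except (TypeError, ValueError, OverflowError):
            pass  # boring: rejecting bad arguments is the expected behavior
        except Exception:
            print('possible bug:', getattr(f, '__name__', f), args)
            traceback.print_exc()

# e.g.: import some_module_under_test; fuzz(some_module_under_test)
</code></pre> <p>It's the same basic idea as the fuzzer linked above: pick a function, pick some arguments, call it, and see what breaks. Most of the refinement (more types, random expressions) is just making the generators smarter.</p>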
However, even given that developers aren't going to spend nearly enough time on testing, we can do a lot better than we're doing right now.</p> <p><small></p> <h4 id="resources">Resources</h4> <p>There are a lot of great resources out there, but if you're just getting started, I found <a href="http://blog.regehr.org/archives/1039">this description of types of fuzzers</a> to be one of the most helpful (and simplest) things I've read.</p> <p>John Regehr has <a href="https://www.udacity.com/course/cs258">a udacity course on software testing</a>. I haven't worked through it yet (Pablo Torres just pointed to it), but given the quality of Dr. Regehr's writing, I expect the course to be good.</p> <p>For more on my perspective on testing, <a href="//danluu.com/testing/">there's this</a>.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Leah Hanson and Mindy Preston for catching writing bugs, to Steve Klabnik for explaining the cause/fix of the Chrome bug (bad/corrupt web fonts), and to Phillip Joseph for finding a markdown bug.</p> <p>I'm experimenting with blogging more by spending less time per post and just spewing stuff out in a 30-90 minute sitting. Please <a href="https://twitter.com/danluu">let me know</a> if something is unclear or just plain wrong. Seriously.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D"><p>If I were really trying to convince you of this, I'd devote a post to the business case, diving into the data and trying to figure out the cost of bugs. The short version of that unwritten post is that response times are well studied and it's known that 100ms of extra latency will cost you a noticeable amount of revenue. A 1s latency hit is a disaster. How do you think that compares to having your product not work at all?</p> <p>Compared to 100ms of latency, how bad is it when your page loads and then bugs out in a way that makes it totally unusable? What if it destroys user state and makes the user re-enter everything they wanted to buy into their cart? Removing one extra click is worth a huge amount of revenue, and now we're talking about adding 10 extra clicks or infinite latency to a random subset of users. And not a small subset, either. Want to stop lighting piles of money on fire? Write tests. If that's too much work, at least <a href="//danluu.com/bugalytics/">use the data you already have to find bugs</a>.</p> <p>Of course it's sometimes worth it to light piles of money on fire. Maybe your rocket ship is powered by flaming piles of money. If you're a very rapidly growing startup, a 20% increase in revenue might not be worth that much. It could be better to focus on adding features that drive growth. The point isn't that you should definitely write more tests; it's that you should definitely do the math to see if you should write more tests.</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:T">Plus debugging time. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:R">I really need to update the readme with more bugs. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> Speeding up this site by 50x octopress-speedup/ Mon, 17 Nov 2014 00:00:00 +0000 octopress-speedup/ <p>I've seen all these studies that show how a 100ms improvement in page load time has a significant effect on page views, conversion rate, etc., but I'd never actually tried to optimize my site. This blog is a static Octopress site, hosted on GitHub Pages.
Static sites are supposed to be fast, and GitHub Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?</p> <p>Not having done this before, I didn't know what to do. But in a great talk on how the internet works, <a href="https://twitter.com/danielespeset">Dan Espeset</a> suggested trying <a href="http://www.webpagetest.org">webpagetest</a>; let's give it a shot.</p> <p></p> <p>Here's what it shows with my nearly stock Octopress setup<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>. The only changes I'd made were enabling Google Analytics, the social media buttons at the bottom of posts, and adding CSS styling for tables (which are, by default, unstyled and unreadable).</p> <p><img src="images/octopress-speedup/octo_initial.png" alt="time to start rendering: 9.7s"> <img src="images/octopress-speedup/octo_initial_detail.png" alt="time to visual completion: 10.9s"></p> <p>12 seconds to the first page view! What happened? I thought static sites were supposed to be fast. The first byte gets there in less than half a second, but the page doesn't start rendering until 9 seconds later.</p> <p><img src="images/octopress-speedup/octo_initial_waterfall.png" alt="Lots of js, CSS, and fonts"></p> <p>Looks like the first thing that happens is that we load a bunch of <code>js</code> and <code>CSS</code>. Looking at the source, we have all this <code>js</code> in <code>source/_includes/head.html</code>.</p> <pre><code>&lt;script src=&quot;{{ root_url }}/javascripts/modernizr-2.0.js&quot;&gt;&lt;/script&gt; &lt;script src=&quot;//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js&quot;&gt;&lt;/script&gt; &lt;script&gt;!window.jQuery &amp;&amp; document.write(unescape('%3Cscript src=&quot;./javascripts/lib/jquery.min.js&quot;%3E%3C/script%3E'))&lt;/script&gt; &lt;script src=&quot;{{ root_url }}/javascripts/octopress.js&quot; type=&quot;text/javascript&quot;&gt;&lt;/script&gt; {% include google_analytics.html %} </code></pre> <p>I don't know anything about web page optimization, but Espeset mentioned that <code>js</code> will stall page loading and rendering. What if we move the scripts to <code>source/_includes/custom/after_footer.html</code>?</p> <p><img src="images/octopress-speedup/octo_js_after_footer.png" alt="time to start rendering: 4.9s"></p> <p>That's a lot better! We've just saved about 4 seconds on load time and on time to start rendering.</p> <p>Those script tags load modernizer, jquery, octopress.js, and some google analytics stuff. What is in this <code>octopress.js</code> anyway? It's mostly code to support stuff like embedding flash videos, delicious integration, and github repo integration. There are a few things that do get used for my site, but most of that code is dead weight.</p> <p>Also, why are there multiple <code>js</code> files? Espeset also mentioned that connections are finite resources, and that we'll run out of simultaneous open connections if we have a bunch of different files. Let's strip out all of that unused <code>js</code> and combine the remaining <code>js</code> into a single file.</p> <p><img src="images/octopress-speedup/octo_one_js_file.png" alt="time to start rendering: 1.4s"></p> <p>Much better! But wait a sec. What do I need <code>js</code> for? As far as I can tell, the only thing my site is still using octopress's <code>js</code> for is so that you can push the right sidebar back and forth by clicking on it, and jquery and modernizer are only necessary for the js used in octopress. 
I never use that, and according to in-page analytics no one else does either. Let's get rid of it.</p> <p><img src="images/octopress-speedup/octo_no_js.png" alt="time to start rendering: .7s"> <img src="images/octopress-speedup/octo_no_js_detail.png" alt="time to visual completion: 1.2s"></p> <p>That didn't change total load time much, but the browser started rendering sooner. We're down to having the site visually complete after 1.2s, compared to 9.6s initially -- an 8x improvement.</p> <p>What's left? There's still some js for the twitter and fb widgets at the bottom of each post, but those all get loaded after things are rendered, so they don't really affect the user's experience, even though they make the “Load Time” number look bad.</p> <p><img src="images/octopress-speedup/octo_no_js_pie.png" alt="That's a lot of fonts!"></p> <p>This is a pie chart of how many bytes of my page are devoted to each type of file. Apparently, the plurality of the payload is spent on fonts. Despite my reference post being an unusually image heavy blog post, fonts are 43.8% and images are such a small percentage that webpagetest doesn't even list the number. Doesn't my browser already have some default fonts? Can we just use those?</p> <p><img src="images/octopress-speedup/octo_no_fonts.png" alt="time to start rendering: .6s"> <img src="images/octopress-speedup/octo_no_fonts_detail.png" alt="time to visual completion: .9s"></p> <p>Turns out, we can. The webpage is now visually complete in 0.9s -- a 12x improvement. The improvement isn't quite as dramatic for “Repeat View”<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">2</a></sup> -- it's only an 8.6x improvement there -- but that's still pretty good.</p> <p>The one remaining “obvious” issue is that the header loads two css files, one of which isn't minified. This uses up two connections and sends more data than necessary. Minifying the other css file and combining them speeds this up even further.</p> <p><img src="images/octopress-speedup/octo_one_css.png" alt="time to start rendering: .6s"> <img src="images/octopress-speedup/octo_one_css_detail.png" alt="time to visual completion: .7s"></p> <p>Time to visually complete is now 0.7s -- a 15.6x improvement<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>. And that's on a page that's unusually image heavy for my site.</p> <p><img src="images/octopress-speedup/octo_one_css_waterfall.png" alt="Mostly image load time"></p> <p>At this point the only things that happen before the page starts displaying are loading the HTML, loading the one <code>css</code> file, and loading the giant image (reliability.png).</p> <p>We've already minified the css, so the main thing left to do is to make the giant image smaller. I already ran <code>optipng -o7 -zm1-9</code> on all my images, but ImageOptim was able to shave off another 4% of the image, giving a slight improvement.
Across all the images in all my posts, ImageOptim was able to reduce images by an additional 20% over optipng, but it didn't help much in this case.</p> <p>I also tried specifying the size of the image to see if that would let the page render before the image was finished downloading, but it didn't result in much of a difference.</p> <p>After that, I couldn't think of anything else to try, but webpagetest had some helpful suggestions.</p> <p><img src="images/octopress-speedup/octo_suggestions.png" alt="Blargh github pages"></p> <p>Apparently, the server I'm on is slow (it gets a D in sending the first byte after the initial request). It also recommends caching static content, but when I look at the individual suggestions, they're mostly for widgets I don't host/control. I should use a CDN, but Github Pages doesn't put content on a CDN for bare domains unless you use a DNS alias record, and my DNS provider doesn't support alias records. That's two reasons to stop serving from Github Pages (or perhaps one reason to move off Github Pages and one reason to get another DNS provider), so I switched to Cloudflare, which shaved over 100ms off the time to first byte.</p> <p>Note that if you use Cloudflare for a static site, you'll want to create a &quot;Page Rule&quot; and enable &quot;Cache Everything&quot;. By default, Cloudflare doesn't cache HTML, which is sort of pointless on a static blog that's mostly HTML. If you've done the optimizations here, you'll also want to avoid their &quot;Rocket Loader&quot; thing which attempts to load js asynchronously by loading blocking javascript. &quot;Rocket Loader&quot; is like AMP, in that it can speed up large, bloated websites, but is big enough that it slows down moderately optimized websites.</p> <p>Here's what happened after I initially enabled Cloudflare without realizing that I needed to create a &quot;Page Rule&quot;.</p> <p><img src="images/octopress-speedup/octo_cloudflare_wat.png" alt="Cloudflare saves 80MB out of 1GB"></p> <p>That's about a day's worth of traffic in 2013. Initially, Cloudflare was serving my CSS and redirecting to Github Pages for the HTML. Then I inlined my CSS and Cloudflare literally did nothing. Overall, Cloudflare served 80MB out of 1GB of traffic because it was only caching images and this blog is relatively light on images.</p> <p>I haven't talked about inlining CSS, but it's easy and gives a huge speedup on the first visit since it means only one connection is required to display the page, instead of two sequential connections. It's a disadvantage on future visits since it means that the CSS has to be re-downloaded for each page, but since most of my traffic is from people running across a single blog post, who don't click through to anything else, it's a net win. In <code>_includes/head.html</code></p> <pre><code>&lt;link href=&quot;{{ root_url }}/stylesheets/all.css&quot; media=&quot;screen, projection&quot; rel=&quot;stylesheet&quot; type=&quot;text/css&quot;&gt; </code></pre> <p>should change to</p> <pre><code>{% include all.css %} </code></pre> <p>In addition, there's a lot of pointless cruft in the css. Removing the stuff that even someone who doesn't know CSS can spot as pointless (like support for delicious, support for Firefox 3.5 and below, lines that firefox flags as having syntax errors such as <code>no-wrap</code> instead of <code>nowrap</code>) cuts down the remaining CSS by about half.
There's a lot of duplication remaining and I expect that the CSS could be reduced by another factor of 4, but that would require actually knowing CSS. Just doing those things, we get down to .4s before the webpage is visually complete.</p> <p><img src="images/octopress-speedup/octo_inline_css.png" alt="Inlining css"></p> <p>That's a <code>10.9/.4 = 27.5</code> fold speedup. The effect on mobile is a lot more dramatic; there, it's closer to 50x.</p> <p>I'm not sure what to think about all this. On the one hand, I'm happy that I was able to get a 25x-50x speedup on my site. On the other hand, I associate speedups of that magnitude with porting plain Ruby code to optimized C++, optimized C++ to a GPU, or GPU to quick-and-dirty exploratory ASIC. How is it possible that someone with zero knowledge of web development can get that kind of speedup by watching one presentation and then futzing around for 25 minutes? I was hoping to maybe find 100ms of slack, but it turns out there's not just 100ms, or even 1000ms, but 10000ms of slack in an Octopress setup. According to a study I've seen, going from 1000ms to 3000ms costs you 20% of your readers and 50% of your click-throughs. I haven't seen a study that looks at going from 400ms to 10900ms because the idea that a website would be that slow is so absurd that people don't even look into the possibility. But many websites are that slow!<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">4</a></sup></p> <h3 id="update">Update</h3> <p>I found it too hard to futz around with trimming down the massive CSS file that comes with Octopress, so I removed all of the CSS and then added a few lines to allow for a nav bar. This makes almost no difference on the desktop benchmark above, but it's a noticeable improvement for slow connections. <a href="//danluu.com/web-bloat/">The difference is quite dramatic for 56k connections as well as connections with high packet loss</a>.</p> <p>Starting the day I made this change, my analytics data shows a noticeable improvement in engagement and traffic. There are too many things confounded here to say what caused this change (performance increase, total lack of styling, etc.), but there are a couple of things I find interesting about this. First, it seems likely to show that the advice that it's very important to keep line lengths short is incorrect since, if that had a very large impact, it would've overwhelmed the other changes and resulted in reduced engagement and not increased engagement. Second, despite the Octopress design being widely used and lauded (it appears to have been the most widely used blog theme for programmers when I started my blog), it appears to cause a blog (or at least this blog) to get less readership than literally having no styling at all. Having no styling is surely not optimal, but there's something a bit funny about no styling beating the at-the-time most widely used programmer blog styling, which means it likely also beat wordpress, <a href="writing-non-advice/#svbtle">svbtle</a>, blogspot, medium, etc., since those have most of the same ingredients as Octopress.</p> <p><small></p> <h4 id="resources">Resources</h4> <p>Unfortunately, the video of the presentation I'm referring to is restricted to <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> alums. If you're an RC alum, <a href="https://community.hackerschool.com/t/monday-night-talks-84-daniel-espeset-on-the-anatomy-of-a-web-request/25">check this out</a>.
Otherwise, <a href="https://hpbn.co/">high-performance browser networking</a> is great, but much longer.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Leah Hanson, Daniel Espeset, and Hugo Jobling for comments/corrections/discussion.</p> <p>I'm not a front-end person, so I might be totally off in how I'm looking at these benchmarks. If so, please <a href="https://twitter.com/danluu">let me know</a>.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">From whatever version was current in September 2013. It's possible some of these issues have been fixed, but based on the extremely painful experience of other people who've tried to update their Octopress installs, it didn't seem worth making the attempt to get a newer version of Octopress. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:R">Why is “Repeat View” slower than “First View”? <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:1">If you look at a video of loading <a href="http://www.webpagetest.org/video/view.php?id=141115_AP_12A9.1.0">the original</a> vs. <a href="http://www.webpagetest.org/video/view.php?id=141116_RH_7WC.1.0">this version</a>, the difference is pretty dramatic. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:D">For example, <a href="//danluu.com/web-bloat/#appendix-experimental-caveats">slashdot</a> takes 15s to load over FIOS. The tests shown above were done on Cable, which is <a href="//danluu.com/web-bloat/">substantially slower</a>. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> </ol> </div> How often is the build broken? broken-builds/ Mon, 10 Nov 2014 00:00:00 +0000 broken-builds/ <p>I've noticed that builds are broken and tests fail a lot more often on open source projects than on “work” projects. I wasn't sure how much of that was my perception vs. reality, so I grabbed the Travis CI data for a few popular categories on GitHub<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup>.</p> <p></p> <p><img src="images/broken-builds/reliability.png" alt="Graph of build reliability. Props to nu, oryx, caffe, catalyst, and Scala." width="640" height="1280"></p> <p>For reference, at every place I've worked, two 9s of reliability (99% uptime) on the build would be considered bad. That would mean that the build is failing for over three and a half days a year, or seven hours per month. Even three 9s (99.9% uptime) is about forty-five minutes of downtime a month. That's kinda ok if there isn't a hard system in place to prevent people from checking in bad code, but it's quite bad for a place that's serious about having working builds.</p> <p>By contrast, 2 9s of reliability is way above average for the projects I pulled data for<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup> -- only 8 of 40 projects are that reliable. Almost twice as many projects -- 15 of 40 -- don't even achieve one 9 of uptime. And my sample is heavily biased towards reliable projects. These are projects that were well-known enough to be “featured” in a hand curated list by GitHub. That already biases the data right there.
And then I only grabbed data from the projects that care enough about testing to set up TravisCI<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>, which introduces an even stronger bias.</p> <p>To make sure I wasn't grabbing bad samples, I removed any initial set of failing tests (there are often a lot of fails as people try to set up Travis and have it misconfigured) and projects that use another system for tracking builds and only have Travis as an afterthought (like Rust)<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">4</a></sup>.</p> <p>Why doesn't the build fail all the time at work? Engineers don't like waiting for someone else to unbreak the build and managers can do the back-of-the-envelope calculation which says that N idle engineers * X hours of build breakage = $Y of wasted money.</p> <p>But that same logic applies to open source projects! Instead of wasting dollars, contributors' time is wasted.</p> <p>Web programmers are hyper-aware of how 100ms of extra latency on a web page load has a noticeable effect on conversion rate. Well, what's the effect on conversion rate when a potential contributor to your project spends 20 minutes installing dependencies and an hour building your project only to find the build is broken?</p> <p>I used to dig through these kinds of failures to find the bug, usually assuming that it must be some configuration issue specific to my machine. But having spent years debugging failures I run into with <code>make check</code> on a clean build, I've found that it's often just that someone checked in bad code. Nowadays, if I'm thinking about contributing to a project or trying to fix a bug and the build doesn't work, I move on to another project.</p> <p>The worst thing about regular build failures is that they're easy<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">5</a></sup> to prevent. Graydon Hoare literally calls keeping a clean build the “<a href="http://graydon2.dreamwidth.org/1597.html">not rocket science rule</a>”, and wrote an open source tool (bors) anyone can use to do not-rocket-science. And yet, most open source projects still suffer through broken and failed builds, along with the associated cost of lost developer time and lost developer “conversions”.</p> <p><small>Please don't read too much into the individual data in the graph. I find it interesting that DevOps projects tend to be more reliable than languages, which tend to be more reliable than web frameworks, and that ML projects are all over the place (but are mostly reliable). But when it comes to individual projects, all sorts of stuff can cause a project to have bad numbers.</p> <p>Thanks to Kevin Lynagh, Leah Hanson, Michael Smith, Katerina Barone-Adesi, and Alexey Romanov for comments.</p> <p>Also, props to Michael Smith of Puppetlabs for a friendly ping and working through the build data for puppet to make sure there wasn't a bug in my scripts. This is one of my most maligned blog posts because no one wants to believe the build for their project is broken more often than the build for other projects. But even though it only takes about a minute to pull down the data for a project and sanity check it using the links in this post, only one person actually looked through the data with me, while a bunch of people told me how it must quite obviously be incorrect without ever checking the data.</p> <p>This isn't to say that I don't have any bugs.
This is a quick hack that probably has bugs and I'm always happy to get bug reports! But some non-bugs that have been repeatedly reported are getting data from all branches instead of the main branch, getting data for all PRs and not just code that's actually checked in to the main branch, and using number of failed builds instead of the amount of time that the build is down. I'm pretty sure that you can check that any of those claims are false in about the same amount of time that it takes to make the claim, but that doesn't stop people from making the claim. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:C">Categories determined from GitHub's featured projects lists, which seem to be hand-curated. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B">Wouldn't it be nice if I had test coverage data, too? But I didn't try to grab it since this was a quick 30-minute project and coming up with cross-language test coverage comparisons isn't trivial. However, I spot-checked some projects and the ones that do poorly conform to an engineering version of what Tyler Cowen calls &quot;The Law of Below Averages&quot; -- projects that often have broken/failed builds also tend to have very spotty test coverage. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:1"><p>I used <a href="https://github.com/travis-ci/travis.rb">the official Travis API script</a>, modified to return build start time instead of build finish time. Even so, build start time isn't exactly the same as check-in time, which introduces some noise. Only data against the main branch (usually master) was used. Some data was incomplete because their script either got a 500 error from the Travis API server, or ran into a runtime syntax error. All errors happened with and without my modifications, which is pretty appropriate for this blog post.</p> <p>If you want to reproduce the results, apply <a href="https://github.com/danluu/dump/blob/master/travis-api/diff-from-original.diff">this patch</a> to the official script, run it with the appropriate options (usually with --branch master, but not always), and then aggregate the results. You can use <a href="https://github.com/danluu/dump/blob/master/travis-api/parse.jl">this script</a>, but if you don't have <a href="//danluu.com/julialang/">Julia</a> it may be easier to just do it yourself.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:T">I think I filtered out all the projects that were actually using a different testing service. Please let me know if there are any still in my list. This removed one project with one-tenth of a 9 and two projects with about half a 9. BTW, removing the initial Travis fails for these projects bumped some of them up between half a 9 and a full 9 and completely eliminated a project that's had failing Travis tests for over a year. The graph shown looks much better than the raw data, and it's still not good. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:E"><p>Easy technically. Hard culturally. Michael Smith brought up the issue of intermittent failures. When you get those, whether that's because the project itself is broken or because the CI build is broken, people will start checking in bad code.
There are environments where people don't do that -- for the better part of a decade, I worked at a company where people would track down basically any test failure ever, even (or especially) if the failure was something that disappeared with no explanation. How do you convince people to care that much? That's hard.</p> <p>How do you convince people to use a system like bors, where you don't have to care to avoid breaking the build? That's much easier, though still harder than the technical problems involved in building bors.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Literature review on the benefits of static types empirical-pl/ Fri, 07 Nov 2014 00:00:00 +0000 empirical-pl/ <p>There are some pretty strong statements about types floating around out there. The claims range from the oft-repeated phrase that when you get the types to line up, everything just works, to “not relying on type safety is unethical (if you have an SLA)”<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup>, &quot;<a href="https://news.ycombinator.com/item?id=4369359">It boils down to cost vs benefit, actual studies, and mathematical axioms, not aesthetics or feelings</a>&quot;, and <a href="https://twitter.com/posco/status/566130704157667328">I think programmers who doubt that type systems help are basically the tech equivalent of an anti-vaxxer</a>. The first and last of these statements are from &quot;types&quot; thought leaders who are widely quoted. There are probably plenty of strong claims about dynamic languages that I'd be skeptical of if I heard them, but I'm not in the right communities to hear the stronger claims about dynamically typed languages. Either way, it's rare to see people cite actual evidence.</p> <p>Let's take a look at the empirical evidence that backs up these claims.</p> <p></p> <p><a href="#wat_summary">Click here</a> if you just want to see the summary without having to wade through all the studies. The summary of the summary is that most studies find very small effects, if any. However, the studies probably don't cover contexts you're actually interested in. If you want the gory details, here's each study, with its abstract, and a short blurb about the study.</p> <h3 id="a-large-scale-study-of-programming-languages-and-code-quality-in-github-ray-b-posnett-d-filkov-v-devanbu-p-http-dl-acm-org-citation-cfm-id-2635922"><a href="http://dl.acm.org/citation.cfm?id=2635922">A Large Scale Study of Programming Languages and Code Quality in Github; Ray, B; Posnett, D; Filkov, V; Devanbu, P</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 Million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static v.s. dynamic typing, strong v.s. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. 
Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static and strongly typed languages.</p> </blockquote> <p><strong>Summary</strong></p> <p>The authors looked at the 50 most starred repos on github for each of the 20 most popular languages plus TypeScript (minus CSS, shell, and vim). For each of these projects, they looked at the languages used. The text in the body of the study doesn't support the strong claims made in the abstract. Additionally, the study appears to use a fundamentally flawed methodology that's not capable of revealing much information. Even if the methodology were sound, the study uses bogus data and has what Pinker calls the <a href="http://languagelog.ldc.upenn.edu/nll/?p=1897">igon value problem</a>.</p> <p>As Gary Bernhardt points out, the authors of the study <a href="https://twitter.com/garybernhardt/status/862907601557651457">seem to confuse memory safety and implicit coercion</a> and <a href="https://twitter.com/garybernhardt/status/862907391536279554">make other strange statements</a>, such as</p> <blockquote> <p>Advocates of dynamic typing may argue that rather than spend a lot of time correcting annoying static type errors arising from sound, conservative static type checking algorithms in compilers, it’s better to rely on strong dynamic typing to catch errors as and when they arise.</p> </blockquote> <p>The study uses the following language classification scheme:</p> <p><img src="images/empirical-pl/lang_classes.png" alt="Table of classifications"></p> <p>These classifications seem arbitrary and many people would disagree with some of these classifications. Since the results are based on aggregating results with respect to these categories, and the authors have chosen arbitrary classifications, this already makes the aggregated results suspect since they have a number of degrees of freedom here and they've made some odd choices.</p> <p>In order to get the language-level results, the authors looked at commit/PR logs to determine how many bugs there were for each language used. As far as I can tell, open issues with no associated fix don't count towards the bug count. Only commits that are detected by their keyword search technique were counted. With this methodology, the number of bugs counted will depend at least as strongly on the project's bug reporting culture as it does on the actual number of bugs.</p> <p>After determining the number of bugs, the authors ran a regression, controlling for project age, number of developers, number of commits, and lines of code.</p> <p><img src="images/empirical-pl/lang_defects.png" alt="Defect rate correlations"></p> <p>There are enough odd correlations here that, even if the methodology wasn't known to be flawed, I'd be skeptical that the authors have captured a causal relationship.
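<p>As a rough illustration of how sensitive the keyword-based bug counting described above is to commit-message culture, here's a minimal sketch of that general approach (my own toy example; the keywords and code are not the paper's actual tooling):</p> <pre><code>import re

# Label a commit as a "bug fix" if its message matches any keyword.
BUG_KEYWORDS = re.compile(r"\b(fix(es|ed)?|bug|defect|error|crash|fail(s|ed|ure)?)\b",
                          re.IGNORECASE)

def looks_like_bug_fix(commit_message):
    return bool(BUG_KEYWORDS.search(commit_message))

commits = [
    "Fix crash when config file is missing",   # counted
    "Refactor parser; no behavior change",     # not counted
    "Handle empty input (previously raised)",  # a real fix that is not counted
]
print(sum(looks_like_bug_fix(m) for m in commits), "of", len(commits), "commits flagged")
</code></pre> <p>A real fix whose message never uses one of the keywords is invisible to this kind of counting, while a project whose convention is to write &quot;fix&quot; in routine commit messages gets over-counted, which is why the resulting numbers say as much about reporting culture as about defects.</p>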
If you don't find it odd that Perl and Ruby are as reliable as each other and significantly more reliable than Erlang and Java (which are also equally reliable), which are significantly more reliable than Python, PHP, and C (which are similarly reliable), and that TypeScript is the safest language surveyed, then maybe this passes the sniff test for you, but even without reading further, this looks suspicious.</p> <p>For example, Erlang and Go are rated as having a lot of concurrency bugs, whereas Perl and CoffeeScript are rated as having few concurrency bugs. Is it more plausible that Perl and CoffeeScript are better at concurrency than Erlang and Go or that people tend to use Erlang and Go more when they need concurrency? The authors note that Go might have a lot of concurrency bugs because there's a good tool to detect concurrency bugs in Go, but they don't explore reasons for most of the odd intermediate results.</p> <p>As for TypeScript, Eirenarch has pointed out that the three projects they list as example TypeScript projects, which they call the &quot;top three&quot; TypeScript projects, are bitcoin, litecoin, and qBittorrent. <a href="http://bitcoin.stackexchange.com/questions/22311/why-does-github-say-that-the-bitcoin-project-is-74-typescript">These are C++ projects</a>. So the intermediate result appears to not be that TypeScript is reliable, but that projects mis-identified as TypeScript are reliable. Those projects are reliable because Qt translation files are identified as TypeScript and it turns out that, per line of code, giant dumps of config files from another project don't cause a lot of bugs. It's like saying that a project has few bugs per line of code because it has a giant README. This is the most blatant classification error, but it's far from the only one.</p> <p>For example, of what they call the &quot;top three&quot; perl projects, one is showdown, a javascript project, and one is rails-dev-box, a shell script and a vagrant file used to launch a Rails dev environment. Without knowing anything about the latter project, one might guess that it's not a perl project from its name, rails-dev-box, which correctly indicates that it's a rails-related project.</p> <p>Since this study uses Github's notoriously inaccurate code classification system to classify repos, it is, at best, a series of correlations with factors that are themselves only loosely correlated with actual language usage.</p> <p>There's more analysis, but much of it is based on aggregating the table above into categories based on language type. Since I'm skeptical of these results, I'm at least as skeptical of any results based on aggregating these results. This section barely even scratches the surface of this study. Even with just a light skim, we see multiple serious flaws, any one of which would invalidate the results, plus numerous <a href="http://languagelog.ldc.upenn.edu/nll/?p=1897">igon value problems</a>.
It appears that the authors didn't even look at the tables they put in the paper, since if they did, it would jump out that (just for example) they classified a project called &quot;rails-dev-box&quot; as one of the three biggest perl projects (it's a 70-line shell script used to spin up ruby/rails dev environments).</p> <h3 id="do-static-type-systems-improve-the-maintainability-of-software-systems-an-empirical-study-kleinschmager-s-hanenberg-s-robbes-r-tanter-e-stefik-a-http-pleiad-dcc-uchile-cl-papers-2012-kleinschmageral-icpc2012-pdf"><a href="http://pleiad.dcc.uchile.cl/papers/2012/kleinschmagerAl-icpc2012.pdf">Do Static Type Systems Improve the Maintainability of Software Systems? An Empirical Study Kleinschmager, S.; Hanenberg, S.; Robbes, R.; Tanter, E.; Stefik, A.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Static type systems play an essential role in contemporary programming languages. Despite their importance, whether static type systems influence human software development capabilities remains an open question. One frequently mentioned argument for static type systems is that they improve the maintainability of software systems - an often used claim for which there is no empirical evidence. This paper describes an experiment which tests whether static type systems improve the maintainability of software systems. The results show rigorous empirical evidence that static type are indeed beneficial to these activities, except for fixing semantic errors.</p> </blockquote> <p><strong>Summary</strong></p> <p>While the abstract talks about general classes of languages, the study uses Java and Groovy.</p> <p>Subjects were given classes in which they had to either fix errors in existing code or fill out stub methods. Static classes for Java, dynamic classes for Groovy. In cases of type errors (and the corresponding no-method runtime errors in Groovy), developers solved the problem faster in Java. For semantic errors, there was no difference.</p> <p>The study used a <a href="https://en.wikipedia.org/wiki/Repeated_measures_design">within-subject design</a>, with randomized task order over 33 subjects.</p> <p>A notable limitation is that the study avoided using “complicated control structures”, such as loops and recursion, because those increase variance in time-to-solve. As a result, all of the bugs are trivial bugs. This can be seen in the median time to solve the tasks, which is in the hundreds of seconds. Tasks can include multiple bugs, so the time per bug is quite low.</p> <p><img src="images/empirical-pl/groovy_v_java.png" alt="Groovy is both better and worse than Java"></p> <p>This paper mentions that its results contradict some prior results, and one of the possible causes they give is that their tasks are more complex than the tasks from those other papers. The fact that the tasks in this paper don't involve loops or recursion because they're too complicated should give you an idea of the complexity of the tasks involved in most of these papers.</p> <p>Other limitations in this experiment were that the variables were artificially named such that there was no type information encoded in any of the names, that there were no comments, and that there was zero documentation on the APIs provided.
That's an unusually hostile environment to find bugs in, and it's not clear how the results generalize if any form of documentation is provided.</p> <p>Additionally, even though the authors specifically picked trivial tasks in order to minimize the variance between programmers, the variance between programmers was still much greater than the variance between languages in all but two tasks. Those two tasks were both cases of a simple type error causing a run-time exception that wasn't near the type error.</p> <h3 id="a-controlled-experiment-to-assess-the-benefits-of-procedure-argument-type-checking-prechelt-l-tichy-w-f-http-ieeexplore-ieee-org-xpl-articledetails-jsp-arnumber-677186"><a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=677186">A controlled experiment to assess the benefits of procedure argument type checking, Prechelt, L.; Tichy, W.F.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Type checking is considered an important mechanism for detecting programming errors, especially interface errors. This report describes an experiment to assess the defect-detection capabilities of static, intermodule type checking.</p> <p>The experiment uses ANSI C and Kernighan &amp; Ritchie (K&amp;R) C. The relevant difference is that the ANSI C compiler checks module interfaces (i.e., the parameter lists calls to external functions), whereas K&amp;R C does not. The experiment employs a counterbalanced design in which each of the 40 subjects, most of them CS PhD students, writes two nontrivial programs that interface with a complex library (Motif). Each subject writes one program in ANSI C and one in K&amp;R C. The input to each compiler run is saved and manually analyzed for defects.</p> <p>Results indicate that delivered ANSI C programs contain significantly fewer interface defects than delivered K&amp;R C programs. Furthermore, after subjects have gained some familiarity with the interface they are using, ANSI C programmers remove defects faster and are more productive (measured in both delivery time and functionality implemented)</p> </blockquote> <p><strong>Summary</strong></p> <p>The “nontrivial” tasks are the inversion of a 2x2 matrix (with GUI) and a file “browser” menu that has two options, select file and display file. Docs for motif were provided, but example code was deliberately left out.</p> <p>There are 34 subjects. Each subject solves one problem with the K&amp;R C compiler (which doesn't typecheck arguments) and one with the ANSI C compiler (which does).</p> <p>The authors note that the distribution of results is non-normal, with highly skewed outliers, but they present their results as box plots, which makes it impossible to see the distribution.
They do some statistical significance tests on various measures, and find no difference in time to completion on the first task, a significant difference on the second task, but no difference when the tasks are pooled.</p> <p><img src="images/empirical-pl/mysterious_boxplot.png" alt="ANSI C is better, except when it's worse"></p> <p>In terms of how the bugs are introduced during the programming process, they do a significance test against the median of one measure of defects (which finds a significant difference in the first task but not the second), and a significance test against the 75%-quantile of another measure (which finds a significant difference in the second task but not the first).</p> <p>In terms of how many and what sort of bugs are in the final program, they define a variety of measures and find that some differences on the measures are statistically significant and some aren't. In the table below, bolded values indicate statistically significant differences.</p> <p><img src="images/empirical-pl/ansi_kr_table.png" alt="Breakdown by various metrics"></p> <p>Note that here, first task refers to whichever task the subject happened to perform first, which is randomized, which makes the results seem rather arbitrary. Furthermore, the numbers they compare are medians (except where indicated otherwise), which also seems arbitrary.</p> <p>Despite the strong statement in the abstract, I'm not convinced this study presents strong evidence for anything in particular. They have <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons</a>, many of which seem arbitrary, and find that some of them are significant. They also find that many of their criteria don't have significant differences. Furthermore, they don't mention whether or not they tested any other arbitrary criteria. If they did, the results are much weaker than they look, and they already don't look strong.</p> <p>My interpretation of this is that, if there is an effect, the effect is dwarfed by the difference between programmers, and it's not clear whether there's any real effect at all.</p> <h3 id="an-empirical-comparison-of-c-c-java-perl-python-rexx-and-tcl-prechelt-l-http-page-mi-fu-berlin-de-prechelt-biblio-jccpprt-computer2000-pdf"><a href="http://page.mi.fu-berlin.de/prechelt/Biblio/jccpprt_computer2000.pdf">An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl, Prechelt, L.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>80 implementations of the same set of requirements are compared for several properties, such as run time, memory consumption, source text length, comment density, program structure, reliability, and the amount of effort required for writing them. The results indicate that, for the given programming problem, which regards string manipulation and search in a dictionary, “scripting languages” (Perl, Python, Rexx, Tcl) are more productive than “conventional languages” (C, C++, Java). In terms of run time and memory consumption, they often turn out better than Java and not much worse than C or C++. In general, the differences between languages tend to be smaller than the typical differences due to different programmers within the same language.</p> </blockquote> <p><strong>Summary</strong></p> <p>The task was to read in a list of phone numbers and return a list of words that those phone numbers could be converted to, using the letters on a phone keypad.</p> <p>This study was done in two phases. 
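<p>To make the task concrete, here's a minimal sketch of the core phone-keypad mapping (my own illustration; the actual study specified its own detailed encoding rules, allowed long numbers to map to sequences of words, and supplied a large dictionary):</p> <pre><code># Standard phone keypad letter groups; illustrative only.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_DIGIT = {letter: digit for digit, letters in KEYPAD.items() for letter in letters}

def word_to_digits(word):
    return "".join(LETTER_TO_DIGIT[c] for c in word.lower() if c.isalpha())

def words_for_number(number, dictionary):
    digits = "".join(c for c in number if c.isdigit())
    return [word for word in dictionary if word_to_digits(word) == digits]

print(words_for_number("228-", ["cat", "bat", "dog"]))  # ['cat', 'bat']
</code></pre>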
There was a controlled study for the C/C++/Java group, and a self-timed implementation for the Perl/Python/Rexx/Tcl group. The former group consisted of students while the latter group consisted of respondents from a newsgroup. The former group received more criteria they should consider during implementation, and had to implement the program when they received the problem description, whereas some people in the latter group read the problem description days or weeks before implementation.</p> <p><img src="images/empirical-pl/yellow_runtime.png" alt="C and C++ are fast"> <img src="images/empirical-pl/yellow_devtime.png" alt="C, C++, and Java are slow to write"></p> <p>If you take the results at face value, it looks like the class of language used imposes a lower bound on both implementation time and execution time, but that the variance between programmers is much larger than the variance between languages.</p> <p>However, since the scripting language group had a significantly different (and easier) environment than the C-like language group, it's hard to say how much of the measured difference in implementation time is from flaws in the experimental design and how much is real.</p> <h3 id="static-type-systems-sometimes-have-a-positive-impact-on-the-usability-of-undocumented-software-mayer-c-hanenberg-s-robbes-r-tanter-e-stefik-a-http-swp-dcc-uchile-cl-tr-2012-tr-dcc-20120418-005-pdf"><a href="http://swp.dcc.uchile.cl/TR/2012/TR_DCC-20120418-005.pdf">Static type systems (sometimes) have a positive impact on the usability of undocumented software; Mayer, C.; Hanenberg, S.; Robbes, R.; Tanter, E.; Stefik, A.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Static and dynamic type systems (as well as more recently gradual type systems) are an important research topic in programming language design. Although the study of such systems plays a major role in research, relatively little is known about the impact of type systems on software development. Perhaps one of the more common arguments for static type systems is that they require developers to annotate their code with type names, which is thus claimed to improve the documentation of software. In contrast, one common argument against static type systems is that they decrease flexibility, which may make them harder to use. While positions such as these, both for and against static type systems, have been documented in the literature, there is little rigorous empirical evidence for or against either position. In this paper, we introduce a controlled experiment where 27 subjects performed programming tasks on an undocumented API with a static type system (which required type annotations) as well as a dynamic type system (which does not). Our results show that for some types of tasks, programmers were afforded faster task completion times using a static type system, while for others, the opposite held. In this work, we document the empirical evidence that led us to this conclusion and conduct an exploratory study to try and theorize why.</p> </blockquote> <p><strong>Summary</strong></p> <p>The experimental setup is very similar to the previous Hanenberg paper, so I'll just describe the main difference, which is that subjects used either Java or a restricted subset of Groovy that was equivalent to dynamically typed Java.
Subjects were students who had previous experience in Java, but not Groovy, giving them some advantage on the Java tasks.</p> <p><img src="images/empirical-pl/groovy_v_java_2.png" alt="Groovy is both better and worse than Java"></p> <p>Task 1 was a trivial warm-up task. The authors note that it's possible that Java is superior on task 1 because the subjects had prior experience in Java. The authors speculate that, in general, Java is superior to the untyped Java subset for more complex tasks, but they make it clear that they're just speculating and don't have enough data to conclusively support that claim.</p> <h3 id="how-do-api-documentation-and-static-typing-affect-api-usability-endrikat-s-hanenberg-s-robbes-romain-stefik-a-http-users-dcc-uchile-cl-rrobbes-p-icse2014-docstypes-pdf"><a href="http://users.dcc.uchile.cl/~rrobbes/p/ICSE2014-docstypes.pdf">How Do API Documentation and Static Typing Affect API Usability? Endrikat, S.; Hanenberg, S.; Robbes, Romain; Stefik, A.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>When developers use Application Programming Interfaces (APIs), they often rely on documentation to assist their tasks. In previous studies, we reported evidence indicating that static type systems acted as a form of implicit documentation, benefiting developer productivity. Such implicit documentation is easier to maintain, given it is enforced by the compiler, but previous experiments tested users without any explicit documentation. In this paper, we report on a controlled experiment and an exploratory study comparing the impact of using documentation and a static or dynamic type system on a development task. Results of our study both confirm previous findings and show that the benefits of static typing are strengthened with explicit documentation, but that this was not as strongly felt with dynamically typed languages.</p> </blockquote> <p>There's an earlier study in this series with the following abstract:</p> <blockquote> <p>In the discussion about the usefulness of static or dynamic type systems there is often the statement that static type systems improve the documentation of software. In the meantime there exists even some empirical evidence for this statement. One of the possible explanations for this positive influence is that the static type system of programming languages such as Java require developers to write down the type names, i.e. lexical representations which potentially help developers. Because of that there is a plausible hypothesis that the main benefit comes from the type names and not from the static type checks that are based on these names. In order to argue for or against static type systems it is desirable to check this plausible hypothesis in an experimental way. This paper describes an experiment with 20 participants that has been performed in order to check whether developers using an unknown API already benefit (in terms of development time) from the pure syntactical representation of type names without static type checking. The result of the study is that developers do benefit from the type names in an API's source code. But already a single wrong type name has a measurable significant negative impact on the development time in comparison to APIs without type names.</p> </blockquote> <p>The languages used were Java and Dart. The university running the tests teaches in Java, so subjects had prior experience in Java.
The task was one “where participants use the API in a way that objects need to be configured and passed to the API”, which was chosen because the authors thought that both types and documentation should have some effect. “The challenge for developers is to locate all the API elements necessary to properly configure [an] object”. The documentation was free-form text plus examples.</p> <p><img src="images/empirical-pl/java_v_dart.png" alt="Documentation is only helpful with types and vice versa?"></p> <p>Taken at face value, it looks like types+documentation is a lot better than having one or the other, or neither. But since the subjects were students at a school that used Java, it's not clear how much of the effect is from familiarity with Java and how much is from the language itself. Moreover, the task was a single task that was chosen specifically because it was the kind of task where both types and documentation were expected to matter.</p> <h3 id="an-experiment-about-static-and-dynamic-type-systems-hanenberg-s-http-courses-cs-washington-edu-courses-cse590n-10au-hanenberg-oopsla2010-pdf"><a href="http://courses.cs.washington.edu/courses/cse590n/10au/hanenberg-oopsla2010.pdf">An Experiment About Static and Dynamic Type Systems; Hanenberg, S.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Although static type systems are an essential part in teaching and research in software engineering and computer science, there is hardly any knowledge about what the impact of static type systems on the development time or the resulting quality for a piece of software is. On the one hand there are authors that state that static type systems decrease an application's complexity and hence its development time (which means that the quality must be improved since developers have more time left in their projects). On the other hand there are authors that argue that static type systems increase development time (and hence decrease the code quality) since they restrict developers to express themselves in a desired way. This paper presents an empirical study with 49 subjects that studies the impact of a static type system for the development of a parser over 27 hours working time. In the experiments the existence of the static type system has neither a positive nor a negative impact on an application's development time (under the conditions of the experiment).</p> </blockquote> <p><strong>Summary</strong></p> <p>This is another Hanenberg study with a basically sound experimental design, so I won't go into details about the design. Some unique parts are that, in order to control for familiarity and other things that are difficult to control for with existing languages, the author created two custom languages for this study.</p> <p>The author says that the language has similarities to Smalltalk, Ruby, and Java, and that the language is a class-based OO language with single implementation inheritance and late binding.</p> <p>The students had 16 hours of training in the new language before starting. The author argues that this was sufficient because “the language, its API as well as its IDE was kept very simple”. An additional 2 hours was spent explaining the type system to the static types group.</p> <p>There were two tasks, a “small” one (implementing a scanner) and a “large” one (implementing a parser).
The author found a statistically significant difference in time to complete the small task (the dynamic language was faster) and no difference in the time to complete the large task.</p> <p>There are a number of reasons this result may not be generalizable. The author is aware of them and there's a long section on ways this study doesn't generalize, as well as a good discussion of threats to validity.</p> <h3 id="work-in-progress-an-empirical-study-of-static-typing-in-ruby-daly-m-sazawal-v-foster-j-http-www-cs-umd-edu-jfoster-papers-plateau09-ruby-pdf"><a href="http://www.cs.umd.edu/~jfoster/papers/plateau09-ruby.pdf">Work In Progress: an Empirical Study of Static Typing in Ruby; Daly, M; Sazawal, V; Foster, J.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>In this paper, we present an empirical pilot study of four skilled programmers as they develop programs in Ruby, a popular, dynamically typed, object-oriented scripting language. Our study compares programmer behavior under the standard Ruby interpreter versus using Diamondback Ruby (DRuby), which adds static type inference to Ruby. The aim of our study is to understand whether DRuby's static typing is beneficial to programmers. We found that DRuby's warnings rarely provided information about potential errors not already evident from Ruby's own error messages or from presumed prior knowledge. We hypothesize that programmers have ways of reasoning about types that compensate for the lack of static type information, possibly limiting DRuby's usefulness when used on small programs.</p> </blockquote> <p><strong>Summary</strong></p> <p>Subjects came from a local Ruby user's group. Subjects implemented a simplified Sudoku solver and a maze solver. DRuby was randomly selected for one of the two problems for each subject. There were four subjects, but the authors changed the protocol after the first subject. Only three subjects had the same setup.</p> <p>The authors find no benefit to having types. This is one of the studies that the first Hanenberg study mentions as work its findings contradict. That first paper claimed that it was because their tasks were more complex, but it seems to me that this paper has a more complex task. One possible reason they found contradictory results is that the effect size is small. Another is that the specific type systems used matter, and that a DRuby v. Ruby study doesn't generalize to Java v. Groovy. Another is that the previous study attempted to remove anything hinting at type information from the dynamic implementation, including names that indicate types and API documentation. The participants of this study mention that they get a lot of type information from API docs, and the authors note that the participants encode type information in their method names.</p> <p>This study was presented in a case study format, with selected comments from the participants and an analysis of their comments. The authors note that participants regularly think about types, and check types, even when programming in a dynamic language.</p> <h3 id="haskell-vs-ada-vs-c-vs-awk-vs-an-experiment-in-software-prototyping-productivity-hudak-p-jones-m-http-haskell-cs-yale-edu-post-type-publication-p-366"><a href="http://haskell.cs.yale.edu/?post_type=publication&amp;p=366">Haskell vs. Ada vs. C++ vs. Awk vs. ...
An Experiment in Software Prototyping Productivity; Hudak, P; Jones, M.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>We describe the results of an experiment in which several conventional programming languages, together with the functional language Haskell, were used to prototype a Naval Surface Warfare Center (NSWC) requirement for a Geometric Region Server. The resulting programs and development metrics were reviewed by a committee chosen by the Navy. The results indicate that the Haskell prototype took significantly less time to develop and was considerably more concise and easier to understand than the corresponding prototypes written in several different imperative languages, including Ada and C++.</p> </blockquote> <p><strong>Summary</strong></p> <p>Subjects were given an informal text description for the requirements of a geo server. The requirements were behavior-oriented and didn't mention performance. The subjects were “expert” programmers in the languages they used. They were asked to implement a prototype and track metrics such as dev time, lines of code, and docs. Metrics were all self-reported, and no guidelines were given as to how they should be measured, so metrics varied between subjects. Also, some, but not all, subjects attended a meeting where additional information was given on the assignment.</p> <p>Due to the time frame and funding constraints, the requirements for the server were extremely simple; the median implementation was a couple hundred lines of code. Furthermore, the panel that reviewed the solutions didn't have time to evaluate or run the code; they based their findings on the written reports and oral presentations of the subjects.</p> <p><img src="images/empirical-pl/hudak_table.png" alt="Table of LOC, dev time, and lines of code"></p> <p>This study hints at a very interesting result, but considering all of its limitations, the fact that each language (except Haskell) was only tested once, and that other studies show much larger intra-group variance than inter-group variance, it's hard to conclude much from this study alone.</p> <h3 id="unit-testing-isn-t-enough-you-need-static-typing-too-farrer-e-http-evanfarrer-blogspot-com-2012-06-unit-testing-isnt-enough-you-need-html"><a href="http://evanfarrer.blogspot.com/2012/06/unit-testing-isnt-enough-you-need.html">Unit testing isn't enough. You need static typing too; Farrer, E</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Unit testing and static type checking are tools for ensuring defect free software. Unit testing is the practice of writing code to test individual units of a piece of software. By validating each unit of software, defects can be discovered during development. Static type checking is performed by a type checker that automatically validates the correct typing of expressions and statements at compile time. By validating correct typing, many defects can be discovered during development. Static typing also limits the expressiveness of a programming language in that it will reject some programs which are ill-typed, but which are free of defects.</p> <p>Many proponents of unit testing claim that static type checking is an insufficient mechanism for ensuring defect free software; and therefore, unit testing is still required if static type checking is utilized.
They also assert that once unit testing is utilized, static type checking is no longer needed for defect detection, and so it should be eliminated.</p> <p>The goal of this research is to explore whether unit testing does in fact obviate static type checking in real world examples of unit tested software.</p> </blockquote> <p><strong>Summary</strong></p> <p>The author took four Python programs and translated them to Haskell. Haskell's type system found some bugs. Unlike most academic software engineering research, this study involves something larger than a toy program and looks at a type system that's more expressive than Java's type system. The programs were the NMEA Toolkit (9 bugs), MIDITUL (2 bugs), GrapeFruit (0 bugs), and PyFontInfo (6 bugs).</p> <p>As far as I can tell, there isn't an analysis of the severity of the bugs. The programs were 2324, 2253, 2390, and 609 lines long, respectively, so the bugs found / LOC were 17 / 7576 = 1 / 446. For reference, in Code Complete, Steve McConnell estimates that 15-50 bugs per 1kLOC is normal. If you believe that estimate applies to this codebase, you'd expect that this technique caught between 4% and 15% of the bugs in this code. There's no particular reason to believe the estimate should apply, but we can keep this number in mind as a reference in order to compare to a similarly generated number from another study that we'll get to later.</p> <p>The author does some analysis on how hard it would have been to find the bugs through testing, but only considers line-coverage-directed unit testing; the author comments that bugs might not have been caught by unit testing if they could be missed even with 100% line coverage. This seems artificially weak — it's generally well accepted that line coverage is a very weak notion of coverage and that testing merely to get high line coverage isn't sufficient. In fact, it is generally <a href="http://blog.regehr.org/archives/386">considered insufficient to even test merely to get high path coverage</a>, which is a much stronger notion of coverage than line coverage.</p> <h3 id="gradual-typing-of-erlang-programs-a-wrangler-experience-sagonas-k-luna-d-http-www-it-uu-se-research-group-hipe-dialyzer-publications-wrangler-pdf"><a href="http://www.it.uu.se/research/group/hipe/dialyzer/publications/wrangler.pdf">Gradual Typing of Erlang Programs: A Wrangler Experience; Sagonas, K; Luna, D</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Currently most Erlang programs contain no or very little type information. This sometimes makes them unreliable, hard to use, and difficult to understand and maintain. In this paper we describe our experiences from using static analysis tools to gradually add type information to a medium sized Erlang application that we did not write ourselves: the code base of Wrangler. We carefully document the approach we followed, the exact steps we took, and discuss possible difficulties that one is expected to deal with and the effort which is required in the process. We also show the type of software defects that are typically brought forward, the opportunities for code refactoring and improvement, and the expected benefits from embarking in such a project. We have chosen Wrangler for our experiment because the process is better explained on a code base which is small enough so that the interested reader can retrace its steps, yet large enough to make the experiment quite challenging and the experiences worth writing about.
However, we have also done something similar on large parts of Erlang/OTP. The result can partly be seen in the source code of Erlang/OTP R12B-3.</p> </blockquote> <p><strong>Summary</strong></p> <p>This is somewhat similar to the study in “Unit testing isn't enough”, except that the authors of this study created a static analysis tool instead of translating the program into another language. The authors note that they spent about half an hour finding and fixing bugs after running their tool. They also point out some bugs that would be difficult to find by testing. They explicitly state “what's interesting in our approach is that all these are achieved without imposing any (restrictive) static type system in the language.” The authors have a follow-on paper, “Static Detection of Race Conditions in Erlang”, which extends the approach.</p> <p>The list of papers that find bugs using static analysis without explicitly adding types is too long to include here. This is just one typical example.</p> <h3 id="0install-replacing-python-leonard-t-http-roscidus-com-blog-blog-2013-06-09-choosing-a-python-replacement-for-0install-pt2-http-roscidus-com-blog-blog-2013-06-20-replacing-python-round-2-pt3-http-roscidus-com-blog-blog-2014-02-13-ocaml-what-you-gain"><a href="http://roscidus.com/blog/blog/2013/06/09/choosing-a-python-replacement-for-0install/">0install: Replacing Python; Leonard, T.</a>, <a href="http://roscidus.com/blog/blog/2013/06/20/replacing-python-round-2/">pt2</a>, <a href="http://roscidus.com/blog/blog/2014/02/13/ocaml-what-you-gain/">pt3</a></h3> <p><strong>Abstract</strong></p> <p>No abstract because this is a series of blog posts.</p> <p><strong>Summary</strong></p> <p>This compares ATS, C#, Go, Haskell, OCaml, Python and Rust. The author assigns scores to various criteria, but it's really a qualitative comparison. Still, it's interesting reading because it seriously considers the effect of language on a non-trivial codebase (30kLOC).</p> <p>The author implemented parts of 0install in various languages and then eventually decided on OCaml and ported the entire thing to it. There are some great comments about why the author chose OCaml and what the author gained by using OCaml over Python.</p> <h3 id="verilog-vs-vhdl-design-competition-cooley-j-verilog-vs-vhdl"><a href="verilog-vs-vhdl/">Verilog vs. VHDL design competition; Cooley, J</a></h3> <p><strong>Abstract</strong></p> <p>No abstract because it's a usenet posting.</p> <p><strong>Summary</strong></p> <p>Subjects were given 90 minutes to create a small chunk of hardware, a synchronous loadable 9-bit increment-by-3 decrement-by-5 up/down counter that generated even parity, carry and borrow, with the goal of optimizing for cycle time of the synthesized result. For the software folks reading this, this is something you'd expect to be able to do in 90 minutes if nothing goes wrong, or maybe if only a few things go wrong.</p> <p>Subjects were judged purely by how optimized their result was, as long as it worked. Results that didn't pass all tests were disqualified. Although the task was quite simple, it was made substantially more complicated by the strict optimization goal. For any software readers out there, this task is approximately as complicated as implementing the same thing in assembly, where your assembler takes 15-30 minutes to assemble something.</p> <p>Subjects could use Verilog (unityped) or VHDL (typed).
9 people chose Verilog and 5 chose VHDL.</p> <p>During the experiment, there were a number of issues that made things easier or harder for some subjects. Overall, Verilog users were affected more negatively than VHDL users. The license server for the Verilog simulator crashed. Also, four of the five VHDL subjects were accidentally given six extra minutes. The author had manuals for the wrong logic family available, and one Verilog user spent 10 minutes reading the wrong manual before giving up and using his intuition. One of the Verilog users noted that they passed the wrong version of their code along to be tested and failed because of that. One of the VHDL users hit a bug in the VHDL simulator.</p> <p>Of the 9 Verilog users, 8 got something synthesized before the 90-minute deadline; of those, 5 had a design that passed all tests. None of the VHDL users were able to synthesize a circuit in time.</p> <p>Two of the VHDL users complained about issues with types: “I can't believe I got caught on a simple typing error. I used IEEE std_logic_arith, which requires use of unsigned &amp; signed subtypes, instead of std_logic_unsigned.”, and &quot;I ran into a problem with VHDL or VSS (I'm still not sure.) This case statement doesn't analyze: ‘subtype two_bits is unsigned(1 downto 0); case two_bits'(up &amp; down)...' But what worked was: ‘case two_bits'(up, down)...' Finally I solved this problem by assigning the concatenation first to a[n] auxiliary variable.&quot;</p> <h3 id="comparing-mathematical-provers-wiedijk-f-http-www-cs-ru-nl-f-wiedijk-comparison-diffs-pdf"><a href="http://www.cs.ru.nl/F.Wiedijk/comparison/diffs.pdf">Comparing mathematical provers; Wiedijk, F</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>We compare fifteen systems for the formalizations of mathematics with the computer. We present several tables that list various properties of these programs. The three main dimensions on which we compare these systems are: the size of their library, the strength of their logic and their level of automation.</p> </blockquote> <p><strong>Summary</strong></p> <p>The author compares the type systems and foundations of various theorem provers, and comments on their relative levels of proof automation.</p> <p><img src="images/empirical-pl/prover_types.png" alt="Type systems of provers"> <img src="images/empirical-pl/prover_foundations.png" alt="Foundations of provers"> <img src="images/empirical-pl/prover_graph.png" alt="Graph of automation level of provers"></p> <p>The author looked at one particular problem (proving the irrationality of the square root of two) and examined how different systems handle the problem, including the style of the proof and its length. There's a table of lengths, but it doesn't match the <a href="http://www.cs.ru.nl/~freek/comparison/">updated code examples provided here</a>. For instance, that table claims that the ACL2 proof is 206 lines long, but there's a <a href="http://www.cs.ru.nl/~freek/comparison/files/acl2.lisp">21-line ACL2 proof here</a>.</p> <p>The author has a number of criteria for determining how much automation each prover provides, but he freely admits that it's highly subjective. The author doesn't provide the exact rubric used for scoring, but he mentions that a more automated interaction style, user automation, powerful built-in automation, and the Poincare principle (basically whether the system lets you write programs to solve proofs algorithmically) all count towards being more automated, and more powerful logic (e.g., first-order v.
higher-order), a logical framework, dependent types, and the de Bruijn criterion (having a small guaranteed kernel) count towards being more mathematical.</p> <h3 id="do-programming-languages-affect-productivity-a-case-study-using-data-from-open-source-projects-delory-d-knutson-c-chun-s-http-sequoia-cs-byu-edu-lab-files-pubs-delorey2007a-pdf"><a href="http://sequoia.cs.byu.edu/lab/files/pubs/Delorey2007a.pdf">Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects; Delorey, D; Knutson, C; Chun, S</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Brooks and others long ago suggested that on average computer programmers write the same number of lines of code in a given amount of time regardless of the programming language used. We examine data collected from the CVS repositories of 9,999 open source projects hosted on SourceForge.net to test this assumption for 10 of the most popular programming languages in use in the open source community. We find that for 24 of the 45 pairwise comparisons, the programming language is a significant factor in determining the rate at which source code is written, even after accounting for variations between programmers and projects.</p> </blockquote> <p><strong>Summary</strong></p> <p>The authors say “our goal is not to construct a predictive or explanatory model. Rather, we seek only to develop a model that sufficiently accounts for the variation in our data so that we may test the significance of the estimated effect of programming language.” and that's what they do. They get some correlations, but it's hard to conclude much of anything from them.</p> <h3 id="the-unreasonable-effectiveness-of-dynamic-typing-for-practical-programs-smallshire-r-http-vimeo-com-74354480"><a href="http://vimeo.com/74354480">The Unreasonable Effectiveness of Dynamic Typing for Practical Programs; Smallshire, R</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Some programming language theorists would have us believe that the one true path to working systems lies in powerful and expressive type systems which allow us to encode rich constraints into programs at the time they are created. If these academic computer scientists would get out more, they would soon discover an increasing incidence of software developed in languages such as Python, Ruby and Clojure which use dynamic, albeit strong, type systems. They would probably be surprised to find that much of this software—in spite of their well-founded type-theoretic hubris—actually works, and is indeed reliable out of all proportion to their expectations. This talk—given by an experienced polyglot programmer who once implemented Hindley Milner static type inference for “fun”, but who now builds large and successful systems in Python—explores the disconnect between the dire outcomes predicted by advocates of static typing versus the near absence of type errors in real world systems built with dynamic languages: Does diligent unit testing more than make up for the lack of static typing? Does the nature of the type system have only a low-order effect on reliability compared to the functional or imperative programming paradigm in use? How often is the dynamism of the type system used anyway? How much type information can JITs exploit at runtime?
Does the unwarranted success of dynamically typed languages get up the nose of people who write Haskell?</p> </blockquote> <p><strong>Summary</strong></p> <p>The speaker used data from Github to determine that approximately 2.7% of Python bugs are type errors. Python's <code>TypeError</code>, <code>AttributeError</code>, and <code>NameError</code> were classified as type errors. The speaker rounded 2.7% down to 2% and claimed that 2% of errors were type-related. The speaker mentioned that on a commercial codebase he worked with, 1% of errors were type-related, but that could be rounded down from anything less than 2%. The speaker mentioned looking at the equivalent errors in Ruby, Clojure, and other dynamic languages, but didn't present any data on those other languages.</p> <p>This data might be good but it's impossible to tell because there isn't enough information about the methodology. Something this has going for it is that the number is in the right ballpark, compared to the made-up number we got when we compared the bug rate from Code Complete to the number of bugs found by Farrer. Possibly interesting, but thin.</p> <h3 id="summary-of-summaries">Summary of summaries</h3> <p>This isn't an exhaustive list. For example, I haven't covered “An Empirical Comparison of Static and Dynamic Type Systems on API Usage in the Presence of an IDE: Java vs. Groovy with Eclipse”, and “Do developers benefit from generic types?: an empirical comparison of generic and raw types in java” because they didn't seem to add much to what we've already seen.</p> <p>I didn't cover a number of older studies that are in the related work section of almost all the listed studies both because the older studies often cover points that aren't really up for debate anymore and also because the experimental design in a lot of those older papers leaves something to be desired. Feel free to <a href="https://twitter.com/danluu">ping me</a> if there's something you think should be added to the list.</p> <p>Not only is this list not exhaustive, it's not objective and unbiased. If you read the studies, you can get a pretty good handle on how the studies are biased. However, I can't provide enough information for you to decide for yourself how the studies are biased without reproducing most of the text of the papers, so you're left with my interpretation of things, filtered through my own biases. That can't be helped, but I can at least explain my biases so you can discount my summaries appropriately.</p> <p>I like types. I find ML-like languages really pleasant to program in, and if I were king of the world, we'd all use F# as our default managed language. The situation with unmanaged languages is a bit messier. I certainly prefer C++ to C because std::unique_ptr and friends make C++ feel a lot safer than C. I suspect I might prefer Rust once it's more stable. But while I like languages with expressive type systems, I haven't noticed that they make me more productive or less bug-prone<sup class="footnote-ref" id="fnref:although-I-feel-a-lot-safer-http-blog-metaobject-com-2014-06-the-safyness-of-static-typing-html"><a rel="footnote" href="#fn:although-I-feel-a-lot-safer-http-blog-metaobject-com-2014-06-the-safyness-of-static-typing-html">0</a></sup>.</p> <p><a name="wat_summary"></a></p> <p>Now that you know what my biases are, let me give you my interpretation of the studies. Of the controlled experiments, only three show an effect large enough to have any practical significance.
the Prechelt study comparing C, C++, Java, Perl, Python, Rexx, and Tcl; the Endrikat study comparing Java and Dart; and Cooley's experiment with VHDL and Verilog. Unfortunately, they all have issues that make it hard to draw a really strong conclusion.</p> <p>In the Prechelt study, the populations were different between dynamic and typed languages, and the conditions for the tasks were also different. There was a follow-up study that illustrated the issue by inviting Lispers to come up with their own solutions to the problem, which involved comparing folks like <a href="http://wry.me/~darius/hacks/lisp-study.scm">Darius Bacon</a> to random undergrads. A follow-up to the follow-up literally involves <a href="http://norvig.com/java-lisp.html">comparing code from Peter Norvig</a> to code from random college students.</p> <p>In the Endrikat study, they specifically picked a task where they thought static typing would make a difference, and they drew their subjects from a population where everyone had taken classes using the statically typed language. They don't comment on whether or not students had experience in the dynamically typed language, but it seems safe to assume that most or all had less experience in the dynamically typed language.</p> <p>Cooley's experiment was one of the few that drew people from a non-student population, which is great. But, as with all of the other experiments, the task was a trivial toy task. While it seems damning that none of the VHDL (static language) participants were able to complete the task on time, it is extremely unusual to want to finish a hardware design in 1.5 hours anywhere outside of a school project. You might argue that a large task can be broken down into many smaller tasks, but a plausible counterargument is that there are fixed costs to using VHDL that can be amortized across many tasks.</p> <p>As for the rest of the experiments, the main takeaway I have from them is that, under the specific set of circumstances described in the studies, any effect, if it exists at all, is small.</p> <p>Moving on to the case studies, the two bug-finding case studies make for interesting reading, but they don't really make a case for or against types. One shows that transcribing Python programs to Haskell will find a non-zero number of bugs of unknown severity that might not be found through unit testing that's line-coverage oriented. The pair of Erlang papers shows that you can find some bugs that would be difficult to find through any sort of testing, some of which are severe, using static analysis.</p> <p>As a user, I find it convenient when my compiler gives me an error before I run separate static analysis tools, but that's minor, perhaps even smaller than the effect size of the controlled studies listed above.</p> <p>I found the 0install case study (that compared various languages to Python and eventually settled on Ocaml) to be one of the more interesting things I ran across, but it's the kind of subjective thing that everyone will interpret differently, which you can see by looking at how people have reacted to it.</p> <p>This fits with the impression I have (in my little corner of the world, ACL2, Isabelle/HOL, and PVS are the most commonly used provers, and it makes sense that people would prefer more automation when solving problems in industry), but that's also subjective.</p> <p>And then there are the studies that mine data from existing projects.
Unfortunately, I couldn't find anybody who did anything to determine causation (e.g., <a href="http://www.amazon.com/gp/product/0691120358/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0691120358&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=XGRCRGSCGXSGNMJG">find an appropriate instrumental variable</a>), so they just measure correlations. Some of the correlations are unexpected, but there isn't enough information to determine why. The lack of any causal instrument doesn't stop people like Ray et al. from making strong, unsupported claims.</p> <p>The only data-mining study that presents data that's potentially interesting without further exploration is Smallshire's review of Python bugs, but there isn't enough information on the methodology to figure out what his study really means, and it's not clear why he hinted at looking at data for other languages without presenting the data<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">2</a></sup>.</p> <p>Some notable omissions from the studies are comprehensive studies using experienced programmers, let alone studies that have large populations of &quot;good&quot; or &quot;bad&quot; programmers, looking at anything approaching a significant project (in places I've worked, a three-month project would be considered small, but that's multiple orders of magnitude larger than any project used in a controlled study), using &quot;modern&quot; statically typed languages, using gradual/optional typing, using modern mainstream IDEs (like VS and Eclipse), using modern radical IDEs (like LightTable), using old school editors (like Emacs and vim), doing maintenance on a non-trivial codebase, doing maintenance with anything resembling a realistic environment, doing maintenance on a codebase you're already familiar with, etc.</p> <p>If you look at the internet commentary on these studies, most of them are passed around to justify one viewpoint or another. The Prechelt study on dynamic vs. static, along with the follow-ups on Lisp, are perennial favorites of dynamic language advocates, and the GitHub mining study has recently become trendy among functional programmers.</p> <p><img src="images/empirical-pl/confirmation_bias.png" alt="A twitter thread! In the twitter thread, someone who actually read the study notes that other factors dominate language (which is explicitly pointed out by the authors of the study), which prompts someone else to respond 'Why don't you just read the paper? It explains all this. In the abstract, even.'"></p> <p>Other than cherry-picking studies to confirm a long-held position, the most common response I've heard to these sorts of studies is that the effect isn't quantifiable by a controlled experiment. However, I've yet to hear a specific reason that doesn't also apply to any other field that empirically measures human behavior. Compared to a lot of those fields, it's easy to run controlled experiments or do empirical studies. It's true that controlled studies only tell you something about a very limited set of circumstances, but the fix to that isn't to dismiss them, but to fund more studies. It's also true that it's tough to determine causation from ex-post empirical studies, but the solution isn't to ignore the data, but to do more sophisticated analysis.
For example, <a href="https://www.amazon.com/gp/product/0691120358/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0691120358&amp;linkId=8f8c46b4c33ea48f53edcc3691f57d5a">econometric methods</a> are often able to make a case for causation with data that's messier than the data we've looked at here.</p> <p>The next most common response is that their viewpoint is still valid because their specific language or use case isn't covered. Maybe, but if the strongest statement you can make for your position is that there's no empirical evidence against the position, that's not much of a position.</p> <p><small> If you've managed to read this entire thing without falling asleep, you might be interested in <a href="//danluu.com/everything-is-broken/">my opinion on tests</a>.</p> <h4 id="responses">Responses</h4> <p>Here are the responses I've gotten from people mentioned in this post. Robert Smallshire said &quot;Your review article is very good. Thanks for taking the time to put it together.&quot; On my comment about the F# &quot;mistake&quot; vs. trolling, his reply was &quot;Neither. That torque != energy is obviously solved by modeling quantities not dimensions. The point being that this modeling of quantities with types takes effort without necessarily delivering any value.&quot; Not having done much with units myself, I don't have an informed opinion on this, but my natural bias is to try to encode the information in types if at all possible.</p> <p>Bartosz Milewski said <a href="https://twitter.com/BartoszMilewski/status/532651581813325824">&quot;Guilty as charged!&quot;</a>. Wow. Much Respect. But notice that, as of this update, the correction has been retweeted 1/25th as often as the original tweet. People want to believe there's evidence their position is superior. People don't want to believe the evidence is murky, or even possibly against them. Misinformation people want to believe spreads faster than information people don't want to believe.</p> <p>In a related twitter conversation, Andreas Stefik said &quot;That is not true. It depends on which scientific question. Static vs. Dynamic is well studied.&quot;, &quot;Profound rebuttal. I had better retract my peer reviewed papers, given this new insight!&quot;, &quot;Take a look at the papers...&quot;, and &quot;This is a serious misrepresentation of our studies.&quot; I muted the guy since it didn't seem to be going anywhere, but it's possible there was a substantive response buried in some later tweet. It's pretty easy to take twitter comments out of context, so <a href="https://twitter.com/garybernhardt/status/531513870276231168">check out the thread yourself</a> if you're really curious.</p> <p>I have a lot of respect for the folks who do these experiments, which is, unfortunately, not mutual. But the really unfortunate thing is that some of the people who do these experiments think that static v. dynamic is something that is, at present, &quot;well studied&quot;. There are plenty of equally difficult-to-study subfields in the social sciences that have multiple orders of magnitude more research going on and are still considered open problems, yet at least some researchers already consider this to be well studied!</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Leah Hanson, Joe Wilder, Robert David Grant, Jakub Wilk, Rich Loveland, Eirenarch, Edward Knight, and Evan Farrer for comments/corrections/discussion.
</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:P">This was from a talk at Strange Loop this year. The author later clarified his statement with &quot;To me, this follows immediately (a technical term in logic meaning the same thing as “trivially”) from the Curry-Howard Isomorphism we discussed, and from our Types vs. Tests: An Epic Battle? presentation two years ago. If types are theorems (they are), and implementations are proofs (they are), and your SLA is a guarantee of certain behavior of your system (it is), then how can using technology that precludes forbidding undesirable behavior of your system before other people use it (dynamic typing) possibly be anything but unethical?&quot; <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:H">Just as an aside, I find the online responses to Smallshire's study to be pretty great. There are, of course, the usual responses about how his evidence is wrong and therefore static types are, in fact, beneficial because there's no evidence against them, and you don't need evidence for them because you can arrive at the proper conclusion using pure reason. The really interesting bit is that, at one point, Smallshire presents an example of an F# program that can't catch a certain class of bug via its type system, and the online response is basically that he's an idiot who should have written his program in a different way so that the type system should have caught the bug. I can't tell if Smallshire's bug was an honest mistake or masterful trolling. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> CLWB and PCOMMIT clwb-pcommit/ Wed, 05 Nov 2014 00:00:00 +0000 clwb-pcommit/ <p>The latest version of the Intel manual has a couple of <a href="https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf">new instructions for non-volatile storage, like SSDs</a>. What's that about?</p> <p>Before we look at the instructions in detail, let's take a look at the issues that exist with super-fast NVRAM. One problem is that next-generation storage technologies (PCM, 3D XPoint, etc.) will be fast enough that syscall and other OS overhead can be more expensive than the actual cost of the disk access<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. Another is the impedance mismatch between the x86 memory hierarchy and persistent memory. In both cases, it's basically an <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> problem, where one component has improved so much that other components have to improve to keep up.</p> <p></p> <p>There's a <a href="http://research.microsoft.com/pubs/198366/Asplos2012_MonetaD.pdf">good paper by Todor Mollov, Louis Eisner, Arup De, Joel Coburn, and Steven Swanson</a> on the first issue; I'm going to present one of their graphs below.</p> <p><img src="images/clwb-pcommit/Moneta_overhead.png" alt="OS and other overhead for NVRAM operations"></p> <p>Everything says “Moneta” because that's the name of their system (which is pretty cool, BTW; I recommend reading the paper to see how they did it). Their “baseline” case is significantly better than you'll get out of a stock system. They did a number of optimizations (e.g., bypassing Linux's IO scheduler and removing context switches where possible), which reduce latency by 62% over plain old Linux. Despite that, the hardware + DMA cost of the transaction (the white part of the bar) is dwarfed by the overhead.
Note that they consider the cost of the DMA to be part of the hardware overhead.</p> <p>They're able to bypass the OS entirely and reduce a lot of the overhead, but it's still true that the majority of the cost of a write is overhead.</p> <p><img src="images/clwb-pcommit/Moneta_speedup.png" alt="OS bypass speedup for NVRAM operations"></p> <p>Despite not being able to get rid of all of the overhead, they get pretty significant speedups, both on small microbenchmarks and real code. So that's one problem. The OS imposes a pretty large tax on I/O when your I/O device is really fast.</p> <p>Maybe you can bypass large parts of that problem by just mapping your NVRAM device to a region of memory and committing things to it as necessary. But that runs into another problem, which is the impedance mismatch between how caches interact with the NVRAM region if you want something like transactional semantics.</p> <p>This is described <a href="http://www.hpl.hp.com/techreports/2012/HPL-2012-236.pdf">in more detail in this report by Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm</a>. I'm going to borrow a couple of their figures, too.</p> <p><img src="images/clwb-pcommit/HP_hierarchy.png" alt="Rough memory hierarchy diagram"></p> <p>We've got this NVRAM region which is safe and persistent, but before the CPU can get to it, it has to go through multiple layers with varying ordering guarantees. They give the following example:</p> <blockquote> <p>Consider, for example, a common programming idiom where a persistent memory location N is allocated, initialized, and published by assigning the allocated address to a global persistent pointer p. If the assignment to the global pointer becomes visible in NVRAM before the initialization (presumably because the latter is cached and has not made its way to NVRAM) and the program crashes at that very point, a post-restart dereference of the persistent pointer will read uninitialized data. Assuming writeback (WB) caching mode, this can be avoided by inserting cache-line flushes for the freshly allocated persistent locations N before the assignment to the global persistent pointer p.</p> </blockquote> <p>Inserting <code>CLFLUSH</code> instructions all over the place works, but how much overhead is that?</p> <p><img src="images/clwb-pcommit/HP_synthetic.png" alt="Persistence overhead on reads and writes"></p> <p>The four memory types they look at (and the four that x86 supports) are writeback (WB), writethrough (WT), write combine (WC), and uncacheable (UC). WB is what you deal with under normal circumstances. Memory can be cached and it's written back whenever it's forced to be. WT allows memory to be cached, but writes have to be written straight through to memory, i.e., memory is kept up to date with the cache. UC simply can't be cached. WC is like UC, except that writes can be coalesced before being sent out to memory.</p> <p>The R, W, and RW benchmarks are just benchmarks of reading and writing memory. WB is clearly the best, by far (lower is better). If you want to get an intuitive feel for how much better WB is than the other policies, try booting an OS with anything but WB memory.</p> <p>I've had to do that on occasion because I used to work for a chip company, and when we first got the chip back, we often didn't know which bits we had to disable to work around bugs. The simplest way to make progress is often to disable caches entirely. That “works”, but even minimal OSes like DOS are noticeably slow to boot without WB memory.
My recollection is that Win 3.1 takes the better part of an hour, and that Win 95 is a multiple-hour process.</p> <p>The _b benchmarks force writes to be visible to memory. For the WB case, that involves an <code>MFENCE</code> followed by a <code>CLFLUSH</code>. WB with visibility constraints is significantly slower than the other alternatives. It's multiple orders of magnitude slower than WB when writes don't have to be ordered and flushed.</p> <p>They also run benchmarks on some real data structures, with the constraint that data should be persistently visible.</p> <p><img src="images/clwb-pcommit/HP_data_structures.png" alt="Persistence overhead on data structure operations"></p> <p>With that constraint, regular WB memory can be terribly slow: within a factor of 2 of the performance of running without caches at all. And that's just the overhead around getting out of the cache hierarchy -- that's true even if your persistent storage is infinitely fast.</p> <p>Now, let's look at how Intel decided to address this. There are two new instructions, <code>CLWB</code> and <code>PCOMMIT</code>.</p> <p><code>CLWB</code> acts like <code>CLFLUSH</code>, in that it forces the data to get written out to memory. However, it doesn't force the cache to throw away the data, which makes future reads and writes a lot faster. Also, <code>CLFLUSH</code> is only ordered with respect to <code>MFENCE</code>, but <code>CLWB</code> is also ordered with respect to <code>SFENCE</code>. Here's their description of <code>CLWB</code>:</p> <blockquote> <p>Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization (treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from the cache hierarchy. The source operand is a byte memory location.</p> <p>It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line).</p> <p>CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address.</p> <p>Executions of CLWB interact with executions of PCOMMIT. The PCOMMIT instruction operates on certain store-to-memory operations that have been accepted to memory.
CLWB executed for the same cache line as an older store causes the store to become accepted to memory when the CLWB execution becomes globally visible.</p> </blockquote> <p><code>PCOMMIT</code> is applied to entire memory ranges and ensures that everything in the memory range is committed to persistent storage. Here's their description of <code>PCOMMIT</code>:</p> <blockquote> <p>The PCOMMIT instruction causes certain store-to-memory operations to persistent memory ranges to become persistent (power failure protected). Specifically, PCOMMIT applies to those stores that have been accepted to memory.</p> <p>While all store-to-memory operations are eventually accepted to memory, the following items specify the actions software can take to ensure that they are accepted:</p> <p>Non-temporal stores to write-back (WB) memory and all stores to uncacheable (UC), write-combining (WC), and write-through (WT) memory are accepted to memory as soon as they are globally visible. If, after an ordinary store to write-back (WB) memory becomes globally visible, CLFLUSH, CLFLUSHOPT, or CLWB is executed for the same cache line as the store, the store is accepted to memory when the CLFLUSH, CLFLUSHOPT or CLWB execution itself becomes globally visible.</p> <p>If PCOMMIT is executed after a store to a persistent memory range is accepted to memory, the store becomes persistent when the PCOMMIT becomes globally visible. This implies that, if an execution of PCOMMIT is globally visible when a later store to persistent memory is executed, that store cannot become persistent before the stores to which the PCOMMIT applies.</p> <p>The following items detail the ordering between PCOMMIT and other operations:</p> <p>A logical processor does not ensure previous stores and executions of CLFLUSHOPT and CLWB (by that logical processor) are globally visible before commencing an execution of PCOMMIT. This implies that software must use appropriate fencing instruction (e.g., SFENCE) to ensure the previous stores-to-memory operations and CLFLUSHOPT and CLWB executions to persistent memory ranges are globally visible (so that they are accepted to memory), before executing PCOMMIT.</p> <p>A logical processor does not ensure that an execution of PCOMMIT is globally visible before commencing subsequent stores. Software that requires that such stores not become globally visible before PCOMMIT (e.g., because the younger stores must not become persistent before those committed by PCOMMIT) can ensure by using an appropriate fencing instruction (e.g., SFENCE) between PCOMMIT and the later stores.</p> <p>An execution of PCOMMIT is ordered with respect to executions of SFENCE, MFENCE, XCHG or LOCK-prefixed instructions, and serializing instructions (e.g., CPUID).</p> <p>Executions of PCOMMIT are not ordered with respect to load operations. Software can use MFENCE to order loads with PCOMMIT.</p> <p>Executions of PCOMMIT do not serialize the instruction stream.</p> </blockquote> <p>How much <code>CLWB</code> and <code>PCOMMIT</code> actually improve performance will be up to their implementations. It will be interesting to benchmark these and see how they do. In any case, this is an attempt to solve the WB/NVRAM impedance mismatch issue.
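</p> <p>To make the intended usage concrete, here's a minimal sketch of how a single store to a persistent memory range might be made durable with these instructions. This is just my reading of the manual text quoted above, not Intel sample code; the function name is made up, and it assumes a GCC-style compiler and an assembler that accepts the <code>clwb</code> and <code>pcommit</code> mnemonics.</p> <pre><code>#include &lt;stdint.h&gt;

// Hypothetical sketch: make one 64-bit store to WB-mapped persistent memory durable.
static void persist_store(uint64_t* p, uint64_t val) {
  *p = val;                                   // ordinary store; may sit in the cache
  __asm__ volatile(&quot;clwb %0&quot; : &quot;+m&quot; (*p));    // write the dirty line back; the line may stay cached
  __asm__ volatile(&quot;sfence&quot; ::: &quot;memory&quot;);    // make the store + clwb globally visible (accepted to memory)
  __asm__ volatile(&quot;pcommit&quot; ::: &quot;memory&quot;);   // commit accepted stores to the persistence domain
  __asm__ volatile(&quot;sfence&quot; ::: &quot;memory&quot;);    // keep younger stores from becoming visible before the pcommit
}
</code></pre> <p>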
It doesn't directly address the OS overhead issue, but that can, to a large extent, be worked around without extra hardware.</p> <p><strong>If you liked this post, you'll probably also enjoy <a href="//danluu.com/intel-cat/">reading about cache partitioning in Broadwell and newer Intel server parts</a></strong>.</p> <p><small>Thanks to Eric Bron for spotting this in the manual and pointing it out, and to Leah Hanson, Nate Rowe, and 'unwind' for finding typos.</small></p> <p><small>If you haven't had enough of papers, Zvonimir Bandic pointed out <a href="https://www.usenix.org/conference/fast14/technical-sessions/presentation/vucinic">a paper by Dejan Vučinić, Qingbo Wang, Cyril Guyot, Robert Mateescu, Filip Blagojević, Luiz Franca-Neto, Damien Le Moal, Trevor Bunker, Jian Xu, and Steven Swanson on getting 1.4 us latency and 700k IOPS out of a type of NVRAM</a>.</p> <p>If you liked this post, you might also like <a href="//danluu.com/new-cpu-features/">this related post on &quot;new&quot; CPU features</a>. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">This should sound familiar to HPC and HFT folks with InfiniBand networks. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Caches: LRU v. random 2choices-eviction/ Mon, 03 Nov 2014 00:00:00 +0000 2choices-eviction/ <p>Once upon a time, my <a href="http://pages.cs.wisc.edu/~sohi/">computer architecture professor</a> mentioned that using a random eviction policy for caches really isn't so bad. The fact that random eviction isn't bad can be surprising — if your cache fills up and you have to get rid of something, choosing the <a href="http://www.mathcs.emory.edu/~cheung/Courses/355/Syllabus/9-virtual-mem/LRU-replace.html">least recently used (LRU)</a> is an obvious choice, since you're more likely to use something if you've used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it's going to cause a miss every time if the loop doesn't fit. A random eviction policy degrades gracefully as the loop gets too big.</p> <p>In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices (2-random) and just use LRU between those two choices?</p> <p>Here are the relative miss rates we get for SPEC CPU<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> with a Sandy Bridge-like cache (<a href="//danluu.com/3c-conflict/">8-way associative</a>, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.</p> <p></p> <table> <thead> <tr> <th>Policy</th> <th>L1 (64k)</th> <th>L2 (256k)</th> <th>L3 (2MB)</th> </tr> </thead> <tbody> <tr> <td>2-random</td> <td>0.91</td> <td>0.93</td> <td>0.95</td> </tr> <tr> <td>FIFO</td> <td>0.96</td> <td>0.97</td> <td>1.02</td> </tr> <tr> <td>LRU</td> <td>0.90</td> <td>0.90</td> <td>0.97</td> </tr> <tr> <td>random</td> <td>1.00</td> <td>1.00</td> <td>1.00</td> </tr> </tbody> </table> <p>Random and FIFO are both strictly worse than either LRU or 2-random. LRU and 2-random are pretty close, with LRU edging out 2-random for the smaller caches and 2-random edging out LRU for the larger caches.</p> <p>To see if anything odd is going on in any individual benchmark, we can look at the raw results on each sub-benchmark.
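</p> <p>Before digging into the per-benchmark results, here's a minimal sketch of what 2-random eviction means for a single set. This is just an illustration I wrote for this post (the numbers here come from dinero IV, not from this code), and it assumes we keep a last-access timestamp per line, i.e., at least as much bookkeeping as LRU needs:</p> <pre><code>#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

// Hypothetical sketch: pick a victim way in one set under 2-random eviction.
// timestamp[w] is the last-access time of way w. Plain random eviction would
// just return rand() % ways; 2-random picks two random candidates and evicts
// the least recently used of the two.
static int victim_2random(const uint64_t* timestamp, int ways) {
  int a = rand() % ways;
  int b = rand() % ways;
  return (timestamp[a] &lt;= timestamp[b]) ? a : b;  // older timestamp = less recently used
}
</code></pre> <p>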
The L1, L2, and L3 miss rates are all plotted in the same column for each benchmark, below:</p> <p><img src="images/2choices-eviction/sandy-bridge-miss.png" alt="Cache miss rates for Sandy Bridge-like cache"></p> <p>As we might expect, LRU does worse than 2-random when the miss rates are high, and better when the miss rates are low.</p> <p>At this point, it's not clear if 2-random is beating LRU in L3 cache miss rates because it does better when the caches are large or because it's the third level in a hierarchical cache. Since a cache line that's being actively used in L1 or L2 isn't touched in L3, an eviction can happen from the L3 (which forces an eviction from both the L1 and L2) since, as far as the L3 is concerned, that line hasn't been used recently. This makes it less obvious that LRU is a good eviction policy for L3 cache.</p> <p>To separate out the effects, let's look at the relative miss rates for non-hierarchical (single level) vs. hierarchical caches at various sizes<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. For the hierarchical cache, the L1 and L2 sizes are as above, 64k and 256k, and only the L3 cache size varies. Below, we've got the geometric means of the ratios<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">3</a></sup> of how each policy does (over all SPEC sub-benchmarks, compared to random eviction). A possible downside to this metric is that if we have some very low miss rates, those could dominate the mean since small fluctuations will have a large effect on the ratio, but we can look at the distribution of results to see if that's the case.</p> <p><img src="images/2choices-eviction/sweep-sizes-mean-ratios.png" alt="Cache miss ratios for cache sizes between 64K and 16M"></p> <p><img src="images/2choices-eviction/sweep-sizes-mean-ratios-l3.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p>Sizes below 512k are missing for the hierarchical case because of the 256k L2 — we're using an inclusive L3 cache here, so it doesn't really make sense to have an L3 that's smaller than the L2. Sizes above 16M are omitted because cache miss rates converge when the cache gets too big, which is uninteresting.</p> <p>Looking at the single cache case, it seems that LRU works a bit better than 2-random for smaller caches (lower miss ratio is better), while 2-random edges out LRU as the cache gets bigger.
The story is similar in the hierarchical case, except that we don't really look at the smaller cache sizes where LRU is superior.</p> <p>Comparing the two cases, the results are different, but similar enough that it looks like our original results weren't only an artifact of looking at the last level of a hierarchical cache.</p> <p>Below, we'll look at the entire distribution so we can see if the mean of the ratios is being skewed by tiny results.</p> <p><img src="images/2choices-eviction/sweep-sizes-miss.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p><img src="images/2choices-eviction/sweep-sizes-miss-l3.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p>It looks like, for a particular cache size (one column of the graph), the randomized algorithms do better when miss rates are relatively high and worse when miss rates are relatively low, so, if anything, they're disadvantaged when we just look at the geometric mean — if we were to take the arithmetic mean, the result would be dominated by the larger results, where 2 random choices and plain old random do relatively well<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>.</p> <p>From what we've seen of the mean ratios, 2-random looks fine for large caches, and from what we've seen of the distribution of the results, that's despite 2-random being penalized by the mean ratio metric, which makes it seem pretty good for large caches.</p> <p>However, it's common to implement pseudo-LRU policies because LRU can be too expensive to be workable. Since 2-random requires having at least as much information as LRU, let's take a look at what happens when we use pseudo 2-random (approximately 80% accurate), and pseudo 3-random (a two-level tournament, each level of which is approximately 80% accurate).</p> <p>Since random and FIFO are clearly not good replacement policies, I'll leave them out of the following graphs. Also, since the results were similar in the single cache as well as multi-level cache case, we can just look at the results from the more realistic multi-level cache case.</p> <p><img src="images/2choices-eviction/sweep-sizes-mean-ratios-l3-pseudo.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p>Since pseudo 2-random acts like random 20% of the time and 2-random 80% of the time, we might expect it to fall somewhere between 2-random and random, which is exactly what happens. A simple tweak to try to improve pseudo 2-random is to try pseudo 3-random (evict the least recently used of 3 random choices). While that's still not quite as good as true 2-random, it's pretty close, and it's still better than LRU (and pseudo LRU) for caches larger than 1M.</p> <p>The one big variable we haven't explored is the set associativity. To see how LRU compares with 2-random across different cache sizes and associativities, let's look at the LRU:2-random miss ratio (higher/red means LRU is better, lower/green means 2-random is better).</p> <p><img src="images/2choices-eviction/sweep-sizes-assocs-mean-ratios.png" alt="Cache miss ratios for cache sizes between 64K and 16M with associativities up to 64"></p> <p>On average, increasing associativity increases the difference between the two policies. As before, LRU is better for small caches and 2-random is better for large caches. Associativities of 1 and 2 aren't shown because they should be identical for both algorithms.</p> <p>There's still a combinatorial explosion of possibilities we haven't tried yet.
One thing to do is to try different eviction policies at different cache levels (LRU for L1 and L2 with 2-random for L3 seems promising). Another thing to do is to try this for different types of caches. I happened to choose CPU caches because it's easy to find simulators and benchmark traces, but in today's “put a cache on it” world, there are a lot of other places 2-random can be applied<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">5</a></sup>.</p> <p>For any comp arch folks, from this data, I suspect that 2-random doesn't keep up with adaptive policies like DIP (although it might — it's in the right ballpark, but it was characterized on a different workload using a different simulator, so it's not 100% clear). However, a pseudo 2-random policy can be implemented that barely uses more resources than pseudo-LRU policies, which makes this very cheap compared to DIP. Also, we can see that pseudo 3-random is substantially better than pseudo 2-random, which indicates that k-random is probably an improvement over 2-random for the right k. Some k-random policy might be an improvement over DIP.</p> <p>So we've seen that this works, but why would anyone think to do this in the first place? <a href="http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf">The Power of Two Random Choices: A Survey of Techniques and Results by Mitzenmacher, Richa, and Sitaraman</a> has a great explanation. The mathematical intuition is that if we (randomly) throw n balls into n bins, the maximum number of balls in any bin is <code>O(log n / log log n)</code> with high probability, which is pretty much just <code>O(log n)</code>. But if (instead of choosing randomly) we choose the least loaded of k random bins, the maximum is <code>O(log log n / log k)</code> with high probability, i.e., even with two random choices, it's basically <code>O(log log n)</code> and each additional choice only reduces the load by a constant factor.</p> <p>This turns out to have all sorts of applications; things like <a href="http://brooker.co.za/blog/2012/01/17/two-random.html">load balancing</a> and hash distribution are natural fits for the balls and bins model. There are also a lot of applications that aren't obviously analogous to the balls and bins model, like <a href="http://citeseerx.ist.psu.edu/showciting?cid=1042246">circuit routing</a> and <a href="http://www.math.cmu.edu/math/mathcolloquium.php?SeminarSelect=522">Erdős–Rényi</a> graphs.</p> <p><small>Thanks to Jan Edler and Mark Hill for making dinero IV freely available, to Aleksandar Milenkovic for providing SPEC CPU traces, and to Carl Vogel, James Porter, Peter Fraenkel, Katerina Barone-Adesi, Jesse Luehrs, Lea Albaugh, and Kevin Lynagh for advice on plots and plotting packages, to Mindy Preston for finding a typo in the acknowledgments, to Lindsey Kuper for pointing out some terminology stuff, to Tom Wenisch for suggesting that I check out CMP$im for future work, and to Leah Hanson for extensive comments on the entire post.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>Simulations were done with <a href="http://pages.cs.wisc.edu/~markhill/DineroIV/">dinero IV</a> with <a href="http://www.ece.uah.edu/~lacasa/sbc/sbc.html">SBC traces</a>. These were used because professors and grad students have gotten more protective of simulator code over the past couple decades, making it hard to find a modern open source simulator on GitHub.
However, dinero IV supports hierarchical caches with prefetching, so it should give a reasonable first-order approximation.</p> <p>Note that 175.vpr and 187.facerec weren't included in the traces, so they're missing from all results in this post.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">Sizes are limited by dinero IV, which requires cache sizes to be a power of 2. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:G"><p>Why consider the geometric mean of the ratios? We have different “base” miss rates for different benchmarks. For example, 181.mcf has a much higher miss rate than 252.eon. If we're trying to figure out which policy is best, those differences are just noise. Looking at the ratios removes that noise.</p> <p>And if we were just comparing those two, we'd like being 2x better on both to be equivalent to being 4x better on one and just 1x on the other, or 8x better on one and 1/2x “better” on the other. Since the geometric mean is the nth-root of the product of the results, it has that property.</p> <a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:C">We can see that 2-choices tends to be better than LRU for high miss rates by looking for the high up clusters of a green triangle, red square, empty diamond, and a blue circle, and seeing that it's usually the case that the green triangle is above the red square. It's too cluttered to really tell what's going on at the lower miss rates. I admit I cheated and looked at some zoomed in plots. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:H">If you know of a cache simulator for some other domain that I can use, please <a href="https://twitter.com/danluu">let me know</a>! <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> Testing v. informal reasoning tests-v-reason/ Mon, 03 Nov 2014 00:00:00 +0000 tests-v-reason/ <p>This is an off-the-cuff comment for Hacker School's Paper of the Week Read Along series for <a href="https://github.com/papers-we-love/papers-we-love/raw/master/design/out-of-the-tar-pit.pdf">Out of the Tar Pit</a>.</p> <p></p> <p>I find the idea itself, which is presented in sections 7-10, at the end of the paper, pretty interesting. However, I have some objections to the motivation for the idea, which makes up the first 60% of the paper.</p> <p>Rather than do one of those blow-by-blow rebuttals that's so common on blogs, I'll limit my comments to one widely circulated idea that I believe is not only mistaken but actively harmful.</p> <p>There's a claim that “informal reasoning” is more important than “testing”<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>, based mostly on the strength of this quote from Dijkstra:</p> <blockquote> <p>testing is hopelessly inadequate....(it) can be used very effectively to show the presence of bugs but never to show their absence.</p> </blockquote> <p>They go on to make a number of related claims, like “The key problem is that a test (of any kind) on a system or component that is in one particular state tells you nothing at all about the behavior of that system or component when it happens to be in another state.”, with the conclusion that stateless simplicity is the only possible fix. 
Needless to say, they assume that simplicity is actually possible.</p> <p>I actually agree with the bit about testing -- there's no way to avoid bugs if you create a system that's too complex to formally verify.</p> <p>However, there are plenty of real systems with too much irreducible complexity to make simple. Drawing from my own experience, no human can possibly hope to understand a modern high-performance CPU well enough to informally reason about its correctness. That's not only true now, it's been true for decades. It becomes true the moment someone introduces any sort of speculative execution or caching. These things are inherently stateful and complicated. They're so complicated that the only way to model performance (in order to run experiments to design high performance chips) is to simulate precisely what will happen, since the exact results are too complex for humans to reason about and too messy to be mathematically tractable. It's possible to make a simple CPU, but not one that's fast and simple. This doesn't only apply to CPUs -- performance complexity <a href="http://duartes.org/gustavo/blog/post/performance-is-a-science/">leaks all the way up the stack</a>.</p> <p>And it's not only high performance hardware and software that's complex. Some domains are just really complicated. The tax code is <a href="http://finance.townhall.com/columnists/politicalcalculations/2014/04/13/2014-how-many-pages-in-the-us-tax-code-n1823832">73k pages long</a>. It's just not possible to reason effectively about something that complicated, and there are plenty of things that are that complicated.</p> <p>And then there's the fact that we're human. We make mistakes. Euclid's Elements contains a bug in the very first theorem. Andrew Gelman likes to use <a href="http://andrewgelman.com/2014/12/25/common-sense-statistics/">this example</a> of an &quot;obviously&quot; bogus published probability result (but not obvious to the authors or the peer reviewers). One of the famous Intel CPU bugs allegedly comes from not testing something because they &quot;knew&quot; it was correct. No matter how smart or knowledgeable, humans are incapable of reasoning correctly all of the time.</p> <p>So what do you do? You write tests! They're necessary for anything above a certain level of complexity. The argument the authors make is that they're not sufficient because the state space is huge and a test of one state tells you literally nothing about the behavior in any other state.</p> <p>That's true if you look at your system as some kind of unknowable black box, but it turns out to be untrue in practice. There are plenty of <a href="http://researcher.watson.ibm.com/researcher/view_group.php?id=2987">unit testing tools</a> that will do state space reduction based on how similar inputs affect similar states, do symbolic execution, etc. This turns out to work pretty well.</p> <p>And even without resorting to formal methods, you can see this with plain old normal tests. John Regehr has noted that when <a href="http://embed.cs.utah.edu/csmith/">Csmith</a> finds a bug, test case reduction often finds a slew of other bugs. Turns out, tests often tell you something about nearby states.</p> <p>This is not just a theoretical argument. I did CPU design/verification/test for 7.5 years at a company that relied primarily on testing. In that time I can recall two bugs that were found by customers (as opposed to our testing). One was a manufacturing bug that has no software analogue.
The software equivalent would be that the software works for years and then, after lots of usage at high temperature, 1% of customers suddenly can't use their software anymore. Bad, but not a failure of anything analogous to software testing.</p> <p>The other bug was a legitimate logical bug (in the cache memory hierarchy, of course). It's pretty embarrassing that we shipped samples of a chip with a real bug to customers, but I think that most companies would be pretty happy with one logical bug in seven and a half years.</p> <p>Testing may not be sufficient to find all bugs, but it can be sufficient to achieve better reliability than pretty much any software company cares to achieve.</p> <p><small>Thanks (or perhaps anti-thanks) to David Albert for goading me into writing up this response and to Govert Versluis for catching a typo.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">These kinds of claims are always a bit odd to talk about. Like nature v. nurture, we clearly get bad results if we set either quantity to zero, and they interact in a way that makes it difficult to quantify the relative effect of non-zero quantities. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Assembly v. intrinsics assembly-intrinsics/ Sun, 19 Oct 2014 00:00:00 +0000 assembly-intrinsics/ <p>Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade-old claims that intrinsics can make your life easier, it never seems to work out.</p> <p>The last time I tried intrinsics was around 2007; for more on why they were hopeless then, <a href="http://virtualdub.org/blog/pivot/entry.php?id=162">see this exploration by the author of VirtualDub</a>. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.</p> <p></p> <p>For example, as of this writing, the first two Google hits for <code>popcnt benchmark</code> (and 2 out of the top 3 Bing hits) claim that Intel's hardware <code>popcnt</code> instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the <a href="https://en.wikipedia.org/wiki/SSSE3">SSSE3</a> <code>pshufb</code> instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the <code>popcnt</code> instruction is slow if they coded up a solution using intrinsics.</p> <p>One of the top search hits has sample code and benchmarks for both native <code>popcnt</code> as well as the software version using <code>pshufb</code>.
<a href="http://www.strchr.com/media/crc32_popcnt.zip">Their code</a> requires MSVC, which I don't have access to, but their first <code>popcnt</code> implementation just calls the <code>popcnt</code> intrinsic in a loop, which is fairly easy to reproduce in a form that gcc and clang will accept. Timing it is also pretty simple, since we're just timing a function (that happens to count the number of bits set in some fixed-size buffer).</p> <pre><code>uint32_t builtin_popcnt(const uint64_t* buf, int len) {
  int cnt = 0;
  for (int i = 0; i &lt; len; ++i) {
    cnt += __builtin_popcountll(buf[i]);
  }
  return cnt;
}
</code></pre> <p>This is slightly different from the code I linked to above, since they use the dword (32-bit) version of <code>popcnt</code>, and we're using the qword (64-bit) version. Since our version gets twice as much done per loop iteration, I'd expect our version to be faster than their version.</p> <p>Running <code>clang -O3 -mpopcnt -funroll-loops</code> produces a binary that we can examine. On Macs, we can use <code>otool -tv</code> to get the disassembly. On Linux, there's <code>objdump -d</code>.</p> <pre><code>_builtin_popcnt:
; address instruction
0000000100000b30 pushq %rbp
0000000100000b31 movq %rsp, %rbp
0000000100000b34 movq %rdi, -0x8(%rbp)
0000000100000b38 movl %esi, -0xc(%rbp)
0000000100000b3b movl $0x0, -0x10(%rbp)
0000000100000b42 movl $0x0, -0x14(%rbp)
0000000100000b49 movl -0x14(%rbp), %eax
0000000100000b4c cmpl -0xc(%rbp), %eax
0000000100000b4f jge 0x100000bd4
0000000100000b55 movslq -0x14(%rbp), %rax
0000000100000b59 movq -0x8(%rbp), %rcx
0000000100000b5d movq (%rcx,%rax,8), %rax
0000000100000b61 movq %rax, %rcx
0000000100000b64 shrq %rcx
0000000100000b67 movabsq $0x5555555555555555, %rdx
0000000100000b71 andq %rdx, %rcx
0000000100000b74 subq %rcx, %rax
0000000100000b77 movabsq $0x3333333333333333, %rcx
0000000100000b81 movq %rax, %rdx
0000000100000b84 andq %rcx, %rdx
0000000100000b87 shrq $0x2, %rax
0000000100000b8b andq %rcx, %rax
0000000100000b8e addq %rax, %rdx
0000000100000b91 movq %rdx, %rax
0000000100000b94 shrq $0x4, %rax
0000000100000b98 addq %rax, %rdx
0000000100000b9b movabsq $0xf0f0f0f0f0f0f0f, %rax
0000000100000ba5 andq %rax, %rdx
0000000100000ba8 movabsq $0x101010101010101, %rax
0000000100000bb2 imulq %rax, %rdx
0000000100000bb6 shrq $0x38, %rdx
0000000100000bba movl %edx, %esi
0000000100000bbc movl -0x10(%rbp), %edi
0000000100000bbf addl %esi, %edi
0000000100000bc1 movl %edi, -0x10(%rbp)
0000000100000bc4 movl -0x14(%rbp), %eax
0000000100000bc7 addl $0x1, %eax
0000000100000bcc movl %eax, -0x14(%rbp)
0000000100000bcf jmpq 0x100000b49
0000000100000bd4 movl -0x10(%rbp), %eax
0000000100000bd7 popq %rbp
0000000100000bd8 ret
</code></pre> <p>Well, that's interesting. Clang seems to be calculating things manually rather than using <code>popcnt</code>.
It seems to be using <a href="http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">the approach described here</a>, which is something like</p> <pre><code>x = x - ((x &gt;&gt; 0x1) &amp; 0x5555555555555555);
x = (x &amp; 0x3333333333333333) + ((x &gt;&gt; 0x2) &amp; 0x3333333333333333);
x = (x + (x &gt;&gt; 0x4)) &amp; 0xF0F0F0F0F0F0F0F;
ans = (x * 0x101010101010101) &gt;&gt; 0x38;
</code></pre> <p>That's not bad for a simple implementation that doesn't rely on any kind of specialized hardware, but that's going to take a lot longer than a single <code>popcnt</code> instruction.</p> <p>I've got a pretty old version of clang (3.0), so let me try this again after upgrading to 3.4, in case they added hardware <code>popcnt</code> support “recently”.</p> <pre><code>0000000100001340 pushq %rbp            ; save frame pointer
0000000100001341 movq %rsp, %rbp       ; new frame pointer
0000000100001344 xorl %ecx, %ecx       ; cnt = 0
0000000100001346 testl %esi, %esi
0000000100001348 jle 0x100001363
000000010000134a nopw (%rax,%rax)
0000000100001350 popcntq (%rdi), %rax  ; “eax” = popcnt[rdi]
0000000100001355 addl %ecx, %eax       ; eax += cnt
0000000100001357 addq $0x8, %rdi       ; increment address by 64-bits (8 bytes)
000000010000135b decl %esi             ; decrement loop counter; sets flags
000000010000135d movl %eax, %ecx       ; cnt = eax; does not set flags
000000010000135f jne 0x100001350       ; examine flags. if esi != 0, goto popcnt
0000000100001361 jmp 0x100001365       ; goto “restore frame pointer”
0000000100001363 movl %ecx, %eax
0000000100001365 popq %rbp             ; restore frame pointer
0000000100001366 ret
</code></pre> <p>That's better! We get a hardware <code>popcnt</code>! Let's compare this to the SSSE3 <code>pshufb</code> implementation <a href="http://www.strchr.com/crc32_popcnt">presented here</a> as the fastest way to do a popcnt. We'll use a table like the one in the link to show speed, except that we're going to show a rate, instead of the raw cycle count, so that the relative speed between different sizes is clear. The rate is GB/s, i.e., how many gigs of buffer we can process per second. We give the function data in chunks (varying from 1kb to 16Mb); each column is the rate for a different chunk-size. If we look at how fast each algorithm is for various buffer sizes, we get the following.</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> </tbody> </table> <p>That's not so great. Relative to the benchmark linked above, we're doing better because we're using 64-bit <code>popcnt</code> instead of 32-bit <code>popcnt</code>, but the PSHUFB version is still almost twice as fast<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup>.</p> <p>One odd thing is the way <code>cnt</code> gets accumulated. <code>cnt</code> is stored in <code>ecx</code>. But, instead of adding the result of the <code>popcnt</code> to <code>ecx</code>, clang has decided to add <code>ecx</code> to the result of the <code>popcnt</code>.
To fix that, clang then has to move that sum into <code>ecx</code> at the end of each loop iteration.</p> <p>The other noticeable problem is that we only get one <code>popcnt</code> per iteration of the loop, which means the loop isn't getting unrolled, and we're paying the entire cost of the loop overhead for each <code>popcnt</code>. Unrolling the loop can also let the CPU extract more instruction-level parallelism from the code, although that's a bit beyond the scope of this blog post.</p> <p>Using clang, the loop doesn't get unrolled even with <code>-O3 -funroll-loops</code>. Using gcc, we get a properly unrolled loop, but gcc has other problems, as we'll see later. For now, let's try unrolling the loop ourselves by calling <code>__builtin_popcountll</code> multiple times during each iteration of the loop. For simplicity, let's try doing four <code>popcnt</code> operations on each iteration. I don't claim that's optimal, but it should be an improvement.</p> <pre><code>uint32_t builtin_popcnt_unrolled(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  int cnt = 0;
  for (int i = 0; i &lt; len; i+=4) {
    cnt += __builtin_popcountll(buf[i]);
    cnt += __builtin_popcountll(buf[i+1]);
    cnt += __builtin_popcountll(buf[i+2]);
    cnt += __builtin_popcountll(buf[i+3]);
  }
  return cnt;
}
</code></pre> <p>The core of our loop now has</p> <pre><code>0000000100001390 popcntq (%rdi,%rcx,8), %rdx
0000000100001396 addl %eax, %edx
0000000100001398 popcntq 0x8(%rdi,%rcx,8), %rax
000000010000139f addl %edx, %eax
00000001000013a1 popcntq 0x10(%rdi,%rcx,8), %rdx
00000001000013a8 addl %eax, %edx
00000001000013aa popcntq 0x18(%rdi,%rcx,8), %rax
00000001000013b1 addl %edx, %eax
</code></pre> <p>with pretty much the same code surrounding the loop body. We're doing four <code>popcnt</code> operations every time through the loop, which results in the following performance:</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> <tr> <td>Unrolled</td> <td>12.5</td> <td>14.4</td> <td>15.0</td> <td>15.1</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> </tr> </tbody> </table> <p>Between using 64-bit <code>popcnt</code> and unrolling the loop, we've already beaten the allegedly faster <code>pshufb</code> code! But it's close enough that we might get different results with another compiler or some other chip. Let's see if we can do better.</p> <p>So, what's the deal with this <a href="http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance">popcnt false dependency bug</a> that's been getting a lot of publicity lately? Turns out, <code>popcnt</code> has a false dependency on its destination register, which means that even though the result of <code>popcnt</code> doesn't depend on its destination register, the CPU thinks that it does and will wait until the destination register is ready before starting the <code>popcnt</code> instruction.</p> <p>x86 typically has two-operand operations, e.g., <code>addl %eax, %edx</code> adds <code>eax</code> and <code>edx</code>, and then places the result in <code>edx</code>, so it's common for an operation to have a dependency on its output register.
In this case, there shouldn't be a dependency, since the result doesn't depend on the contents of the output register, but that's an easy bug to introduce, and a hard one to catch<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup>.</p> <p>In this particular case, <code>popcnt</code> has a 3-cycle latency, but it's pipelined such that a <code>popcnt</code> operation can execute each cycle. If we ignore other overhead, that means that a single <code>popcnt</code> will take 3 cycles, 2 will take 4 cycles, 3 will take 5 cycles, and n will take n+2 cycles, as long as the operations are independent. But, if the CPU incorrectly thinks there's a dependency between them, we effectively lose the ability to pipeline the instructions, and that n+2 turns into 3n.</p> <p>We can work around this by buying a CPU from AMD or VIA, or by putting the <code>popcnt</code> results in different registers. Let's make an array of destinations, which will let us put the result from each <code>popcnt</code> into a different place.</p> <pre><code>uint32_t builtin_popcnt_unrolled_errata(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  int cnt[4];
  for (int i = 0; i &lt; 4; ++i) {
    cnt[i] = 0;
  }
  for (int i = 0; i &lt; len; i+=4) {
    cnt[0] += __builtin_popcountll(buf[i]);
    cnt[1] += __builtin_popcountll(buf[i+1]);
    cnt[2] += __builtin_popcountll(buf[i+2]);
    cnt[3] += __builtin_popcountll(buf[i+3]);
  }
  return cnt[0] + cnt[1] + cnt[2] + cnt[3];
}
</code></pre> <p>And now we get</p> <pre><code>0000000100001420 popcntq (%rdi,%r9,8), %r8
0000000100001426 addl %ebx, %r8d
0000000100001429 popcntq 0x8(%rdi,%r9,8), %rax
0000000100001430 addl %r14d, %eax
0000000100001433 popcntq 0x10(%rdi,%r9,8), %rdx
000000010000143a addl %r11d, %edx
000000010000143d popcntq 0x18(%rdi,%r9,8), %rcx
</code></pre> <p>That's better -- we can see that the first popcnt outputs into <code>r8</code>, the second into <code>rax</code>, the third into <code>rdx</code>, and the fourth into <code>rcx</code>. However, this does the same odd accumulation as the original, where instead of adding the result of the <code>popcnt</code> to <code>cnt[i]</code>, it does the opposite, which necessitates moving the results back to <code>cnt[i]</code> afterwards.</p> <pre><code>000000010000133e movl %ecx, %r10d
0000000100001341 movl %edx, %r11d
0000000100001344 movl %eax, %r14d
0000000100001347 movl %r8d, %ebx
</code></pre> <p>Well, at least in clang (3.4).
Gcc (4.8.2) is too smart to fall for this separate destination thing and “optimizes” the code back to something like our original version.</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> <tr> <td>Unrolled</td> <td>12.5</td> <td>14.4</td> <td>15.0</td> <td>15.1</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> </tr> <tr> <td>Unrolled 2</td> <td>14.3</td> <td>16.3</td> <td>17.0</td> <td>17.2</td> <td>17.2</td> <td>17.0</td> <td>16.8</td> <td>16.7</td> </tr> </tbody> </table> <p>To get a version that works with both gcc and clang, and doesn't have these extra <code>mov</code>s, we'll have to write the assembly by hand<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup>:</p> <pre><code>uint32_t builtin_popcnt_unrolled_errata_manual(const uint64_t* buf, int len) { assert(len % 4 == 0); uint64_t cnt[4]; for (int i = 0; i &lt; 4; ++i) { cnt[i] = 0; } for (int i = 0; i &lt; len; i+=4) { __asm__( &quot;popcnt %4, %4 \n\t&quot; &quot;add %4, %0 \n\t&quot; &quot;popcnt %5, %5 \n\t&quot; &quot;add %5, %1 \n\t&quot; &quot;popcnt %6, %6 \n\t&quot; &quot;add %6, %2 \n\t&quot; &quot;popcnt %7, %7 \n\t&quot; &quot;add %7, %3 \n\t&quot; // +r means input/output, r means input : &quot;+r&quot; (cnt[0]), &quot;+r&quot; (cnt[1]), &quot;+r&quot; (cnt[2]), &quot;+r&quot; (cnt[3]) : &quot;r&quot; (buf[i]), &quot;r&quot; (buf[i+1]), &quot;r&quot; (buf[i+2]), &quot;r&quot; (buf[i+3])); } return cnt[0] + cnt[1] + cnt[2] + cnt[3]; } </code></pre> <p>This directly translates the assembly into the loop:</p> <pre><code>00000001000013c3 popcntq %r10, %r10 00000001000013c8 addq %r10, %rcx 00000001000013cb popcntq %r11, %r11 00000001000013d0 addq %r11, %r9 00000001000013d3 popcntq %r14, %r14 00000001000013d8 addq %r14, %r8 00000001000013db popcntq %rbx, %rbx </code></pre> <p>Great! The <code>add</code>s are now going the right direction, because we specified exactly what they should do.</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> <tr> <td>Unrolled</td> <td>12.5</td> <td>14.4</td> <td>15.0</td> <td>15.1</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> </tr> <tr> <td>Unrolled 2</td> <td>14.3</td> <td>16.3</td> <td>17.0</td> <td>17.2</td> <td>17.2</td> <td>17.0</td> <td>16.8</td> <td>16.7</td> </tr> <tr> <td>Assembly</td> <td>17.5</td> <td>23.7</td> <td>25.3</td> <td>25.3</td> <td>26.3</td> <td>26.3</td> <td>25.3</td> <td>24.3</td> </tr> </tbody> </table> <p>Finally! A version that blows away the <code>PSHUFB</code> implementation. How do we know this should be the final version? We can see from <a href="http://www.agner.org/optimize/instruction_tables.pdf">Agner's instruction tables</a> that we can execute, at most, one <code>popcnt</code> per cycle. 
I happen to have run this on a 3.4GHz Sandy Bridge, so we've got an upper bound of <code>8 bytes / cycle * 3.4 G cycles / sec = 27.2 GB/s</code>. That's pretty close to the <code>26.3 GB/s</code> we're actually getting, which is a sign that we can't make this much faster<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup>.</p> <p>In this case, the hand coded assembly version is about 3x faster than the original intrinsic loop (not counting the version from a version of clang that didn't emit a <code>popcnt</code>). It happens that, for the compiler we used, the unrolled loop using the <code>popcnt</code> intrinsic is a bit faster than the <code>pshufb</code> version, but that wasn't true of one of the two unrolled versions when I tried this with <code>gcc</code>.</p> <p>It's easy to see why someone might have benchmarked the same code and decided that <code>popcnt</code> isn't very fast. It's also easy to see why using intrinsics for performance critical code can be a huge time sink<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">5</a></sup>.</p> <p><small>Thanks to <a href="https://github.com/graue/">Scott</a> for some comments on the organization of this post, and to <a href="http://blog.leahhanson.us/">Leah</a> for extensive comments on just about everything</small></p> <p><strong>If you liked this, you'll probably enjoy <a href="//danluu.com/new-cpu-features/">this post about how CPUs have changed since the 80s</a>.</strong></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B"><a href="https://github.com/danluu/dump/blob/master/popcnt-speed-comparison/popcnt.c">see this</a> for the actual benchmarking code. On second thought, it's an embarrassingly terrible hack, and I'd prefer that you don't look. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:D">If it were the other way around, and the hardware didn't realize there was a dependency when there should be, that would be easy to catch -- any sequence of instructions that was dependent might produce an incorrect result. In this case, some sequences of instructions are just slower than they should be, which is not trivial to check for. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:A">This code is a simplified version of <a href="http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance">Alex Yee's stackoverflow answer about the popcnt false dependency bug</a> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:S"><p>That's not quite right, since the CPU has TurboBoost, but it's pretty close. Putting that aside, this example is pretty simple, but calculating this stuff by hand can get tedious for more complicated code. Luckily, the <a href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/">Intel Architecture Code Analyzer</a> can figure this stuff out for us. It finds the bottleneck in the code (assuming infinite memory bandwidth at zero latency), and displays how and why the processor is bottlenecked, which is usually enough to determine if there's room for more optimization.</p> <p>You might have noticed that the performance decreases as the buffer size becomes larger than our cache. 
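To see that drop-off for yourself, a rough timing harness along these lines will do (a sketch, not the actual benchmarking code linked in the first footnote; it uses the plain intrinsic loop so it compiles on its own, and any of the implementations above can be swapped in):</p> <pre><code>/* Rough sketch: gcc -O3 harness.c (add -lrt on older glibc). */
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;

/* Stand-in for the functions above; swap in any of them to compare. */
static uint32_t popcnt_buf(const uint64_t* buf, int len) {
  uint32_t cnt = 0;
  for (int i = 0; i &lt; len; ++i) {
    cnt += __builtin_popcountll(buf[i]);
  }
  return cnt;
}

int main(void) {
  /* Buffer sizes from 1k to 16M bytes, matching the tables above. */
  for (size_t bytes = 1024; bytes &lt;= 16 * 1024 * 1024; bytes *= 4) {
    int len = (int)(bytes / sizeof(uint64_t));
    uint64_t* buf = malloc(bytes);
    for (int i = 0; i &lt; len; ++i) {
      buf[i] = (uint64_t)i * 0x9E3779B97F4A7C15ULL;  /* arbitrary nonzero data */
    }

    /* Run enough passes over the buffer that timing overhead doesn't dominate. */
    int iters = (int)(256 * 1024 * 1024 / bytes);
    struct timespec start, end;
    uint32_t sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);
    for (int it = 0; it &lt; iters; ++it) {
      sink += popcnt_buf(buf, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &amp;end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf(&quot;%9zu bytes: %6.2f GB/s (checksum %u)\n&quot;,
           bytes, (double)bytes * iters / secs / 1e9, sink);
    free(buf);
  }
  return 0;
}
</code></pre> <p>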
It's possible to do a back of the envelope calculation to find the upper bound imposed by the limits of memory and cache performance, but working through the calculations would take a lot more space than this footnote has available to it. You can see a good example of how to do it for one simple case <a href="https://software.intel.com/en-us/forums/topic/480004">here</a>. The comments by Nathan Kurz and John McCalpin are particularly good.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:1"><p>In the course of running these benchmarks, I also noticed that <code>_mm_cvtsi128_si64</code> produces bizarrely bad code on gcc (although it's fine in clang). <code>_mm_cvtsi128_si64</code> is the intrinsic for moving an SSE (SIMD) register to a general purpose register (GPR). The compiler has a lot of latitude over whether or not a variable should live in a register or in memory. Clang realizes that it's probably faster to move the value from an SSE register to a GPR if the result is about to get used. Gcc decides to save a register and move the data from the SSE register to memory, and then have the next instruction operate on memory, if that's possible. In our <code>popcnt</code> example, clang loses about 2x for not unrolling the loop, and the rest comes from not being up to date on a CPU bug, which is understandable. It's hard to imagine why a compiler would do a register to memory move when it's about to operate on data unless it either doesn't do optimizations at all, or it has some bug which makes it unaware of the register to register version of the instruction. But at least it gets the right result, <a href="https://social.msdn.microsoft.com/Forums/en-US/b2e688e6-1d28-4cf0-9880-735e6838db6a/a-bug-in-vc-compiler?forum=vsprereleaseannouncements">unlike this version of MSVC</a>.</p> <p>icc and armcc are reputed to be better at dealing with intrinsics, but they're non-starters for most open source projects. Downloading icc's free non-commercial version has been disabled for the better part of a year, and even if it comes back, who's going to trust that it won't disappear again? As for armcc, I'm not sure it's ever had a free version?</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Google wage fixing, 11-CV-02509-LHK, ORDER DENYING PLAINTIFFS' MOTION FOR PRELIMINARY APPROVAL OF SETTLEMENTS WITH ADOBE, APPLE, GOOGLE, AND INTEL google-wage-fixing/ Thu, 14 Aug 2014 00:00:00 +0000 google-wage-fixing/ <pre><code>UNITED STATES DISTRICT COURT NORTHERN DISTRICT OF CALIFORNIA SAN JOSE DIVISION IN RE: HIGH-TECH EMPLOYEE ANTITRUST LITIGATION THIS DOCUMENT RELATES TO: ALL ACTIONS Case No.: 11-CV-02509-LHK </code></pre> <h3 id="order-denying-plaintiffs-motion-for-preliminary-approval-of-settlements-with-adobe-apple-google-and-intel">ORDER DENYING PLAINTIFFS' MOTION FOR PRELIMINARY APPROVAL OF SETTLEMENTS WITH ADOBE, APPLE, GOOGLE, AND INTEL</h3> <p>Before the Court is a Motion for Preliminary Approval of Class Action Settlement with Defendants Adobe Systems Inc. (&quot;Adobe&quot;), Apple Inc. (&quot;Apple&quot;), Google Inc. (&quot;Google&quot;), and Intel Corp. (&quot;Intel&quot;) (hereafter, &quot;Remaining Defendants&quot;) brought by three class representatives, Mark Fichtner, Siddharth Hariharan, and Daniel Stover (hereafter, &quot;Plaintiffs&quot;). See ECF No. 920. The Settlement provides for $324.5 million in recovery for the class in exchange for release of antitrust claims. 
A fourth class representative, Michael Devine (&quot;Devine&quot;), has filed an Opposition contending that the settlement amount is inadequate. See ECF No. 934. Plaintiffs have filed a Reply. See ECF No. 938. Plaintiffs, Remaining Defendants, and Devine appeared at a hearing on June 19, 2014. See ECF No. 940. In addition, a number of Class members have submitted letters in support of and in opposition to the proposed settlement. ECF Nos. 914, 949-51. The Court, having considered the briefing, the letters, the arguments presented at the hearing, and the record in this case, DENIES the Motion for Preliminary Approval for the reasons stated below.</p> <h3 id="i-background-and-procedural-history">I. BACKGROUND AND PROCEDURAL HISTORY</h3> <p>Michael Devine, Mark Fichtner, Siddharth Hariharan, and Daniel Stover, individually and on behalf of a class of all those similarly situated, allege antitrust claims against their former employers, Adobe, Apple, Google, Intel, Intuit Inc. (&quot;Intuit&quot;), Lucasfilm Ltd. (&quot;Lucasfilm&quot;), and Pixar (collectively, &quot;Defendants&quot;). Plaintiffs allege that Defendants entered into an overarching conspiracy through a series of bilateral agreements not to solicit each other's employees in violation of Section 1 of the Sherman Antitrust Act, 15 U.S.C. § 1, and Section 4 of the Clayton Antitrust Act, 15 U.S.C. § 15. Plaintiffs contend that the overarching conspiracy, made up of a series of six bilateral agreements (Pixar-Lucasfilm, Apple-Adobe, Apple-Google, Apple-Pixar, Google-Intuit, and Google-Intel) suppressed wages of Defendants' employees.</p> <p>The five cases underlying this consolidated action were initially filed in California Superior Court and removed to federal court. See ECF No. 532 at 5. The cases were related by Judge Saundra Brown Armstrong, who also granted a motion to transfer the related actions to the San Jose Division. See ECF Nos. 52, 58. After being assigned to the undersigned judge, the cases were consolidated pursuant to the parties' stipulation. See ECF No. 64. Plaintiffs filed a consolidated complaint on September 23, 2011, see ECF No. 65, which Defendants jointly moved to dismiss, see ECF No. 79. In addition, Lucasfilm filed a separate motion to dismiss on October 17, 2011. See ECF No. 83. The Court granted in part and denied in part the joint motion to dismiss and denied Lucasfilm's separate motion to dismiss. See ECF No. 119.</p> <p>On October 1, 2012, Plaintiffs filed a motion for class certification. See ECF No. 187. The motion sought certification of a class of all of the seven Defendants' employees or, in the alternative, a narrower class of just technical employees of the seven Defendants. After full briefing and a hearing, the Court denied class certification on April 5, 2013. See ECF No. 382. The Court was concerned that Plaintiffs' documentary evidence and empirical analysis were insufficient to determine that common questions predominated over individual questions with respect to the issue of antitrust impact. See id. at 33. Moreover, the Court expressed concern that there was insufficient analysis in the class certification motion regarding the class of technical employees. Id. at 29. The Court afforded Plaintiffs leave to amend to address the Court's concerns. See id. at 52.</p> <p>On May 10, 2013, Plaintiffs filed their amended class certification motion, seeking to certify only the narrower class of technical employees. See ECF No. 418. Defendants filed their opposition on June 21, 2013, ECF No. 
439, and Plaintiffs filed their reply on July 12, 2013, ECF No. 455. The hearing on the amended motion was set for August 5, 2013.</p> <p>On July 12 and 30, 2013, after class certification had been initially denied and while an amended motion was pending, Plaintiffs settled with Pixar, Lucasfilm, and Intuit (hereafter, &quot;Settled Defendants&quot;). See ECF Nos. 453, 489. Plaintiffs filed a motion for preliminary approval of the settlements with Settled Defendants on September 21, 2013. See ECF No. 501. No opposition to the motion was filed, and the Court granted the motion on October 30, 2013, following a hearing on October 21, 2013. See ECF No. 540. The Court held a fairness hearing on May 1, 2014, ECF No. 913, and granted final approval of the settlements and accompanying requests for attorneys' fees, costs, and incentive awards over five objections on May 16, 2014, ECF Nos. 915-16. Judgment was entered as to the Settled Defendants on June 20, 2014. ECF No. 947.</p> <p>After the Settled Defendants settled, this Court certified a class of technical employees of the seven Defendants (hereafter, &quot;the Class&quot;) on October 25, 2013 in an 86-page order granting Plaintiffs' amended class certification motion. See ECF No. 532. The Remaining Defendants petitioned the Ninth Circuit to review that order under Federal Rule of Civil Procedure 23(f). After full briefing, including the filing of an amicus brief by the National and California Chambers of Commerce and the National Association of Manufacturing urging the Ninth Circuit to grant review, the Ninth Circuit denied review on January 15, 2014. See ECF No. 594.</p> <p>Meanwhile, in this Court, the Remaining Defendants filed a total of five motions for summary judgment and filed motions to strike and to exclude the testimony of Plaintiffs' principal expert on antitrust impact and damages, Dr. Edward Leamer, who opined that the total damages to the Class exceeded $3 billion in wages Class members would have earned in the absence of the anti-solicitation agreements.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> The Court denied the motions for summary judgment on March 28, 2014, and on April 4, 2014, denied the motion to exclude Dr. Leamer and denied in large part the motion to strike Dr. Leamer's testimony. ECF Nos. 777, 788.</p> <p>On April 24, 2014, counsel for Plaintiffs and counsel for Remaining Defendants sent a joint letter to the Court indicating that they had reached a settlement. See ECF No. 900. This settlement was reached two weeks before the Final Pretrial Conference and one month before the trial was set to commence.<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> Upon receipt of the joint letter, the Court vacated the trial date and pretrial deadlines and set a schedule for preliminary approval. See ECF No. 904. Shortly after counsel sent the letter, the media disclosed the total amount of the settlement, and this Court received three letters from individuals, not including Devine, objecting to the proposed settlement in response to media reports of the settlement amount.<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup> See ECF No. 914. On May 22, 2014, in accordance with this Court's schedule, Plaintiffs filed their Motion for Preliminary Approval. See ECF No. 920. Devine filed an Opposition on June 5, 2014.<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup> See ECF No. 934. 
Plaintiffs filed a Reply on June 12, 2014. See ECF No. 938. The Court held a hearing on June 19, 2014. See ECF No. 948. After the hearing, the Court received a letter from a Class member in opposition to the proposed settlement and two letters from Class members in support of the proposed settlement. See ECF Nos. 949-51.</p> <h3 id="ii-legal-standard">II. LEGAL STANDARD</h3> <p>The Court must review the fairness of class action settlements under Federal Rule of Civil Procedure 23(e). The Rule states that &quot;[t]he claims, issues, or defenses of a certified class may be settled, voluntarily dismissed, or compromised only with the court's approval.&quot; The Rule requires the Court to &quot;direct notice in a reasonable manner to all class members who would be bound by the proposal&quot; and further states that if a settlement &quot;would bind class members, the court may approve it only after a hearing and on finding that it is fair, reasonable, and adequate.&quot; Fed. R. Civ. P. 23(e)(1)-(2). The principal purpose of the Court's supervision of class action settlements is to ensure &quot;the agreement is not the product of fraud or overreaching by, or collusion between, the negotiating parties.&quot; Officers for Justice v. Civil Serv. Comm'n of City &amp; Cnty. of S.F., 688 F.2d 615, 625 (9th Cir. 1982).</p> <p>District courts have interpreted Rule 23(e) to require a two-step process for the approval of class action settlements: &quot;the Court first determines whether a proposed class action settlement deserves preliminary approval and then, after notice is given to class members, whether final approval is warranted.&quot; Nat'l Rural Telecomms. Coop. v. DIRECTV, Inc., 221 F.R.D. 523, 525 (C.D. Cal. 2004). At the final approval stage, the Ninth Circuit has stated that &quot;[a]ssessing a settlement proposal requires the district court to balance a number of factors: the strength of the plaintiffs' case; the risk, expense, complexity, and likely duration of further litigation; the risk of maintaining class action status throughout the trial; the amount offered in settlement; the extent of discovery completed and the stage of the proceedings; the experience and views of counsel; the presence of a governmental participant; and the reaction of the class members to the proposed settlement.&quot; Hanlon v. Chrysler Corp., 150 F.3d 1011, 1026 (9th Cir. 1998).</p> <p>In contrast to these well-established, non-exhaustive factors for final approval, there is relatively scant appellate authority regarding the standard that a district court must apply in reviewing a settlement at the preliminary approval stage. Some district courts, echoing commentators, have stated that the relevant inquiry is whether the settlement &quot;falls within the range of possible approval&quot; or &quot;within the range of reasonableness.&quot; In re Tableware Antitrust Litig., 484 F. Supp. 2d 1078, 1079 (N.D. Cal. 2007); see also Cordy v. USS-Posco Indus., No. 12-553, 2013 WL 4028627, at *3 (N.D. Cal. Aug. 1, 2013) (&quot;Preliminary approval of a settlement and notice to the proposed class is appropriate if the proposed settlement appears to be the product of serious, informed, non-collusive negotiations, has no obvious deficiencies, does not improperly grant preferential treatment to class representatives or segments of the class, and falls with the range of possible approval.&quot; (internal quotation marks omitted)). 
To undertake this analysis, the Court &quot;must consider plaintiffs' expected recovery balanced against the value of the settlement offer.&quot; In re Nat'l Football League Players' Concussion Injury Litig., 961 F. Supp. 2d 708, 714 (E.D. Pa. 2014) (internal quotation marks omitted).</p> <h3 id="iii-discussion">III. DISCUSSION</h3> <p>Pursuant to the terms of the instant settlement, Class members who have not already opted out and who do not opt out will relinquish their rights to file suit against the Remaining Defendants for the claims at issue in this case. In exchange, Remaining Defendants will pay a total of $324.5 million, of which Plaintiffs' counsel may seek up to 25% (approximately $81 million) in attorneys' fees, $1.2 million in costs, and $80,000 per class representative in incentive payments. In addition, the settlement allows Remaining Defendants a pro rata reduction in the total amount they must pay if more than 4% of Class members opt out after receiving notice.<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">5</a></sup> Class members would receive an average of approximately $3,750<sup class="footnote-ref" id="fnref:6"><a rel="footnote" href="#fn:6">6</a></sup> from the instant settlement if the Court were to grant all requested deductions and there were no further opt-outs.<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">7</a></sup></p> <p>The Court finds the total settlement amount falls below the range of reasonableness. The Court is concerned that Class members recover less on a proportional basis from the instant settlement with Remaining Defendants than from the settlement with the Settled Defendants a year ago, despite the fact that the case has progressed consistently in the Class's favor since then. Counsel's sole explanation for this reduced figure is that there are weaknesses in Plaintiffs' case such that the Class faces a substantial risk of non-recovery. However, that risk existed and was even greater when Plaintiffs settled with the Settled Defendants a year ago, when class certification had been denied.</p> <p>The Court begins by comparing the instant settlement with Remaining Defendants to the settlements with the Settled Defendants, in light of the facts that existed at the time each settlement was reached. The Court then discusses the relative strengths and weaknesses of Plaintiffs' case to assess the reasonableness of the instant settlement.</p> <h2 id="a-comparison-to-the-initial-settlements">A. Comparison to the Initial Settlements</h2> <h3 id="1-comparing-the-settlement-amounts">1. Comparing the Settlement Amounts</h3> <p>The Court finds that the settlements with the Settled Defendants provide a useful benchmark against which to analyze the reasonableness of the instant settlement. The settlements with the Settled Defendants led to a fund totaling $20 million. See ECF No. 915 at 3. In approving the settlements, the Court relied upon the fact that the Settled Defendants employed 8% of Class members and paid out 5% of the total Class compensation during the Class period. See ECF No. 539 at 16:20-22 (Plaintiffs' counsel's explanation at the preliminary approval hearing with the Settled Defendants that the 5% figure &quot;giv[es] you a sense of how big a slice of the case this settlement is relative to the rest of the case&quot;). If Remaining Defendants were to settle at the same (or higher) rate as the Settled Defendants, Remaining Defendants' settlement fund would need to total at least $380 million. 
This number results from the fact that Remaining Defendants paid out 95% of the Class compensation during the Class period, while Settled Defendants paid only 5% of the Class compensation during the Class period.<sup class="footnote-ref" id="fnref:8"><a rel="footnote" href="#fn:8">8</a></sup></p> <p>At the hearing on the instant Motion, counsel for Remaining Defendants suggested that the relevant benchmark is not total Class compensation, but rather is total Class membership. This would result in a benchmark figure for the Remaining Defendants of $230 million (92 divided by 8 is 11.5; 11.5 times $20 million is $230 million). At a minimum, counsel suggested, the Court should compare the settlement amount to a range of $230 million to $380 million, within which the instant settlement falls. The Court rejects counsel's suggestion, which is contrary to the record. Counsel has provided no basis for why the number of Class members employed by each Defendant is a relevant metric. To the contrary, the relevant inquiry has always been total Class compensation. For example, in both of the settlements with the Settled Defendants and in the instant settlement, the Plans of Allocation call for determining each individual Class member's pay out by dividing the Class member's compensation during the Class period by the total Class compensation during the Class period. ECF No. 809 at 6 (noting that the denominator in the plan of allocation in the settlements with the Settled Defendants is the &quot;total of base salaries paid to all approved Claimants in class positions during the Class period&quot;); ECF No. 920 at 22 (same in the instant settlement); see also ECF No. 539 at 16:20-22 (Plaintiffs' counsel's statement that percent of the total Class compensation was relevant for benchmarking the settlements with the Settled Defendants to the rest of the case). At no point in the record has the percentage of Class membership employed by each Defendant ever been the relevant factor for determining damages exposure. Accordingly, the Court rejects the metric proposed by counsel for Remaining Defendants. Using the Settled Defendants' settlements as a yardstick, the appropriate benchmark settlement for the Remaining Defendants would be at least $380 million, more than $50 million greater than what the instant settlement provides.</p> <p>Counsel for Remaining Defendants also suggested that benchmarking against the initial settlements would be inappropriate because the magnitude of the settlement numbers for Remaining Defendants dwarfs the numbers at issue in the Settled Defendants' settlements. This argument is premised on the idea that Defendants who caused more damage to the Class and who benefited more by suppressing a greater portion of class compensation should have to pay less than Defendants who caused less damage and who benefited less from the allegedly wrongful conduct. This argument is unpersuasive. Remaining Defendants are alleged to have received 95% of the benefit of the anti-solicitation agreements and to have caused 95% of the harm suffered by the Class in terms of lost compensation. 
Therefore, Remaining Defendants should have to pay at least 95% of the damages, which, under the instant settlement, they would not.</p> <p>The Court also notes that had Plaintiffs prevailed at trial on their more than $3 billion damages claim, antitrust law provides for automatic trebling, see 15 U.S.C. § 15(a), so the total damages award could potentially have exceeded $9 billion. While the Ninth Circuit has not determined whether settlement amounts in antitrust cases must be compared to the single damages award requested by Plaintiffs or the automatically trebled damages amount, see Rodriguez v. W. Publ'g Corp., 563 F.3d 948, 964-65 (9th Cir. 2009), the instant settlement would lead to a total recovery of 11.29% of the single damages proposed by Plaintiffs' expert or 3.76% of the treble damages. Specifically, Dr. Leamer has calculated the total damages to the Class resulting from Defendants' allegedly unlawful conduct as $3.05 billion. See ECF No. 856-10. If the Court approves the instant settlements, the total settlements with all Defendants would be $344.5 million. This total would amount to 11.29% of the single damages that Dr. Leamer opines the Class suffered or 3.76% if Dr. Leamer's damages figure had been trebled.</p> <h3 id="2-relative-procedural-posture">2. Relative Procedural Posture</h3> <p>The discount that Remaining Defendants have received vis-a-vis the Settled Defendants is particularly troubling in light of the changes in the procedural posture of the case between the two settlements, changes that the Court would expect to have increased, rather than decreased, Plaintiffs' bargaining power. Specifically, at the time the Settled Defendants settled, Plaintiffs were at a particularly weak point in their case. Though Plaintiffs had survived Defendants' motion to dismiss, Plaintiffs' motion for class certification had been denied, albeit without prejudice. Plaintiffs had re-briefed the class certification motion, but had no class certification ruling in their favor at the time they settled with the Settled Defendants. If the Court ultimately granted certification, Plaintiffs also did not know whether the Ninth Circuit would grant Federal Rule of Civil Procedure 23(f) review and reverse the certification. Accordingly, at that point, Defendants had significant leverage.</p> <p>In contrast, the procedural posture of the case swung dramatically in Plaintiffs' favor after the initial settlements were reached. Specifically, the Court certified the Class over the vigorous objections of Defendants. In the 86-page order granting class certification, the Court repeatedly referred to Plaintiffs' evidence as &quot;substantial&quot; and &quot;extensive,&quot; and the Court stated that it &quot;could not identify a case at the class certification stage with the level of documentary evidence Plaintiffs have presented in the instant case.&quot; ECF No. 531 at 69. Thereafter, the Ninth Circuit denied Defendants' request to review the class certification order under Federal Rule of Civil Procedure 23(f). This Court also denied Defendants' five motions for summary judgment and denied Defendants' motion to exclude Plaintiffs' principal expert on antitrust impact and damages. 
The instant settlement was reached a mere two weeks before the final pretrial conference and one month before a trial at which damaging evidence regarding Defendants would have been presented.</p> <p>In sum, Plaintiffs were in a much stronger position at the time of the instant settlement—after the Class had been certified, appellate review of class certification had been denied, and Defendants' dispositive motions and motion to exclude Dr. Leamer's testimony had been denied—than they were at the time of the settlements with the Settled Defendants, when class certification had been denied. This shift in the procedural posture, which the Court would expect to have increased Plaintiffs' bargaining power, makes the more recent settlements for a proportionally lower amount even more troubling.</p> <h2 id="b-strength-of-plaintiffs-case">B. Strength of Plaintiffs' Case</h2> <p>The Court now turns to the strength of Plaintiffs' case against the Remaining Defendants to evaluate the reasonableness of the settlement.</p> <p>At the hearing on the instant Motion, Plaintiffs' counsel contended that one of the reasons the instant settlement was proportionally lower than the previous settlements is that the documentary evidence against the Settled Defendants (particularly, Lucasfilm and Pixar) is more compelling than the documentary evidence against the Remaining Defendants. As an initial matter, the Court notes that relevant evidence regarding the Settled Defendants would be admissible at a trial against Remaining Defendants because Plaintiffs allege an overarching conspiracy that included all Defendants. Accordingly, evidence regarding the role of Lucasfilm and Pixar in the creation of and the intended effect of the overarching conspiracy would be admissible.</p> <p>Nonetheless, the Court notes that Plaintiffs are correct that there are particularly clear statements from Lucasfilm and Pixar executives regarding the nature and goals of the alleged conspiracy. Specifically, Edward Catmull (Pixar President) conceded in his deposition that anti-solicitation agreements were in place because solicitation &quot;messes up the pay structure.&quot; ECF No. 431-9 at 81. Similarly, George Lucas (former Lucasfilm Chairman of the Board and CEO) stated, &quot;we cannot get into a bidding war with other companies because we don't have the margins for that sort of thing.&quot; ECF No. 749-23 at 9.</p> <p>However, there is equally compelling evidence that comes from the documents of the Remaining Defendants. This is particularly true for Google and Apple, the executives of which extensively discussed and enforced the anti-solicitation agreements. Specifically, as discussed in extensive detail in this Court's previous orders, Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple, Former CEO of Pixar), Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO), and Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google) were key players in creating and enforcing the anti-solicitation agreements. The Court now turns to the evidence against the Remaining Defendants that the finder of fact is likely to find compelling.</p> <h3 id="1-evidence-related-to-apple">1. Evidence Related to Apple</h3> <p>There is substantial and compelling evidence that Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple, Former CEO of Pixar) was a, if not the, central figure in the alleged conspiracy. 
Several witnesses, in their depositions, testified to Mr. Jobs' role in the anti-solicitation agreements. For example, Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) stated that Mr. Jobs &quot;believed that you should not be hiring each others', you know, technical people&quot; and that &quot;it was inappropriate in [Mr. Jobs'] view for us to be calling in and hiring people.&quot; ECF No. 819-12 at 77. Edward Catmull (Pixar President) stated that Mr. Jobs &quot;was very adamant about protecting his employee force.&quot; ECF No. 431-9 at 97. Sergey Brin (Google Co-Founder) testified that &quot;I think Mr. Jobs' view was that people shouldn't piss him off. And I think that things that pissed him off were—would be hiring, you know—whatever.&quot; ECF No. 639-1 at 112. There would thus be ample evidence Mr. Jobs was involved in expanding the original anti-solicitation agreement between Lucasfilm and Pixar to the other Defendants in this case. After the agreements were extended, Mr. Jobs played a central role in enforcing these agreements. Four particular sets of evidence are likely to be compelling to the fact-finder.</p> <p>First, after hearing that Google was trying to recruit employees from Apple's Safari team, Mr. Jobs threatened Mr. Brin, stating, as Mr. Brin recounted, &quot;if you hire a single one of these people that means war.&quot; ECF No. 833-15.<sup class="footnote-ref" id="fnref:10"><a rel="footnote" href="#fn:10">9</a></sup> In an email to Google's Executive Management Team as well as Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), Mr. Brin advised: &quot;lets [sic] not make any new offers or contact new people at Apple until we have had a chance to discuss.&quot; Id. Mr. Campbell then wrote to Mr. Jobs: &quot;Eric [Schmidt] told me that he got directly involved and firmly stopped all efforts to recruit anyone from Apple.&quot; ECF No. 746-5. As Mr. Brin testified in his deposition, &quot;Eric made a—you know, a—you know, at least some kind of—had a conversation with Bill to relate to Steve to calm him down.&quot; ECF No. 639-1 at 61. As Mr. Schmidt put it, &quot;Steve was unhappy, and Steve's unhappiness absolutely influenced the change we made in recruiting practice.&quot; ECF No. 819-12 at 21. Danielle Lambert (Apple's head of Human Resources) reciprocated to maintain Apple's end of the anti-solicitation agreements, instructing Apple recruiters: &quot;Please add Google to your 'hands-off list. We recently agreed not to recruit from one another so if you hear of any recruiting they are doing against us, please be sure to let me know.&quot; ECF No. 746-15.</p> <p>Second, other Defendants' CEOs maintained the anti-solicitation agreements out of fear of and deference to Mr. Jobs. For example, in 2005, when considering whether to enter into an anti-solicitation agreement with Apple, Bruce Chizen (former Adobe CEO), expressed concerns about the loss of &quot;top talent&quot; if Adobe did not enter into an anti-solicitation agreement with Apple, stating, &quot;if I tell Steve it's open season (other than senior managers), he will deliberately poach Adobe just to prove a point. Knowing Steve, he will go after some of our top Mac talent like Chris Cox and he will do it in a way in which they will be enticed to come (extraordinary packages and Steve wooing).&quot;<sup class="footnote-ref" id="fnref:11"><a rel="footnote" href="#fn:11">10</a></sup> ECF No. 
297-15.</p> <p>This was the genesis of the Apple-Adobe agreement. Specifically, after Mr. Jobs complained to Mr. Chizen on May 26, 2005 that Adobe was recruiting Apple employees, ECF No. 291-17, Mr. Chizen responded by saying, &quot;I thought we agreed not to recruit any senior level employees . . . . I would propose we keep it that way. Open to discuss. It would be good to agree.&quot; Id. Mr. Jobs was not satisfied, and replied by threatening to send Apple recruiters after Adobe's employees: &quot;OK, I'll tell our recruiters that they are free to approach any Adobe employee who is not a Sr. Director or VP. Am I understanding your position correctly?&quot; Id. Mr. Chizen immediately gave in: &quot;I'd rather agree NOT to actively solicit any employee from either company . . . . If you are in agreement I will let my folks know.&quot; Id. (emphasis in original). The next day, Theresa Townsley (Adobe Vice President Human Resources) announced to her recruiting team, &quot;Bruce and Steve Jobs have an agreement that we are not to solicit ANY Apple employees, and vice versa.&quot; ECF No. 291-18 (emphasis in original). Adobe then placed Apple on its &quot;[c]ompanies that are off limits&quot; list, which instructed Adobe employees not to cold call Apple employees. ECF No. 291-11.</p> <p>Google took even more drastic actions in response to Mr. Jobs. For example, when a recruiter from Google's engineering team contacted an Apple employee in 2007, Mr. Jobs forwarded the message to Mr. Schmidt and stated, &quot;I would be very pleased if your recruiting department would stop doing this.&quot; ECF No. 291-23. Google responded by making a &quot;public example&quot; out of the recruiter and &quot;terminat[ing] [the recruiter] within the hour.&quot; Id. The aim of this public spectacle was to &quot;(hopefully) prevent future occurrences.&quot; Id. Once the recruiter was terminated, Mr. Schmidt emailed Mr. Jobs, apologizing and informing Mr. Jobs that the recruiter had been terminated. Mr. Jobs forwarded Mr. Schmidt's email to an Apple human resources official and stated merely, &quot;:).&quot; ECF No. 746-9.</p> <p>A year prior to this termination, Google similarly took seriously Mr. Jobs' concerns. Specifically, in 2006, Mr. Jobs emailed Mr. Schmidt and said, &quot;I am told that Googles [sic] new cell phone software group is relentlessly recruiting in our iPod group. If this is indeed true, can you put a stop to it?&quot; ECF No. 291-24 at 3. After Mr. Schmidt forwarded this to Human Resources professionals at Google, Arnnon Geshuri (Google Recruiting Director) prepared a detailed report stating that an extensive investigation did not find a breach of the anti-solicitation agreement.</p> <p>Similarly, in 2006, Google scrapped plans to open a Google engineering center in Paris after a Google executive emailed Mr. Jobs to ask whether Google could hire three former Apple engineers to work at the prospective facility, and Mr. Jobs responded &quot;[w]e'd strongly prefer that you not hire these guys.&quot; ECF No. 814-2. The whole interaction began with Google's request to Steve Jobs for permission to hire Jean-Marie Hullot, an Apple engineer. The record is not clear whether Mr. Hullot was a current or former Apple employee. A Google executive contacted Steve Jobs to ask whether Google could make an offer to Mr. Hullot, and Mr. Jobs did not timely respond to the Google executive's request. 
At this point, the Google executive turned to Intuit's Board Chairman Bill Campbell as a potential ambassador from Google to Mr. Jobs. Specifically, the Google executive noted that Mr. Campbell &quot;is on the board at Apple and Google, so Steve will probably return his call.&quot; ECF No. 428-6. The same day that Mr. Campbell reached out to Mr. Jobs, Mr. Jobs responded to the Google executive, seeking more information on what exactly the Apple engineer would be working on. ECF No. 428-9. Once Mr. Jobs was satisfied, he stated that the hire &quot;would be fine with me.&quot; Id. However, two weeks later, when Mr. Hullot and a Google executive sought Mr. Jobs' permission to hire four of Mr. Hullot's former Apple colleagues (three were former Apple employees and one had given notice of impending departure from Apple), Mr. Jobs promptly responded, indicating that the hires would not be acceptable. ECF No. 428-9. Google promptly scrapped the plan, and the Google executive responded deferentially to Mr. Jobs, stating, &quot;Steve, Based on your strong preference that we not hire the ex-Apple engineers, Jean-Marie and I decided not to open a Google Paris engineering center.&quot; Id. The Google executive also forwarded the email thread to Mr. Brin, Larry Page (Google Co-Founder), and Mr. Campbell. Id.</p> <p>Third, Mr. Jobs attempted (unsuccessfully) to expand the anti-solicitation agreements to Palm, even threatening litigation. Specifically, Mr. Jobs called Edward Colligan (former President and CEO of Palm) to ask Mr. Colligan to enter into an anti-solicitation agreement and threatened patent litigation against Palm if Palm refused to do so. ECF No. 293 ¶¶ 6-8. Mr. Colligan responded via email, and told Mr. Jobs that Mr. Jobs' &quot;proposal that we agree that neither company will hire the other's employees, regardless of the individual's desires, is not only wrong, it is likely illegal.&quot; Id. at 4-5. Mr. Colligan went on to say that, &quot;We can't dictate where someone will work, nor should we try. I can't deny people who elect to pursue their livelihood at Palm the right to do so simply because they now work for Apple, and I wouldn't want you to do that to current Palm employees.&quot; Id. at 5. Finally, Mr. Colligan wrote that &quot;[t]hreatening Palm with a patent lawsuit in response to a decision by one employee to leave Apple is just out of line. A lawsuit would not serve either of our interests, and will not stop employees from migrating between our companies . . . . We will both just end up paying a lot of lawyers a lot of money.&quot; Id. at 5-6. Mr. Jobs wrote the following back to Mr. Colligan: &quot;This is not satisfactory to Apple.&quot; Id. at 8. Mr. Jobs went on to write that &quot;I'm sure you realize the asymmetry in the financial resources of our respective companies when you say: 'we will both just end up paying a lot of lawyers a lot of money.'&quot; Id. Mr. Jobs concluded: &quot;My advice is to take a look at our patent portfolio before you make a final decision here.&quot; Id.</p> <p>Fourth, Apple's documents provide strong support for Plaintiffs' theory of impact, namely that rigid wage structures and internal equity concerns would have led Defendants to engage in structural changes to compensation structures to mitigate the competitive threat that solicitation would have posed. 
Apple's compensation data shows that, for each year in the Class period, Apple had a &quot;job structure system,&quot; which included categorizing and compensating its workforce according to a discrete set of company-wide job levels assigned to all salaried employees and four associated sets of base salary ranges applicable to &quot;Top,&quot; &quot;Major,&quot; &quot;National,&quot; and &quot;Small&quot; geographic markets. ECF No. 745-7 at 14-15, 52-53; ECF No.517-16 ¶¶ 6, 10 &amp; Ex. B. Every salary range had a &quot;min,&quot; &quot;mid,&quot; and &quot;max&quot; figure. See id. Apple also created a Human Resources and recruiting tool called &quot;Merlin,&quot; which was an internal system for tracking employee records and performance, and required managers to grade employees at one of four pre-set levels. See ECF No. 749-6 at 142-43, 145-46; ECF No. 749-11 at 52-53; ECF No. 749-12 at 33. As explained by Tony Fadell (former Apple Senior Vice President, iPod Division, and advisor to Steve Jobs), Merlin &quot;would say, this is the employee, this is the level, here are the salary ranges, and through that tool we were then—we understood what the boundaries were.&quot; ECF No. 749-11 at 53. Going outside these prescribed &quot;guidelines&quot; also required extra approval. ECF No. 749-7 at 217; ECF No. 749-11 at 53 (&quot;And if we were to go outside of that, then we would have to pull in a bunch of people to then approve anything outside of that range.&quot;).</p> <p>Concerns about internal equity also permeated Apple's compensation program. Steven Burmeister (Apple Senior Director of Compensation) testified that internal equity—which Mr. Burmeister defined as the notion of whether an employee's compensation is &quot;fair based on the individual's contribution relative to the other employees in your group, or across your organization&quot;—inheres in some, &quot;if not all,&quot; of the guidelines that managers consider in determining starting salaries. ECF No. 745-7 at 61-64; ECF No. 753-12. In fact, as explained by Patrick Burke (former Apple Technical Recruiter and Staffing Manager), when hiring a new employee at Apple, &quot;compar[ing] the candidate&quot; to the other people on the team they would join &quot;was the biggest determining factor on what salary we gave.&quot; ECF No. 745-6 at 279.</p> <h3 id="2-evidence-related-to-google">2. Evidence Related to Google</h3> <p>The evidence against Google is equally compelling. Email evidence reveals that Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) terminated at least two recruiters for violations of anti-solicitation agreements, and threatened to terminate more. As discussed above, there is direct evidence that Mr. Schmidt terminated a recruiter at Steve Jobs' behest after the recruiter attempted to solicit an Apple employee. Moreover, in an email to Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), Mr. Schmidt indicated that he directed a for-cause termination of another Google recruiter, who had attempted to recruit an executive of eBay, which was on Google's do-not-cold-call list. ECF No. 814-14. Finally, as discussed in more detail below, Mr. Schmidt informed Paul Otellini (CEO of Intel and Member of the Google Board of Directors) that Mr. 
Schmidt would terminate any recruiter who recruited Intel employees.</p> <p>Furthermore, Google maintained a formal &quot;Do Not Call&quot; list, which grouped together Apple, Intel, and Intuit and was approved by top executives. ECF No. 291-28. The list also included other companies, such as Genentech, Paypal, and eBay. Id. A draft of the &quot;Do Not Call&quot; list was presented to Google's Executive Management Group, a committee consisting of Google's senior executives, including Mr. Schmidt, Larry Page (Google Co-Founder), Sergey Brin (Google Co-Founder), and Shona Brown (former Google Senior Vice President of Business Operations). ECF No. 291-26. Mr. Schmidt approved the list. See id.; see also ECF No. 291-27 (email from Mr. Schmidt stating: &quot;This looks very good.&quot;). Moreover, there is evidence that Google executives knew that the anti-solicitation agreements could lead to legal troubles, but nevertheless proceeded with the agreements. When Ms. Brown asked Mr. Schmidt whether he had any concerns with sharing information regarding the &quot;Do Not Call&quot; list with Google's competitors, Mr. Schmidt responded that he preferred that it be shared &quot;verbally[,] since I don't want to create a paper trail over which we can be sued later?&quot; ECF No. 291-40. Ms. Brown responded: &quot;makes sense to do orally. i agree.&quot; Id.</p> <p>Google's response to competition from Facebook also demonstrates the impact of the alleged conspiracy. Google had long been concerned about Facebook hiring's effect on retention. For example, in an email to top Google executives, Mr. Brin in 2007 stated that &quot;the facebook phenomenon creates a real retention problem.&quot; ECF No. 814-4. A month later, Mr. Brin announced a policy of making counteroffers within one hour to any Google employee who received an offer from Facebook. ECF No. 963-2.</p> <p>In March 2008, Arnnon Geshuri (Google Recruiting Director) discovered that non-party Facebook had been cold calling into Google's Site Reliability Engineering (&quot;SRE&quot;) team. Mr. Geshuri's first response was to suggest contacting Sheryl Sandberg (Chief Operating Officer for non-party Facebook) in an effort to &quot;ask her to put a stop to the targeted sourcing effort directed at our SRE team&quot; and &quot;to consider establishing a mutual 'Do Not Call' agreement that specifies that we will not cold-call into each other.&quot; ECF No. 963-3. Mr. Geshuri also suggested &quot;look[ing] internally and review[ing] the attrition rate for the SRE group,&quot; stating, &quot;[w]e may want to consider additional individual retention incentives or team incentives to keep attrition as low as possible in SRE.&quot; Id. (emphasis added). Finally, an alternative suggestion was to &quot;[s]tart an aggressive campaign to call into their company and go after their folks—no holds barred. We would be unrelenting and a force of nature.&quot; Id. In response, Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), in his capacity as an advisor to Google, suggested &quot;Who should contact Sheryl [Sandberg] (or Mark [Zuckerberg]) to get a cease fire? We have to get a truce.&quot; Id. Facebook refused.</p> <p>In 2010, Google altered its salary structure with a &quot;Big Bang&quot; in response to Facebook's hiring, which provides additional support for Plaintiffs' theory of antitrust impact. 
Specifically, after a period in which Google lost a significant number of employees to Facebook, Google began to study Facebook's solicitation of Google employees. ECF No. 190 ¶ 109. One month after beginning this study, Google announced its &quot;Big Bang,&quot; which involved an increase to the base salary of all of its salaried employees by 10% and provided an immediate cash bonus of $1,000 to all employees. ECF No. 296-18. Laszlo Bock (Google Senior Vice President of People Operations) explained that the rationale for the Big Bang included: (1) being &quot;responsive to rising attrition;&quot; (2) supporting higher retention because &quot;higher salaries generate higher fixed costs;&quot; and (3) being &quot;very strategic because start-ups don't have the cash flow to match, and big companies are (a) too worried about internal equity and scalability to do this and (b) don't have the margins to do this.&quot; ECF No. 296-20.</p> <p>Other Google documents provide further evidence of Plaintiffs' theory of antitrust impact. For example, Google's Chief Culture Officer stated that &quot;[c]old calling into companies to recruit is to be expected unless they're on our 'don't call' list.&quot; ECF No. 291-41. Moreover, Google found that although referrals were the largest source of hires, &quot;agencies and passively sourced candidates offer[ed] the highest yield.&quot; ECF No. 780-8. The spread of information between employees had there been active solicitations—which is central to Plaintiffs' theory of impact—is also demonstrated in Google's evidence. For example, one Google employee states that &quot;[i]t's impossible to keep something like this a secret. The people getting counter offers talk, not just to Googlers and ex-Googlers, but also to the competitors where they received their offers (in the hopes of improving them), and those competitors talk too, using it as a tool to recruit more Googlers.&quot; ECF No. 296-23.</p> <p>The wage structure and internal equity concerns at Google also support Plaintiffs' theory of impact. Google had many job families, many grades within job families, and many job titles within grades. See, e.g., ECF No. 298-7, ECF No. 298-8; see also Cisneros Decl., Ex. S (Brown Depo.) at 74-76 (discussing salary ranges utilized by Google); ECF No. 780-4 at 25-26 (testifying that Google's 2007 salary ranges had generally the same structure as the 2004 salary ranges). Throughout the Class period, Google utilized salary ranges and pay bands with minima and maxima and either means or medians. ECF No. 958-1 ¶ 66; see ECF No. 427-3 at 15-17. As explained by Shona Brown (former Google Senior Vice President, Business Operations), &quot;if you discussed a specific role [at Google], you could understand that role was at a specific level on a certain job ladder.&quot; ECF No. 427-3 at 27-28; ECF No. 745-11. Frank Wagner (Google Director of Compensation) testified that he could locate the target salary range for jobs at Google through an internal company website. See ECF No. 780-4 at 31-32 (&quot;Q: And if you wanted to identify what the target salary would be for a certain job within a certain grade, could you go online or go to some place . . . and pull up what that was for that job family and that grade? . . . A: Yes.&quot;). Moreover, Google considered internal equity to be an important goal. Google utilized a salary algorithm in part for the purpose of &quot;[e]nsur[ing] internal equity by managing salaries within a reasonable range.&quot; ECF No. 814-19. 
Furthermore, because Google &quot;strive[d] to achieve fairness in overall salary distribution,&quot; &quot;high performers with low salaries [would] get larger percentage increases than high performers with high salaries.&quot; ECF No. 817-1 at 15.</p> <p>In addition, Google analyzed and compared its equity compensation to Apple, Intel, Adobe, and Intuit, among other companies, each of which it designated as a &quot;peer company&quot; based on meeting criteria such as being a &quot;high-tech company,&quot; a &quot;high-growth company,&quot; and a &quot;key labor market competitor.&quot; ECF No. 773-1. In 2007, based in part on an analysis of Google as compared to its peer companies, Mr. Bock and Dave Rolefson (Google Equity Compensation Manager) wrote that &quot;[o]ur biggest labor market competitors are significantly exceeding their own guidelines to beat Google for talent.&quot; Id.</p> <p>Finally, Google's own documents undermine Defendants' principal theory of lack of antitrust impact, that compensation decisions would be one off and not classwide. Alan Eustace (Google Senior Vice President) commented on concerns regarding competition for workers and Google's approach to counteroffers by noting that, &quot;it sometimes makes sense to make changes in compensation, even if it introduces discontinuities in your current comp, to save your best people, and send a message to the hiring company that we'll fight for our best people.&quot; ECF No. 296-23. Because recruiting &quot;a few really good people&quot; could inspire &quot;many, many others [to] follow,&quot; Mr. Eustace concluded, &quot;[y]ou can't afford to be a rich target for other companies.&quot; Id. According to him, the &quot;long-term . . . right approach is not to deal with these situations as one-offs but to have a <em>systematic approach</em> to compensation that makes it very difficult for anyone to get a better offer.&quot; Id. (emphasis added).</p> <p>Google's impact on the labor market before the anti-solicitation agreements was best summarized by Meg Whitman (former CEO of eBay) who called Mr. Schmidt &quot;to talk about [Google's] hiring practices.&quot; ECF No. 814-15. As Eric Schmidt told Google's senior executives, Ms. Whitman said &quot;Google is the talk of the valley because [you] are driving up salaries across the board.&quot; Id. A year after this conversation, Google added eBay to its do-not-cold-call list. ECF No. 291-28.</p> <h3 id="3-evidence-related-to-intel">3. Evidence Related to Intel</h3> <p>There is also compelling evidence against Intel. Google reacted to requests regarding enforcement of the anti-solicitation agreement made by Intel executives similarly to Google's reaction to Steve Jobs' request to enforce the agreements discussed above. For example, after Paul Otellini (CEO of Intel and Member of the Google Board of Directors) received an internal complaint regarding Google's successful recruiting efforts of Intel's technical employees on September 26, 2007, ECF No. 188-8 (&quot;Paul, I am losing so many people to Google . . . . We are countering but thought you should know.&quot;), Mr. Otellini forwarded the email to Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) and stated &quot;Eric, can you pls help here???&quot; Id. Mr. Schmidt obliged and forwarded the email to his recruiting team, who prepared a report for Mr. Schmidt on Google's activities. ECF No. 291-34. The next day, Mr. Schmidt replied to Mr. 
Otellini, &quot;If we find that a recruiter called into Intel, we will terminate the recruiter,&quot; the same remedy afforded to violations of the Apple-Google agreement. ECF No. 531 at 37. In another email to Mr. Schmidt, Mr. Otellini stated, &quot;Sorry to bother you again on this topic, but my guys are very troubled by Google continuing to recruit our key players.&quot; See ECF No. 428-8.</p> <p>Moreover, Mr. Otellini was aware that the anti-solicitation agreement could be legally troublesome. Specifically, Mr. Otellini stated in an email to another Intel executive regarding the Google-Intel agreement: &quot;Let me clarify. We have nothing signed. We have a handshake 'no recruit' between eric and myself. I would not like this broadly known.&quot; Id.</p> <p>Furthermore, there is evidence that Mr. Otellini knew of the anti-solicitation agreements to which Intel was not a party. Specifically, both Sergey Brin (Google Co-Founder) and Mr. Schmidt of Google testified that they would have told Mr. Otellini that Google had an anti-solicitation agreement with Apple. ECF No. 639-1 at 74:15 (&quot;I'm sure that we would have mentioned it[.]&quot;); ECF No. 819-12 at 60 (&quot;I'm sure I spoke with Paul about this at some point.&quot;). Intel's own expert testified that Mr. Otellini was likely aware of Google's other bilateral agreements by virtue of Mr. Otellini's membership on Google's board. ECF No. 771 at 4. The fact that Intel was added to Google's do-not-cold-call list on the same day that Apple was added further suggests Intel's participation in an overarching conspiracy. ECF No. 291-28.</p> <p>Additionally, notwithstanding the fact that Intel and Google were competitors for talent, Mr. Otellini &quot;lifted from Google&quot; a Google document discussing the bonus plans of peer companies including Apple and Intel. Cisneros Decl., Ex. 463. True competitors for talent would not likely share such sensitive bonus information absent agreements not to compete.</p> <p>Moreover, key documents related to antitrust impact also implicate Intel. Specifically, Intel recognized the importance of cold calling and stated in its &quot;Complete Guide to Sourcing&quot; that &quot;[Cold] [c]alling candidates is one of the most efficient and effective ways to recruit.&quot; ECF No. 296-22. Intel also benchmarked compensation against other &quot;tech companies generally considered comparable to Intel,&quot; which Intel defined as a &quot;[b]lend of semiconductor, software, networking, communications, and diversified computer companies.&quot; ECF No. 754-2. According to Intel, in 2007, these comparable companies included Apple and Google. Id. These documents suggest, as Plaintiffs contend, that the anti-solicitation agreements led to structural, rather than individual depression, of Class members' wages.</p> <p>Furthermore, Intel had a &quot;compensation structure,&quot; with job grades and job classifications. See ECF No. 745-13 at 73 (&quot;[W]e break jobs into one of three categories—job families, we call them—R&amp;D, tech, and nontech, there's a lot more . . . .&quot;). The company assigned employees to a grade level based on their skills and experience. ECF No. 745-11 at 23; see also ECF No. 749-17 at 45 (explaining that everyone at Intel is assigned a &quot;classification&quot; similar to a job grade). Intel standardized its salary ranges throughout the company; each range applied to multiple jobs, and most jobs spanned multiple salary grades. ECF No. 745-16 at 59. 
Intel further broke down its salary ranges into quartiles, and compensation at Intel followed &quot;a bell-curve distribution, where most of the employees are in the middle quartiles, and a much smaller percentage are in the bottom and top quartiles.&quot; Id. at 62-63.</p> <p>Intel also used a software tool to provide guidance to managers about an employee's pay range which would also take into account market reference ranges and merit. ECF No. 758-9. As explained by Randall Goodwin (Intel Technology Development Manager), &quot;[i]f the tool recommended something and we thought we wanted to make a proposed change that was outside its guidelines, we would write some justification.&quot; ECF No. 749-15 at 52. Similarly, Intel regularly ran reports showing the salary range distribution of its employees. ECF No. 749-16 at 64.</p> <p>The evidence also supports the rigidity of Intel's wage structure. For example, in a 2004 Human Resources presentation, Intel states that, although &quot;[c]ompensation differentiation is desired by Intel's Meritocracy philosophy,&quot; &quot;short and long term high performer differentiation is questionable.&quot; ECF No. 758-10 at 13. Indeed, Intel notes that &quot;[l]ack of differentiation has existed historically based on an analysis of '99 data.&quot; Id. at 19. As key &quot;[v]ulnerability [c]hallenges,&quot; Intel identifies: (1) &quot;[m]anagers (in)ability to distinguish at [f]ocal&quot;—&quot;actual merit increases are significantly reduced from system generated increases,&quot; &quot;[l]ong term threat to retention of key players&quot;; (2) &quot;[l]ittle to no actual pay differentiation for HPs [high performers]&quot;; and (3) &quot;[n]o explicit strategy to differentiate.&quot; Id. at 24 (emphasis added).</p> <p>In addition, Intel used internal equity &quot;to determine wage rates for new hires and current employees that correspond to each job's relative value to Intel.&quot; ECF No. 749-16 at 210-11; ECF No. 961-5. To assist in that process, Intel used a tool that generates an &quot;Internal Equity Report&quot; when making offers to new employees. ECF No. 749-16 at 212-13. In the words of Ogden Reid (Intel Director of Compensation and Benefits), &quot;[m]uch of our culture screams egalitarianism . . . . While we play lip service to meritocracy, we really believe more in treating everyone the same within broad bands.&quot; ECF No. 769-8.</p> <p>An Intel human resources document from 2002—prior to the anti-solicitation agreements—recognized &quot;continuing inequities in the alignment of base salaries/EB targets between hired and acquired Intel employees&quot; and &quot;parallel issues relating to accurate job grading within these two populations.&quot; ECF No. 750-15. In response, Intel planned to: (1) &quot;Review exempt job grade assignments for job families with 'critical skills.' Make adjustments, as appropriate&quot;; and (2) &quot;Validate perception of inequities . . . . Scope impact to employees. Recommend adjustments, as appropriate.&quot; Id. An Intel human resources document confirms that, in or around 2004, &quot;[n]ew hire salary premiums drove salary range adjustment.&quot; ECF No. 298-5 at 7 (emphasis added).</p> <p>Intel would &quot;match an Intel job code in grade to a market survey job code in grade,&quot; ECF No. 749-16 at 89, and use that as part of the process for determining its &quot;own focal process or pay delivery,&quot; id. at 23. 
If job codes fell below the midpoint, plus or minus a certain percent, the company made &quot;special market adjustment[s].&quot; Id. at 90.</p> <h3 id="4-evidence-related-to-adobe">4. Evidence Related to Adobe</h3> <p>Evidence from Adobe also suggests that Adobe was aware of the impact of its anti-solicitation agreements. Adobe personnel recognized that &quot;Apple would be a great target to look into&quot; for the purpose of recruiting, but knew that they could not do so because, &quot;[u]nfortunately, Bruce [Chizen (former Adobe CEO)] and Apple CEO Steve Jobs have a gentleman's agreement not to poach each other's talent.&quot; ECF No. 291-13. Adobe executives were also part and parcel of the group of high-ranking executives that entered into, enforced, and attempted to expand the anti-solicitation agreements. Specifically, Mr. Chizen, in response to discovering that Apple was recruiting employees of Macromedia (a separate entity that Adobe would later acquire), helped ensure, through an email to Mr. Jobs, that Apple would honor Apple's pre-existing anti-solicitation agreements with both Adobe and Macromedia after Adobe's acquisition of Macromedia. ECF No. 608-3 at 50.</p> <p>Adobe viewed Google and Apple to be among its top competitors for talent and expressed concern about whether Adobe was &quot;winning the talent war.&quot; ECF No. 296-3. Adobe further considered itself in a &quot;six-horse race from a benefits standpoint,&quot; which included Google, Apple, and Intuit as among the other &quot;horses.&quot; See ECF No. 296-4. In 2008, Adobe benchmarked its compensation against nine companies including Google, Apple, and Intel. ECF No. 296-4; cf. ECF No. 652-6 (showing that, in 2010, Adobe considered Intuit to be a &quot;direct peer,&quot; and considered Apple, Google, and Intel to be &quot;reference peers,&quot; though Adobe did not actually benchmark compensation against these latter companies).</p> <p>Nevertheless, despite viewing other Defendants as competitors, evidence from Adobe suggests that Adobe had knowledge of the bilateral agreements to which Adobe was not a party. Specifically, Adobe shared confidential compensation information with other Defendants, despite the fact that Adobe viewed at least some of the other Defendants as competitors and did not have a bilateral agreement with them. For example, HR personnel at Intuit and at Adobe exchanged information labeled &quot;confidential&quot; regarding how much compensation each firm would give and to which employees that year. ECF No. 652-8. Adobe and Intuit shared confidential compensation information even though the two companies had no bilateral anti-solicitation agreement, and Adobe viewed Intuit as a direct competitor for talent. Such direct competitors for talent would not likely share such sensitive compensation information in the absence of an overarching conspiracy.</p> <p>Meanwhile, Google circulated an email that expressly discussed how its &quot;budget is comparable to other tech companies&quot; and compared the precise percentage of Google's merit budget increases to that of Adobe, Apple, and Intel. ECF No. 807-13. Google had Adobe's precise percentage of merit budget increases even though Google and Adobe had no bilateral anti-solicitation agreement. 
Such sharing of sensitive compensation information among competitors is further evidence of an overarching conspiracy.</p> <p>Adobe recognized that in the absence of the anti-solicitation agreements, pay increases would be necessary, echoing Plaintiffs' theory of impact. For example, out of concern that one employee—a &quot;star performer&quot; due to his technical skills, intelligence, and collaborative abilities—might leave Adobe because &quot;he could easily get a great job elsewhere if he desired,&quot; Adobe considered how best to retain him. ECF No. 799-22. In so doing, Adobe expressed concern about the fact that this employee had already interviewed with four other companies and communicated with friends who worked there. Id. Thus, Adobe noted that the employee &quot;was aware of his value in the market&quot; as well as the fact that the employee's friends from college were &quot;making approximately $15k more per year than he [wa]s.&quot; Id. In response, Adobe decided to give the employee an immediate pay raise. Id.</p> <p>Plaintiffs' theory of impact is also supported by evidence that every job position at Adobe was assigned a job title, and every job title had a corresponding salary range within Adobe's salary structure, which included a salary minimum, middle, and maximum. See ECF No. 804-17 at 4, 8, 72, 85-86. Adobe expected that the distribution of its existing employees' salaries would fit &quot;a bell curve.&quot; ECF No. 749-5 at 57. To assist managers in staying within the prescribed ranges for setting and adjusting salaries, Adobe had an online salary planning tool as well as salary matrices, which provided managers with guidelines based on market salary data. See ECF No. 804-17 at 29-30 (&quot;[E]ssentially the salary planning tool is populated with employee information for a particular manager, so the employees on their team [sic]. You have the ability to kind of look at their current compensation. It shows them what the range is for the current role that they're in . . . . The tool also has the ability to provide kind of the guidelines that we recommend in terms of how managers might want to think about spending their allocated budget.&quot;). Adobe's practice, if employees were below the minimum recommended salary range, was to &quot;adjust them to the minimum as part of the annual review&quot; and &quot;red flag them.&quot; Id. at 12. Deviations from the salary ranges would also result in conversations with managers, wherein Adobe's officers explained, &quot;we have a minimum for a reason because we believe you need to be in this range to be competitive.&quot; Id.</p> <p>Internal equity was important at Adobe, as it was at other Defendants. As explained by Debbie Streeter (Adobe Vice President, Total Rewards), Adobe &quot;always look[ed] at internal equity as a data point, because if you are going to go hire somebody externally that's making . . . more than somebody who's an existing employee that's a high performer, you need to know that before you bring them in.&quot; ECF No.749-5 at 175. Similarly, when considering whether to extend a counteroffer, Adobe advised &quot;internal equity should ALWAYS be considered.&quot; ECF No. 746-7 at 5.</p> <p>Moreover, Donna Morris (Adobe Senior Vice President, Global Human Resources Division) expressed concern &quot;about internal equity due to compression (the market driving pay for new hires above the current employees).&quot; ECF No. 
298-9 (&quot;Reality is new hires are requiring base pay at or above the midpoint due to an increasingly aggressive market.&quot;). Adobe personnel stated that, because of the fixed budget, they may not be able to respond to the problem immediately &quot;but could look at [compression] for FY2006 if market remains aggressive.&quot;<sup class="footnote-ref" id="fnref:12"><a rel="footnote" href="#fn:12">11</a></sup> Id.</p> <h3 id="d-weaknesses-in-plaintiffs-case">D. Weaknesses in Plaintiffs' Case</h3> <p>Plaintiffs contend that though this evidence is compelling, there are also weaknesses in Plaintiffs' case that make trial risky. Plaintiffs contend that these risks are substantial. Specifically, Plaintiffs point to the following challenges that they would have faced in presenting their case to a jury: (1) convincing a jury to find a single overarching conspiracy among the seven Defendants in light of the fact that several pairs of Defendants did not have anti-solicitation agreements with each other; (2) proving damages in light of the fact that Defendants intended to present six expert economists that would attack the methodology of Plaintiffs' experts; and (3) overcoming the fact that Class members' compensation has increased in the last ten years despite a sluggish economy and overcoming general anti-tech worker sentiment in light of the perceived and actual wealth of Class members. Plaintiffs also point to outstanding legal issues, such as the pending motions in limine and the pending motion to determine whether the per se or rule of reason analysis should apply, which could have aided Defendants' ability to present a case that the bilateral agreements had a pro-competitive purpose. See ECF No. 938 at 10-14.</p> <p>The Court recognizes that Plaintiffs face substantial risks if they proceed to trial. Nonetheless, the Court cannot, in light of the evidence above, conclude that the instant settlement amount is within the range of reasonableness, particularly compared to the settlements with the Settled Defendants and the subsequent development of the litigation. The Court further notes that there is evidence in the record that mitigate at least some of the weaknesses in Plaintiffs' case.</p> <p>As to proving an overarching conspiracy, several pieces of evidence undermine Defendants' contentions that the bilateral agreements were unrelated to each other. Importantly, two individuals, Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple) and Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), personally entered into or facilitated each of the bilateral agreements in this case. Specifically, Mr. Jobs and George Lucas (former Chairman and CEO of Lucasfilm), created the initial anti-solicitation agreement between Lucasfilm and Pixar when Mr. Jobs was an executive at Pixar. Thereafter, Apple, under the leadership of Mr. Jobs, entered into an agreement with Pixar, which, as discussed below, Pixar executives compared to the Lucasfilm-Pixar agreement. It was Mr. Jobs again, who, as discussed above, reached out to Sergey Brin (Google Co-Founder) and Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) to create the Apple-Google agreement. This agreement was reached with the assistance of Mr. Campbell, who was Intuit's Board Chairman, a friend of Mr. Jobs, and an advisor to Google. The Apple-Google agreement was discussed at Google Board meetings, at which both Mr. 
Campbell and Paul Otellini (Chief Executive Officer of Intel and Member of the Google Board of Directors) were present. ECF No. 819-10 at 47. After discussions between Mr. Brin and Mr. Otellini and between Mr. Schmidt and Mr. Otellini, Intel was added to Google's do-not-cold-call list. Mr. Campbell then used his influence at Google to successfully lobby Google to add Intuit, of which Mr. Campbell was Chairman of the Board of Directors, to Google's do-not-cold-call list. See ECF No. 780-6 at 8-9. Moreover, it was a mere two months after Mr. Jobs entered into the Apple-Google agreement that Apple pressured Bruce Chizen (former CEO of Adobe) to enter into an Apple-Adobe agreement. ECF No. 291-17. As this discussion demonstrates, Mr. Jobs and Mr. Campbell were the individuals most closely linked to the formation of each step of the alleged conspiracy, as they were present in the process of forming each of the links.</p> <p>In light of the overlapping nature of this small group of executives who negotiated and enforced the anti-solicitation agreements, it is not surprising that these executives knew of the other bilateral agreements to which their own firms were not a party. For example, both Mr. Brin and Mr. Schmidt of Google testified that they would have told Mr. Otellini of Intel that Google had an anti-solicitation agreement with Apple. ECF No. 639-1 at 74:15 (&quot;I'm sure we would have mentioned it[.]&quot;); ECF No. 819-12 at 60 (&quot;I'm sure I spoke with Paul about this at some point.&quot;). Intel's own expert testified that Mr. Otellini was likely aware of Google's other bilateral agreements by virtue of Mr. Otellini's membership on Google's board. ECF No. 771 at 4. Moreover, Google recruiters knew of the Adobe-Apple agreement. Id. (Google recruiter's notation that Apple has &quot;a serious 'hands-off policy with Adobe&quot;). In addition, Mr. Schmidt of Google testified that it would be &quot;fair to extrapolate&quot; based on Mr. Schmidt's knowledge of Mr. Jobs, that Mr. Jobs &quot;would have extended [anti-solicitation agreements] to others.&quot; ECF No. 638-8 at 170. Furthermore, it was this same mix of top executives that successfully and unsuccessfully attempted to expand the agreement to other companies in Silicon Valley, such as eBay, Facebook, Macromedia, and Palm, as discussed above, suggesting that the agreements were neither isolated nor one off agreements.</p> <p>In addition, the six bilateral agreements contained nearly identical terms, precluding each pair of Defendants from affirmatively soliciting any of each other's employees. ECF No. 531 at 30. Moreover, as discussed above, Defendants recognized the similarity of the agreements. For example, Google lumped together Apple, Intel, and Intuit on Google's &quot;do-not-cold-call&quot; list. Furthermore, Google's &quot;do-not-cold-call&quot; list stated that the Apple-Google agreement and the Intel-Google agreement commenced on the same date. Finally, in an email, Lori McAdams (Pixar Vice President of Human Resources and Administration), explicitly compared the anti-solicitation agreements, stating that &quot;effective now, we'll follow a gentleman's agreement with Apple that is similar to our Lucasfilm agreement.&quot; ECF No. 531 at 26.</p> <p>As to the contention that Plaintiffs would have to rebut Defendants' contentions that the anti-solicitation agreements aided collaborations and were therefore pro-competitive, there is no documentary evidence that links the anti-solicitation agreements to any collaboration. 
None of the documents that memorialize collaboration agreements mentions the broad anti-solicitation agreements, and none of the documents that memorialize broad anti-solicitation agreements mentions collaborations. Furthermore, even Defendants' experts conceded that those closest to the collaborations did not know of the anti-solicitation agreements. ECF No. 852-1 at 8. In addition, Defendants' top executives themselves acknowledge the lack of any collaborative purpose. For example, Mr. Chizen of Adobe admitted that the Adobe-Apple anti-solicitation agreement was &quot;not limited to any particular projects on which Apple and Adobe were collaborating.&quot; ECF No. 962-7 at 42. Moreover, the U.S. Department of Justice (&quot;DOJ&quot;) also determined that the anti-solicitation agreements &quot;were not ancillary to any legitimate collaboration,&quot; &quot;were broader than reasonably necessary for the formation or implementation of any collaborative effort,&quot; and &quot;disrupted the normal price-setting mechanisms that apply in the labor setting.&quot; ECF No. 93-1 ¶ 16; ECF No. 93-4 ¶ 7. The DOJ concluded that Defendants entered into agreements that were restraints of trade that were per se unlawful under the antitrust laws. ECF No. 93-1 ¶ 35; ECF No. 93-4 ¶ 3. Thus, despite the fact that Defendants have claimed since the beginning of this litigation that there were pro-competitive purposes related to collaborations for the anti-solicitation agreements and despite the fact that the purported collaborations were central to Defendants' motions for summary judgment, Defendants have failed to produce persuasive evidence that these anti-solicitation agreements related to collaborations or were pro-competitive.</p> <h3 id="iv-conclusion">IV. CONCLUSION</h3> <p>This Court has lived with this case for nearly three years, and during that time, the Court has reviewed a significant number of documents in adjudicating not only the substantive motions, but also the voluminous sealing requests. Having done so, the Court cannot conclude that the instant settlement falls within the range of reasonableness. As this Court stated in its summary judgment order, there is ample evidence of an overarching conspiracy between the seven Defendants, including &quot;[t]he similarities in the various agreements, the small number of intertwining high-level executives who entered into and enforced the agreements, Defendants' knowledge about the other agreements, the sharing and benchmarking of confidential compensation information among Defendants and even between firms that did not have bilateral anti-solicitation agreements, along with Defendants' expansion and attempted expansion of the anti-solicitation agreements.&quot; ECF No. 771 at 7-8. Moreover, as discussed above and in this Court's class certification order, the evidence of Defendants' rigid wage structures and internal equity concerns, along with statements from Defendants' own executives, are likely to prove compelling in establishing the impact of the anti-solicitation agreements: a Class-wide depression of wages.</p> <p>In light of this evidence, the Court is troubled by the fact that the instant settlement with Remaining Defendants is proportionally lower than the settlements with the Settled Defendants. This concern is magnified by the fact that the case evolved in Plaintiffs' favor since those settlements. 
At the time those settlements were reached, Defendants still could have defeated class certification before this Court, Defendants still could have successfully sought appellate review and reversal of any class certification, Defendants still could have prevailed on summary judgment, or Defendants still could have succeeded in their attempt to exclude Plaintiffs' principal expert. In contrast, the instant settlement was reached a mere month before trial was set to commence and after these opportunities for Defendants had evaporated. While the unpredictable nature of trial would have undoubtedly posed challenges for Plaintiffs, the exposure for Defendants was even more substantial, both in terms of the potential of more than $9 billion in damages and in terms of other collateral consequences, including the spotlight that would have been placed on the evidence discussed in this Order and other evidence and testimony that would have been brought to light. The procedural history and proximity to trial should have increased, not decreased, Plaintiffs' leverage from the time the settlements with the Settled Defendants were reached a year ago.</p> <p>The Court acknowledges that Class counsel have been zealous advocates for the Class and have funded this litigation themselves against extraordinarily well-resourced adversaries. Moreover, there very well may be weaknesses and challenges in Plaintiffs' case that counsel cannot reveal to this Court. Nonetheless, the Court concludes that the Remaining Defendants should, at a minimum, pay their fair share as compared to the Settled Defendants, who resolved their case with Plaintiffs at a stage of the litigation where Defendants had much more leverage over Plaintiffs.</p> <p>For the foregoing reasons, the Court DENIES Plaintiffs' Motion for Preliminary Approval of the settlements with Remaining Defendants. The Court further sets a Case Management Conference for September 10, 2014 at 2 p.m.</p> <p>IT IS SO ORDERED.</p> <pre><code>Dated: August 8, 2014

                                        LUCY H. KOH
                                        United States District Judge
</code></pre> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Dr. Leamer was subject to vigorous attack in the initial class certification motion, and this Court agreed with some of Defendants' contentions with respect to Dr. Leamer and thus rejected the initial class certification motion. See ECF No. 382 at 33-43. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">Defendants' motions in limine, Plaintiffs' motion to exclude testimony from certain experts, Defendants' motion to exclude testimony from certain experts, a motion to determine whether the per se or rule of reason analysis applied, and a motion to compel were pending at the time the settlement was reached. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3">Plaintiffs in the instant Motion represent that two of the letters are from non-Class members and that the third letter is from a Class member who may be withdrawing his objection. See ECF No. 920 at 18 n.11. The objection has not been withdrawn at the time of this Order. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4">Devine stated in his Opposition that the Opposition was designed to supersede a letter that he had previously sent to the Court. See ECF No. at 934 n.2. The Court did not receive any letter from Devine. Accordingly, the Court has considered only Devine's Opposition. 
<a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> <li id="fn:5">Plaintiffs also assert that administration costs for the settlement would be $160,000. <a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li> <li id="fn:6">Devine calculated that Class members would receive an average of $3,573. The discrepancy between this number and the Court's calculation may result from the fact that Devine's calculation does not account for the fact that 147 individuals have already opted out of the Class. The Court's calculation resulted from subtracting the requested attorneys' fees ($81,125,000), costs ($1,200,000), incentive awards ($400,000), and estimated administration costs ($160,000) from the settlement amount ($324,500,000) and dividing the resulting number by the total number of 7remaining class members (64,466). <a class="footnote-return" href="#fnref:6"><sup>[return]</sup></a></li> <li id="fn:7">If the Court were to deny any portion of the requested fees, costs, or incentive payments, this would increase individual Class members' recovery. If less than 4% of the Class were to opt out, that would also increase individual Class members' recovery. <a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li> <li id="fn:8">One way to think about this is to set up the simple equation: 5/95 = $20,000,000/x. This equation asks the question of how much 95% would be if 5% were $20,000,000. Solving for x would result in $380,000,000. <a class="footnote-return" href="#fnref:8"><sup>[return]</sup></a></li> <li id="fn:10">On the same day, Mr. Campbell sent an email to Mr. Brin and to Larry Page (Google Co-Founder) stating, &quot;Steve just called me again and is pissed that we are still recruiting his browser guy.&quot; ECF No. 428-13. Mr. Page responded &quot;[h]e called a few minutes ago and demanded to talk to me.&quot; Id. <a class="footnote-return" href="#fnref:10"><sup>[return]</sup></a></li> <li id="fn:11">Mr. Jobs successfully expanded the anti-solicitation agreements to Macromedia, a company acquired by Adobe, both before and after Adobe's acquisition of Macromedia. <a class="footnote-return" href="#fnref:11"><sup>[return]</sup></a></li> <li id="fn:12">Adobe also benchmarked compensation off external sources, which supports Plaintiffs' theory of Class-wide impact and undermines Defendants' theory that the anti-solicitation agreements had only one off, non-structural effects. For example, Adobe pegged its compensation structure as a &quot;percentile&quot; of average market compensation according to survey data from companies such as Radford. ECF No. 804-17 at 4. Mr. Chizen explained that the particular market targets that Adobe used as benchmarks for setting salary ranges &quot;tended to be software, high-tech, those that were geographically similar to wherever the position existed.&quot; ECF No. 962-7 at 22. This demonstrated that the salary structures of the various Defendants were linked, such that the effect of one Defendant's salary structure would ripple across to the other Defendants through external sources like Radford. <a class="footnote-return" href="#fnref:12"><sup>[return]</sup></a></li> </ol> </div> Verilog Won & VHDL Lost? — You Be The Judge! verilog-vs-vhdl/ Thu, 14 Aug 2014 00:00:00 +0000 verilog-vs-vhdl/ <p><i>This is an archived USENET post from John Cooley on a competitive comparison between VHDL and Verilog that was done in 1997.</i></p> <p>I knew I hit a nerve. 
Usually when I publish a candid review of a particular conference or EDA product I typically see around 85 replies in my e-mail &quot;in&quot; box. Buried in my review of the recent Synopsys Users Group meeting, I very tersely reported that 8 out of the 9 Verilog designers managed to complete the conference's design contest yet <em>none</em> of the 5 VHDL designers could. I apologized for the terseness and promised to do a detailed report on the design contest at a later date. Since publishing this, my e-mail &quot;in&quot; box has become a veritable Verilog/VHDL Beirut filling up with 169 replies! Once word leaked that the detailed contest write-up was going to be published in the DAC issue of &quot;Integrated System Design&quot; (formerly &quot;ASIC &amp; EDA&quot; magazine), I started getting phone calls from the chairman of VHDL International, Mahendra Jain, and from the president of Open Verilog International, Bill Fuchs. A small army of hired gun spin doctors (otherwise known as PR agents) followed with more phone calls. I went ballistic when VHDL columnist Larry Saunders approached the Editor-in-Chief of ISD for an advance copy of my design contest report. He felt I was &quot;going to do a hatchet job on VHDL&quot; and wanted to write a rebuttal that would follow my article... and all this was happening before I had even written <em>one</em> damned word of the article!</p> <p>Because I'm an independent consultant who makes his living training and working <em>both</em> HDL's, I'd rather not go through a VHDL Salem witch trial where I'm publicly accused of being secretly in league with the Devil to promote Verilog, thank you. Instead I'm going to present <em>everything</em> that happened at the Design Contest, warts and all, and let <em>you</em> judge! At the end of court evidence, I'll ask you, the jury, to write an e-mail reply which I can publish in my column in the follow-up &quot;Integrated System Design&quot;.</p> <h4 id="the-unexpected-results">The Unexpected Results</h4> <p>Contestants were given 90 minutes using either Verilog or VHDL to create a gate netlist for the fastest fully synchronous loadable 9-bit increment-by-3 decrement-by-5 up/down counter that generated even parity, carry and borrow.</p> <p>Of the 9 Verilog designers in the contest, only 1 didn't get to a final gate level netlist because he tried to code a look-ahead parity generator. Of the 8 remaining, 3 had netlists that missed on functional test vectors. The 5 Verilog designers who got fully functional gate-level designs were:</p> <pre>
   Larry Fiedler      NVidia               3.90 nsec   1147 gates
   Steve Golson       Trilobyte Systems    4.30 nsec   1909 gates
   Howard Landman     HaL Computer         5.49 nsec   1495 gates
   Mark Papamarcos    EDA Associates       5.97 nsec   1180 gates
   Ed Paluch          Paluch & Assoc.      7.85 nsec   1514 gates
 </pre> <p>The surprise was that, during the same time, <em>none</em> of the 5 VHDL designers in the contest managed to produce any gate level designs.</p> <h4 id="not-vhdl-newbies-vs-verilog-pros">Not VHDL Newbies vs. Verilog Pros</h4> <p>The first reaction I get from the VHDL bigots (who weren't at the competition) is: &quot;Well, this is obviously a case where Verilog veterans whipped some VHDL newbies. Big deal.&quot; Well, they're partially right. Many of those Verilog designers are damned good at what they do — but so are the VHDL designers!</p> <p>I've known Prasad Paranjpe of LSI Logic for years. He has taught and still teaches VHDL with synthesis classes at U.C. Santa Cruz University Extension in the heart of Silicon Valley. 
He was VP of the Silicon Valley VHDL Local Users Group. He's been a full time ASIC designer since 1987 and has designed <em>real</em> ASIC's since 1990 using VHDL &amp; Synopsys since rev 1.3c. Prasad's home e-mail address is &quot;vhdl@ix.netcom.com&quot; and his home phone is (XXX) XXX-VHDL. ASIC designer Jan Decaluwe has a history of contributing insightful VHDL and synthesis posts to ESNUG while at Alcatel and later as a founder of Easics, a European ASIC design house. (Their company motto: &quot;Easics - The VHDL Design Company&quot;.) Another LSI Logic/VHDL contestant, Vikram Shrivastava, has used the VHDL/Synopsys design approach since 1992. These guys aren't newbies!</p> <h4 id="creating-the-contest">Creating The Contest</h4> <p>I followed a double-blind approach to putting together this design contest. That is, not only did I have Larry Saunders (a well known VHDL columnist) and Yatin Trivedi (a well known Verilog columnist), both of Seva Technologies, comment on the design contest — unknown to them I had Ken Nelsen (a VHDL oriented Methodology Manager from Synopsys) and Jeff Flieder (a Verilog based designer from Ford Microelectronics) also help check the design contest for any conceptual or implementation flaws.</p> <p>My initial concern in creating the contest was to not have a situation where the Synopsys Design Compiler could quickly complete the design by just placing down a DesignWare part. Yet, I didn't want to have contestants trying (and failing) to design some fruity, off-the-wall thingy that no one truly understood. Hence, I was restricted to &quot;standard&quot; designs that all engineers knew — but with odd parameters thrown in to keep DesignWare out of the picture. Instead of a simple up/down counter, I asked for an up-by-3 and down-by-5 counter. Instead of 8 bits, everything was 9 bits.</p> <pre>
   Fig.1) Basic block diagram outlining design's functionality: UP, DOWN, and
          DATA_IN[8:0] feed the up-by-3/down-by-5 logic, which produces
          new_count[8:0] plus carry and borrow; new_count[8:0] also drives an
          even parity generator (1 bit). COUNT_OUT[8:0], CARRY_OUT, BORROW_OUT,
          and PARITY_OUT are all registered on the rising edge of CLOCK, and
          COUNT_OUT[8:0] is recycled back into the up/down logic.
 </pre> <p>The even PARITY, CARRY and BORROW requirements were thrown in to give the contestants some space to make significant architectural trade-offs that could mean the difference between winning and losing.</p> <p>The counter loaded when UP and DOWN were both &quot;low&quot;, and held its state when UP and DOWN were &quot;high&quot; — exactly opposite to what 99% of the world's loadable counters traditionally do.</p> <pre>
   UP   DOWN   DATA_IN      |   COUNT_OUT
   -----------------------------------------
   0    0      valid        |   load DATA_IN
   0    1      don't care   |   (Q - 5)
   1    0      don't care   |   (Q + 3)
   1    1      don't care   |   Q unchanged

   Fig. 2) Loading and up/down counting specifications. All I/O events
           happen on the rising edge of CLOCK.
 </pre> <p>To spice things up a bit further, I chose to use the LSI Logic 300K ASIC library because wire loading &amp; wire delay is a significant factor in this technology. 
Having the &quot;home library&quot; advantage, one savvy VHDL designer, Prasad Paranjpe of LSI Logic, cleverly asked if the default wire loading model was required (he wanted to use a zero wire load model to save on timing!) I replied: &quot;Nice try. Yes, the default wire model is required.&quot;</p> <p>To let the focus be on design and not verification, contestants were given equivalent Verilog and VHDL testbenches provided by Yatin Trivedi &amp; Larry Saunders' Seva Technologies. These testbenches threw the same 18 vectors at the Verilog/VHDL source code the contestants were creating and if it passed, for contest purposes, their design was judged &quot;functionally correct.&quot;</p> <p>For VHDL, contestants had their choice of Synopsys VSS 3.2b and/or Cadence Leapfrog VHDL 2.1.4; for Verilog, contestants had their choice of Cadence Verilog-XL 2.1.2 or Chronologic VCS 2.3.2 plus their respective Verilog/VHDL design environments. (The CEO of Model Technology Inc., Bob Hunter, was too paranoid about the possibility of Synopsys employees seeing his VHDL to allow it in the contest.) LCB 300K rev 3.1A.1.1.101 was the LSI Logic library.</p> <p>I had a concern that some designers might not know that an XOR reduction tree is how one generates parity — but Larry, Yatin, Ken &amp; Jeff all agreed that any engineer not knowing this shouldn't be helped to win a design contest. As a last minute hint, I put in every contestant's directory an &quot;xor.readme&quot; file that named the two XOR gates available in the LSI 300K library (EO and EO3) plus their drive strengths and port lists.</p> <p>To be friendly synthesis-wise, I let the designers keep the unrealistic Synopsys default setting of all inputs having infinite drive strength and all outputs driving zero loads.</p> <p>The contest took place in three sessions over the same day. To keep things equal, my guiding philosophy throughout these sessions was to conscientiously <em>not</em> fix/improve <em>anything</em> between sessions — no matter how frustrating!</p> <p>After all that was said &amp; done, Larry &amp; Yatin thought that the design contest would be too easy while Ken &amp; Jeff thought it would have just about the right amount of complexity. I asked all four if they saw any Verilog or VHDL specific &quot;gotchas&quot; with the contest; all four categorically said &quot;no.&quot;</p> <h4 id="murphy-s-law">Murphy's Law</h4> <p>Once the contest began, Murphy's Law — &quot;that which can go wrong, will go wrong&quot; — prevailed. Because we couldn't get the SUN and HP workstations until a terrifying 3 days before the contest, I lived through a nightmare domino effect on getting all the Verilog, VHDL, Synopsys and LSI libraries in and installed. Nobody could cut keys for the software until the machine ID's were known — and this wasn't until 2 days before the contest! (As it was, I had to drop the HP machines because most of the EDA vendors couldn't cut software keys for HP machines as fast as they could for SUN workstations.)</p> <p>The LSI 300K Libraries didn't arrive until an hour before the contest began. The Seva guys found and fixed a bug in the Verilog testbench (that didn't exist in the VHDL testbench) some 15 minutes before the contest began.</p> <p>Some 50 minutes into the first design session, one engineer's machine crashed — which also happened to be the license server for all the Verilog simulation software! (Luckily, by this time all the Verilog designers were deep into the synthesis stage.) 
Unfortunately, the poor designer who had his machine crash couldn't be allowed to redo the contest in a following session because of his prior knowledge of the design problem. This machine was rebooted and used solely as a license server for the rest of the contest.</p> <p>The logistics nightmare once again reared its ugly head when two designers innocently asked: &quot;John, where are your Synopsys manuals?&quot; Inside I screamed to myself: &quot;OhMyGod! OhMyGod! OhMyGod!&quot;; outside I calmly replied: &quot;There are no manuals for any software here. You have to use the online docs available.&quot;</p> <p>More little gremlins danced in my head when I realized that six of the eight data books that the LSI lib person brought weren't for the <em>exact</em> LCB 300K library we were using — these data books would be critical for anyone trying to hand build an XOR reduction tree — and one Verilog contestant had just spent ten precious minutes reading a misleading data book! (There were two LCB 300K, one LCA 300K and five LEA 300K databooks.) Verilog designer Howard Landman of HaL Computer noted: &quot;I probably wasted 15 minutes trying to work through this before giving up and just coding functional parity — although I used parentheses in hopes of Synopsys using 3-input XOR gates.&quot;</p> <p>Then, just as things couldn't get worse, everyone got to discover that when Synopsys's Design Compiler runs for the first time in a new account — it takes a good 10 to 15 minutes to build your very own personal DesignWare cache. Verilog contestant Ed Paluch, a consultant, noted: &quot;I thought that first synthesis run building [expletive deleted] DesignWare caches would <em>never</em> end! It felt like days!&quot;</p> <p>Although, in my opinion, none of these headaches compromised the integrity of the contest, at the time I had to continually remind myself: &quot;To keep things equal, I can <em>not</em> fix nor improve <em>anything</em> no matter how frustrating.&quot;</p> <h4 id="judging-the-results">Judging The Results</h4> <p>Because I didn't want to be in the business of judging source code <em>intent</em>, all judging was based solely on whether the gate level passed the previously described 18 test vectors. Once done, the design was read into the Synopsys Design Compiler and all constraints were removed. Then I applied the command &quot;clocks_at 0, 6, 12 clock&quot; and then took the longest path as determined by &quot;report_timing -path full -delay max -max_paths 12&quot; as the final basis for comparing designs — determining that Verilog designer Larry Fiedler of NVidia won with a 1147 gate design timed at 3.90 nsec.</p> <pre>
  reg [9:0] cnt_up, cnt_dn;
  reg [8:0] count_nxt;

  always @(posedge clock)
  begin
     cnt_dn = count_out - 3'b 101;    // synopsys label add_dn
     cnt_up = count_out + 2'b 11;     // synopsys label add_up

     case ({up,down})
        2'b 00  : count_nxt = data_in;
        2'b 01  : count_nxt = cnt_dn;
        2'b 10  : count_nxt = cnt_up;
        2'b 11  : count_nxt = 9'bX;   // SPEC NOT MET HERE!!!
        default : count_nxt = 9'bX;   // avoiding ambiguity traps
     endcase

     parity_out <= ^count_nxt;
     carry_out  <= up & cnt_up[9];
     borrow_out <= down & cnt_dn[9];
     count_out  <= count_nxt;
  end

  Fig. 3) The winning Verilog source code. (Note that it failed to meet the
          spec of holding its state when UP and DOWN were both high.) 
</pre> <p>Since judging was open to any and all who wanted to be there, Kurt Baty, a Verilog contestant and well respected design consultant, registered a vocal double surprise because he knew his design was of comparable speed but had failed to pass the 18 test vectors. (Kurt's a good friend — I really enjoyed harassing him over this discovery — especially since he had bragged to so many people on how he was going to win this contest!) An on the spot investigation yielded that Kurt had accidentally saved the wrong design in the final minute of the contest. Even further investigation then also yielded that the 18 test vectors didn't cover exactly all the counter's specified conditions. Larry's &quot;winning&quot; gate level Verilog based design had failed to meet the spec of holding its state when UP and DOWN were high — even though his design had successfully passed the 18 test vectors!</p> <p>If human visual inspection of the Verilog/VHDL source code to subjectively check for places where the test vectors might have missed was part of the judging criteria, Verilog designer Steve Golson would have won. Once again, I had to reiterate that all designs which passed the testbench vectors were considered &quot;functionally correct&quot; by definition.</p> <h4 id="what-the-contestants-thought">What The Contestants Thought</h4> <p>Despite NASA VHDL designer Jeff Solomon's &quot;I didn't like the idea of taking the traditional concept of counters and warping it to make a contest design problem&quot;, the remaining twelve contestants really liked the architectural flexibility of the up-by-3/down-by-5, 9-bit, loadable, synchronous counter with even parity, carry and borrow. Verilog designer Mark Papamarcos summed up the majority opinion with: &quot;I think that the problem was pretty well devised. There was a potential resource sharing problem, some opportunities to schedule some logic to evaluate concurrently with other logic, etc. When I first saw it, I thought it would be very easy to implement and I would have lots of time to tune. I also noticed the 2 and 3-input XOR's in the top-level directory, figured that it might be somehow relevant, but quickly dismissed any clever ideas when I ran into problems getting the vectors to match.&quot;</p> <p>Eleven of the contestants were tempted by the apparent correlation between known parity and the adding/subtracting of odd numbers. Only one Verilog designer, Oren Rubinstein of Hewlett-Packard Canada, committed to this strategy but ran way out of time. Once home, Kurt Baty helped Oren conceptually finish his design while Prasad Paranjpe helped with the final synthesis. It took about 7 hours brain time and 8 hours coding/sim/synth time (15 hours total) to get a final design of 3.05 nsec &amp; 1988 gates. Observing that it took 10x the original estimated 1.5 hours to get a 22% improvement in speed, Oren commented: &quot;Like real life, it's impossible to create accurate engineering design schedules.&quot;</p> <p>Two of the VHDL designers, Prasad Paranjpe of LSI Logic and Jan Decaluwe of Easics, both complained of having to deal with type conversions in VHDL. Prasad confessed: &quot;I can't believe I got caught on a simple typing error. I used IEEE std_logic_arith, which requires use of unsigned &amp; signed subtypes, instead of std_logic_unsigned.&quot; Jan agreed and added: &quot;I ran into a problem with VHDL or VSS (I'm still not sure.) 
This case statement doesn't analyze: &quot;subtype two_bits is unsigned(1 downto 0); case two_bits'(up &amp; down)...&quot; But what worked was: &quot;case two_bits'(up, down)...&quot; Finally I solved this problem by assigning the concatenation first to an auxiliary variable.&quot;</p> <p>Verilog competitor Steve Golson outlined the first-get-a-working-design-and-then-tweak-it-in-synthesis strategy that most of the Verilog contestants pursued with: &quot;As I recall I had some stupid typos which held me up; also I had difficulty with parity and carry/borrow. Once I had a correctly functioning baseline design, I began modifying it for optimal synthesis. My basic idea was to split the design into four separate modules: the adder, the 4:1 MUXes, the XOR logic (parity and carry/borrow), and the top counter module which contains only the flops and instances of the other three modules. My strategy was to first compile the three (purely combinational) submodules individually. I used a simple &quot;max_delay 0 all_outputs()&quot; constraint on each of them. The top-level module got the proper clock constraint. Then &quot;dont_touch&quot; these designs, and compile the top counter module (this just builds the flops). Then to clean up I did an &quot;ungroup -all&quot; followed by a &quot;compile -incremental&quot; (which shaved almost 1 nsec off my critical path.)&quot;</p> <p>Typos and panic hurt the performance of a lot of contestants. Verilog designer Daryoosh Khalilollahi of National Semiconductor said: &quot;I thought I would not be able to finish it on time, but I just made it. I lost some time because I would get a Verilog syntax error that turned up because I had one extra file in my Verilog &quot;include&quot; file (verilog -f include) which was not needed.&quot; Also, Verilog designer Howard Landman of HaL Computer never realized he had put both a complete behavioral <em>and</em> a complete hand-instanced parity tree in his source Verilog. (Synopsys Design Compiler just optimized one of Howard's dual parity trees away!)</p> <p>On average, each Verilog designer managed to get two to five synthesis runs completed before running out of time. Only two VHDL designers, Jeff Solomon and Jan Decaluwe, managed to start (but not complete) one synthesis run. In both cases I disqualified them from the contest for not making the deadline but let their synthesis runs attempt to finish. Jan arrived a little late so we gave Jan's run some added time before disqualifying him. His unfinished run had to be killed after 21 minutes because another group of contestants were arriving. (Incidentally, I had accidentally given the third session an extra 6 design minutes because of a goof on my part. No Verilog designers were in this session but VHDL designers Jeff Solomon, Prasad Paranjpe, Vikram Shrivastava plus Ravi Srinivasan of Texas Instruments all benefited from this mistake.) Since Jeff was in the last session, I gave him all the time needed for his run to complete. After an additional 17 minutes (total) he produced a gate level design that timed out at 15.52 nsec. After a total of 28 more minutes he got the timing down to 4.46 nsec but his design didn't pass functional vectors. 
He had an error somewhere in his VHDL source code.</p> <p>Failed Verilog designer Kurt Baty closed with: &quot;John, I look forward to next year's design contest in whatever form or flavor it takes, and a chance to redeem my honor.&quot;</p> <h4 id="closing-arguments-to-the-jury">Closing Arguments To The Jury</h4> <p>Closing arguments the VHDL bigots might make in this trial would be: &quot;What 14 engineers do isn't statistically significant. Even the guy who ran this design contest admitted all sorts of last minute goofs with it. You had a workstation crash, no manuals &amp; misleading LSI databooks. The test vectors were incomplete. One key VHDL designer ran into a Synopsys VHDL simulator bug after arriving late to his session. The Verilog design which won this contest didn't even meet the spec completely! In addition, this contest wasn't put together to be a referendum on whether Verilog or VHDL is the better language to design in — hence it may miss some major issues.&quot;</p> <p>The Verilog bigots might close with: &quot;No engineers work under the contrived conditions one may want for an ideal comparison of Verilog &amp; VHDL. Fourteen engineers may or may not be statistically significant, but where there's smoke, there's fire. I saw all the classical problems engineers encounter in day to day designing here. We've all dealt with workstation crashes, bad revision control, bugs in tools, poor planning and incomplete testing. It's because of these realities I think this design contest was <em>perfect</em> to determine how each HDL measures up in real life. And Verilog won hands down!&quot;</p> <p>The jury's verdict will be seen in the next &quot;Integrated System Design&quot;.</p> <h4 id="you-the-jury">You The Jury...</h4> <p>You the jury are now asked to please take ten minutes to think about what you have just read and, in 150 words or less, send your thoughts to me at &quot;jcooley@world.std.com&quot;. Please don't send me &quot;VHDL sucks.&quot; or &quot;Verilog must die!!!&quot; — but personal experiences and/or observations that add to the discussion. It's OK to have strong/violent opinions, just back them with something more than hot air. (Since I don't want to be in the business of chasing down permissions, my default setting is <em>whatever</em> you send me is completely publishable. If you wish to send me letters with a mix of publishable and non-publishable material CLEARLY indicate which is which.) I will not only be reprinting the replies, I'll also be publishing stats on how many people reported each type of specific opinion/experience.</p> <p><i>John Cooley<br> Part Time EDA Consumer Advocate<br> Full Time ASIC, FPGA &amp; EDA Design Consultant</i></p> <p>P.S. In replying, please indicate your job, your company, whether you use Verilog or VHDL, why, and for how long. Also, please DO NOT copy this article back to me — I know why you're replying! :^)</p> Data-driven bug finding bugalytics/ Sun, 06 Apr 2014 00:00:00 +0000 bugalytics/ <p><a href="everything-is-broken/">I can't remember the last time I went a whole day without running into a software bug</a>. For weeks, I couldn't invite anyone to Facebook events due to a bug that caused the invite button to not display on the invite screen. Google Maps has been giving me illegal and sometimes impossible directions ever since I moved to a small city. And Google Docs regularly hangs when I paste an image in, giving me a busy icon until I delete the image.</p> <p>It's understandable that bugs escape testing. 
Testing is hard. Integration testing is harder. End-to-end testing is even harder. But there's an easier way. A third of bugs like this – bugs I run into daily – could be found automatically using analytics.</p> <p></p> <p>If you think finding bugs with analytics sounds odd, ask a hardware person about performance counters. Whether or not they're user accessible, every ASIC has analytics to allow designers to figure out what changes need to be made for the next generation chip. Because people look at perf counters anyway, they notice when a <a href="http://acg.cis.upenn.edu/papers/micro05_storeq.pdf">forwarding path</a> never gets used, when <a href="http://infoscience.epfl.ch/record/135571/files/micro01-way.pdf">way prediction</a> has a strange distribution, or when the prefetch buffer never fills up. <a href="discontinuities/">Unexpected distributions in analytics</a> are a sign of a misunderstanding, which is often a sign of a bug<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>.</p> <p>Facebook logs all user actions. That can be used to determine user dead ends. Google Maps reroutes after “wrong” turns. That can be used to determine when the wrong turns are the result of bad directions. Google Docs could track all undos<sup class="footnote-ref" id="fnref:U"><a rel="footnote" href="#fn:U">2</a></sup>. That could be used to determine when users run into misfeatures or bugs<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>.</p> 
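<p>To make that concrete, here's a minimal sketch of the kind of log query I have in mind, written in Python over a hypothetical event stream. None of these products is implied to actually work this way; the event names, window, and thresholds are all invented for illustration.</p> <pre><code># Hypothetical sketch: flag actions that users "back out of" suspiciously often.
# The log format, event names, and thresholds are invented for illustration.
from collections import defaultdict

UNDO_LIKE = {'undo', 'delete_last_paste', 'session_disconnect'}
WINDOW_SECONDS = 10    # how soon the backout has to follow the action
MIN_SAMPLES = 1000     # ignore actions we rarely see
ALERT_RATE = 0.20      # backout rate that warrants a human looking at it

def suspicious_actions(events):
    # events: (user_id, unix_timestamp, action) tuples, sorted by user then time
    backouts = defaultdict(int)  # action -> times it was immediately backed out of
    totals = defaultdict(int)    # action -> times it had any follow-up event at all
    prev = None                  # previous event, only compared within one user
    for user, ts, action in events:
        if prev is not None and prev[0] == user and prev[2] not in UNDO_LIKE:
            totals[prev[2]] += 1
            if action in UNDO_LIKE and ts - prev[1] <= WINDOW_SECONDS:
                backouts[prev[2]] += 1
        prev = (user, ts, action)
    flagged = []
    for action, n in totals.items():
        rate = backouts[action] / n
        if n >= MIN_SAMPLES and rate >= ALERT_RATE:
            flagged.append((rate, action))
    return sorted(flagged, reverse=True)
</code></pre> <p>If something like &quot;paste_image&quot; floats to the top of that list, that's the software equivalent of a perf counter with a distribution that doesn't make sense: not proof of a bug, but a strong hint about where to look.</p> 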
Once in a blue moon, I'm pleasantly surprised to find that a software project uses <a href="http://hackage.haskell.org/package/QuickCheck">a test framework</a> which has 1% of the functionality that was standard a decade ago in chip designs.</p> <p>Considering the relative cost of <a href="//danluu.com/testing/">hardware bugs vs. software bugs</a>, it's not too surprising that a lot more effort goes into hardware testing. But here's a case where there's almost no extra effort. You've already got analytics measuring the conversion rate through all sorts of user funnels. The only new idea here is that clicking on an ad or making a purchase isn't the only type of conversion you should measure. Following directions at an intersection is a conversion, not deleting an image immediately after pasting it is a conversion, and using a modal dialogue box after opening it up is a conversion.</p> <p>Of course, whether it's ad click conversion rates or cache hit rates, blindly optimizing a single number will get you into a local optimum that will hurt you in the long run, and setting thresholds for conversion rates that should send you an alert is nontrivial. There's a combinatorially large space of user actions, so it takes judicious use of machine learning to figure out reasonable thresholds. That's going to cost time and effort. But think of all the effort you put into optimizing clicks. You probably figured out, years ago, <a href="http://www.kalzumeus.com/2009/03/07/how-to-successfully-compete-with-open-source-software/">that replacing boring text with giant pancake buttons gives you 3x the clickthrough rate</a>; you're now down to optimizing 1% here and 2% there. That's great, and it's a sign that you've captured all the low hanging fruit. But what do you think the future clickthrough rate is when a user encounters a show-stopping bug that prevents any forward progress on a modal dialogue box?</p> <p>If this sounds like an awful lot of work, find a known bug that you've fixed, and grep your log data for users who ran into that bug. Alienating those users by providing a profoundly broken product is doing a lot more damage to your clickthrough rate than a hard-to-find checkout button, and the exact same process that led you to that gigantic checkout button can solve your other problem, too. Everyone knows that adding 200ms of load time can cause 20% of users to close the window. What do you think the effect of exposing them to a bug that takes 5,000ms of user interaction to fix is?</p> <p>If that's worth fixing, pull out <a href="https://github.com/twitter/scalding">scalding</a>, <a href="http://research.google.com/pubs/pub36632.html">dremel</a>, <a href="http://cascalog.org/">cascalog</a>, or whatever your favorite data processing tool is. Start looking for user actions that don't make sense. Start looking for bugs.</p> <p><small> Thanks to Pablo Torres for catching a typo in this post </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">It's not that all chip design teams do this systematically (although they should), but that people are looking at the numbers anyway, and will see anomalies. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:U">Undos aren't just literal undos; pasting an image in and then deleting it afterwards because it shows a busy icon forever counts, too. <a class="footnote-return" href="#fnref:U"><sup>[return]</sup></a></li> <li id="fn:1"><p>This is worse than it sounds.
In addition to producing a busy icon forever in the doc, it disconnects that session from the server, which is another thing that could be detected: it's awfully suspicious if a certain user action is always followed by a disconnection.</p> <p>Moreover, both of these failure modes could have been found with fuzzing, since they should never happen. Bugs are hard enough to find that defense in depth is the only reasonable solution.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">if you talk to a hardware person, call this verification instead of testing, or they'll think you're talking about DFT, testing silicon for manufacturing defects, or some other weird thing with no software analogue. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> Editing binaries edit-binary/ Sun, 23 Mar 2014 00:00:00 +0000 edit-binary/ <p>Editing binaries is a trick that comes in handy a few times a year. You don't often need to, but when you do, there's no alternative. When I mention patching binaries, I get one of two reactions: complete shock or no reaction at all. As far as I can tell, this is because most people have one of these two models of the world:</p> <blockquote> <ol> <li><p>There exists source code. Compilers do something to source code to make it runnable. If you change the source code, different things happen.</p></li> <li><p>There exists a processor. The processor takes some bits and decodes them to make things happen. If you change the bits, different things happen.</p></li> </ol> </blockquote> <p>If you have the first view, breaking out a hex editor to modify a program is the action of a deranged lunatic. If you have the second view, editing binaries is the most natural thing in the world. Why wouldn't you just edit the binary? It's often the easiest way to get what you need.</p> <p></p> <p>For instance, you're forced to do this all the time if you use a <a href="http://forum.osdev.org/viewtopic.php?f=1&amp;t=26351">non-Intel non-AMD x86 processor</a>. Instead of checking <a href="http://www.sandpile.org/x86/cpuid.htm">CPUID feature flags, programs will check the CPUID family, model, and stepping</a> to determine features, which results in incorrect behavior on non-standard CPUs. Sometimes you have to do an edit to get the program to use the latest SSE instructions and sometimes you have to do an edit to get the program to run at all. You can try filing a bug, but it's much <a href="https://code.google.com/p/nativeclient/issues/detail?id=2508">easier to just edit your binaries</a>.</p> <p>Even if you're running on a mainstream Intel CPU, these tricks are useful when you run into <a href="http://blogs.msdn.com/b/oldnewthing/archive/2012/11/13/10367904.aspx?Redirected=true">bugs in closed sourced software</a>. And then there are emergencies.</p> <p>The other day, a DevOps friend of mine at a mid-sized startup told me about the time they released an internal alpha build externally, which caused their auto-update mechanism to replace everyone's working binary with a buggy experimental version. It only took a minute to figure out what happened. Updates gradually roll out to all users over a couple days, which meant that the bad version had only spread to <code>1 / (60*24*2) = 0.03%</code> of all users. But they couldn't push the old version into the auto-updater because the client only accepts updates from higher numbered versions. 
They had to go through the entire build and release process (an hour long endeavor) just to release a version that was identical to their last good version. If it had occurred to anyone to edit the binary to increment the version number, they could have pushed out a good update in a minute instead of an hour, which would have kept the issue from spreading to more than <code>0.06%</code> of their users, instead of sending <code>2%</code> of their users a broken update<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">1</a></sup>.</p> <p>This isn't nearly as hard as it sounds. Let's try an example. If you're going to do this sort of thing regularly, you probably want to use a real disassembler like <a href="https://hex-rays.com/products/ida/support/download_freeware.shtml">IDA</a><sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup>. But, you can get by with simple tools if you only need to do this every once in a while. I happen to be on a Mac that I don't use for development, so I'm going to use <a href="http://lldb.llvm.org/">lldb</a> for disassembly and <a href="http://ridiculousfish.com/hexfiend/">HexFiend</a> to edit this example. Gdb, otool, and objdump also work fine for quick and dirty disassembly.</p> <p>Here's a toy code snippet, <code>wat-arg.c</code>, that should be easy to binary edit:</p> <pre><code class="language-c">#include &lt;stdio.h&gt; int main(int argc, char **argv) { if (argc &gt; 1) { printf(&quot;got an arg\n&quot;); } else { printf(&quot;no args\n&quot;); } } </code></pre> <p>If we compile this and then launch lldb on the binary and step into main, we can see the following machine code:</p> <pre><code>$ lldb wat-arg (lldb) breakpoint set -n main Breakpoint 1: where = original`main, address = 0x0000000100000ee0 (lldb) run (lldb) disas -b -p -c 20 ; address hex opcode disassembly -&gt; 0x100000ee0: 55 pushq %rbp 0x100000ee1: 48 89 e5 movq %rsp, %rbp 0x100000ee4: 48 83 ec 20 subq $32, %rsp 0x100000ee8: c7 45 fc 00 00 00 00 movl $0, -4(%rbp) 0x100000eef: 89 7d f8 movl %edi, -8(%rbp) 0x100000ef2: 48 89 75 f0 movq %rsi, -16(%rbp) 0x100000ef6: 81 7d f8 01 00 00 00 cmpl $1, -8(%rbp) 0x100000efd: 0f 8e 16 00 00 00 jle 0x100000f19 ; main + 57 0x100000f03: 48 8d 3d 4c 00 00 00 leaq 76(%rip), %rdi ; &quot;got an arg\n&quot; 0x100000f0a: b0 00 movb $0, %al 0x100000f0c: e8 23 00 00 00 callq 0x100000f34 ; symbol stub for: printf 0x100000f11: 89 45 ec movl %eax, -20(%rbp) 0x100000f14: e9 11 00 00 00 jmpq 0x100000f2a ; main + 74 0x100000f19: 48 8d 3d 42 00 00 00 leaq 66(%rip), %rdi ; &quot;no args\n&quot; 0x100000f20: b0 00 movb $0, %al 0x100000f22: e8 0d 00 00 00 callq 0x100000f34 ; symbol stub for: printf </code></pre> <p>As expected, we load a value, compare it to 1 with <code>cmpl $1, -8(%rbp)</code>, and then print <code>got an arg</code> or <code>no args</code> depending on which way we jump as a result of the compare.</p> <pre><code>$ ./wat-arg no args $ ./wat-arg 1 got an arg </code></pre> <p>If we open up a hex editor and change <code>81 7d f8 01 00 00 00; cmpl 1, -8(%rbp)</code> to <code>81 7d f8 06 00 00 00; cmpl 6, -8(%rbp)</code>, that should cause the program to check for 6 args instead of 1</p> <p><img src="images/edit-binary/edit-binary.png" alt="Replace cmpl with cmpl 6"></p> <pre><code>$ ./wat-arg no args $ ./wat-arg 1 no args $ ./wat-arg 1 2 no args $ ./wat-arg 1 2 3 4 5 6 7 8 got an arg </code></pre> <p>Simple! 
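If you need to make the same edit on a fleet of machines, or you'd rather script it than click around in a hex editor, the whole operation is just a seek and a write. Here's a sketch of a tiny (hypothetical) patch tool; note that it takes a file offset, which is not the same thing as the virtual addresses shown in the disassembly above:</p> <pre><code class="language-c">/* Overwrite one byte of a file at a given offset. Finding the right
   offset is still up to you (objdump, otool, or a hex editor search
   for the surrounding bytes all work). */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, &quot;usage: %s binary offset byte\n&quot;, argv[0]);
        return 1;
    }
    long offset = strtol(argv[2], NULL, 0);           /* accepts 0x... */
    int byte = (int)strtol(argv[3], NULL, 0) &amp; 0xff;

    FILE *f = fopen(argv[1], &quot;r+b&quot;);
    if (!f) { perror(&quot;fopen&quot;); return 1; }
    if (fseek(f, offset, SEEK_SET) != 0) { perror(&quot;fseek&quot;); return 1; }
    if (fputc(byte, f) == EOF) { perror(&quot;fputc&quot;); return 1; }
    fclose(f);
    printf(&quot;wrote 0x%02x at offset 0x%lx\n&quot;, byte, offset);
    return 0;
}
</code></pre> <p>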
If you do this a bit more, you'll soon get in the habit of patching in <code>90</code><sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup> to overwrite things with NOPs. For example, if we replace <code>0f 8e 16 00 00 00; jle</code> and <code>e9 11 00 00 00; jmpq</code> with <code>90</code>, we get the following:</p> <pre><code> 0x100000ee1: 48 89 e5 movq %rsp, %rbp 0x100000ee4: 48 83 ec 20 subq $32, %rsp 0x100000ee8: c7 45 fc 00 00 00 00 movl $0, -4(%rbp) 0x100000eef: 89 7d f8 movl %edi, -8(%rbp) 0x100000ef2: 48 89 75 f0 movq %rsi, -16(%rbp) 0x100000ef6: 81 7d f8 01 00 00 00 cmpl $1, -8(%rbp) 0x100000efd: 90 nop 0x100000efe: 90 nop 0x100000eff: 90 nop 0x100000f00: 90 nop 0x100000f01: 90 nop 0x100000f02: 90 nop 0x100000f03: 48 8d 3d 4c 00 00 00 leaq 76(%rip), %rdi ; &quot;got an arg\n&quot; 0x100000f0a: b0 00 movb $0, %al 0x100000f0c: e8 23 00 00 00 callq 0x100000f34 ; symbol stub for: printf 0x100000f11: 89 45 ec movl %eax, -20(%rbp) 0x100000f14: 90 nop 0x100000f15: 90 nop 0x100000f16: 90 nop 0x100000f17: 90 nop 0x100000f18: 90 nop 0x100000f19: 48 8d 3d 42 00 00 00 leaq 66(%rip), %rdi ; &quot;no args\n&quot; 0x100000f20: b0 00 movb $0, %al 0x100000f22: e8 0d 00 00 00 callq 0x100000f34 ; symbol stub for: printf </code></pre> <p>Note that since we replaced a couple of multi-byte instructions with single byte instructions, the program now has more total instructions.</p> <pre><code>$ ./wat-arg got an arg no args </code></pre> <p>Other common tricks include patching in <code>cc</code> to redirect to an interrupt handler, <code>db</code> to cause a debug breakpoint, knowing which bit to change to flip the polarity of a compare or jump, etc. These things are all detailed in the <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">Intel architecture manuals</a>, but the easiest way to learn these is to develop the muscle memory for them one at a time.</p> <p>Have fun!</p> <div class="footnotes"> <hr /> <ol> <li id="fn:E"><p>I don't actually recommend doing this in an emergency if you haven't done it before. Pushing out a known broken binary that leaks details from future releases is bad, but pushing out an update that breaks your updater is worse. You'll want, at a minimum, a few people who create binary patches in their sleep to code review the change to make sure it looks good, even after running it on a test client.</p> <p>Another solution, not quite as &quot;good&quot;, but much less dangerous, would have been to disable the update server until the new release was ready.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:D">If you don't have $1000 to spare, <a href="https://github.com/radare/radare2">r2</a> is a nice, free, tool with IDA-like functionality. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:1">on x86 <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> That bogus gender gap article gender-gap/ Sun, 09 Mar 2014 00:00:00 +0000 gender-gap/ <p>Last week, Quartz published an article titled <a href="https://news.ycombinator.com/item?id=7334659">“There is no gender gap in tech salaries”</a>. That resulted in linkbait copycat posts all over the internet, from obscure livejournals to <a href="http://www.smithsonianmag.com/smart-news/female-computer-scientists-make-same-salary-their-male-counterparts-180949965/?no-ist">Smithsonian.com</a>. 
The claims are awfully strong, considering that the main study cited only looked at people who graduated with a B.S. exactly one year ago, not to mention the fact that the study makes literally the opposite claim.</p> <p>Let's look at the evidence from the <a href="http://www.aauw.org/files/2013/02/graduating-to-a-pay-gap-the-earnings-of-women-and-men-one-year-after-college-graduation.pdf?_ga=1.7578036.722397424.1379578621">AAUW study</a> that all these posts cite.</p> <p><img src="images/gender-gap/actual_result.png" alt="Who are you going to believe, me or your lying eyes?" width="626" height="390"></p> <p></p> <p>Looks like women make 88% of what men do in “engineering and engineering technology” and 77% of what men do in “computer and information sciences”.</p> <p>The study controls for a number of factors to try to find the source of the pay gap. It finds that after controlling for self-reported hours worked, type of employment, and quality of school, “over one-third of the pay gap cannot be explained by any of these factors and appears to be attributable to gender alone”. One-third is not zero, nor is one-third of 12% or 23%. If that sounds small, consider an average raise in the post-2008 economy and how many years of experience that one-third of 23% turns into.</p> <p>The Quartz article claims that, since the entire gap can be explained by some variables, the gap is by choice. In fact, the study explicitly calls out that view as being false, citing <a href="http://scholar.google.com/scholar_case?case=2721168065773704326">Stender v. Lucky Stores</a> and a <a href="http://ideas.repec.org/a/bpj/evoice/v4y2007i4n5.html">related study</a><sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup>, saying that “The case illustrates how discrimination can play a role in the explained portion of the pay gap when employers mistakenly assume that female employees prefer lower-paid positions traditionally held by women and --intentionally or not--place men and women into different jobs, ensuring higher pay for men and lower pay for women”. Women do not, in fact, just want lower paying jobs; this is, once again, diametrically opposed to the claims in the Quartz article.</p> <p>Note that the study selectively controls for factors that reduce the pay gap, but not for factors that increase it. For instance, the study notes that “Women earn higher grades in college, on average, than men do, so academic achievement does not help us understand the gender pay gap”. Adjusting for grades would increase the pay gap; adjusting for all possible confounding factors, not only the factors that reduce the gap, would only make the adjusted pay gap larger.</p> <p>The AAUW study isn't the only evidence the Quartz post cites. To support the conclusion that “Despite strong evidence suggesting gender pay equality, there is still a general perception that women earn less than men do”, the Quartz author cites three additional pieces of evidence. First, the BLS figure that, “when measured hourly, not annually, the pay gap between men and women is 14% not 23%”; 14% is not 0%. Second, <a href="http://www.bls.gov/cps/cpswom2012.pdf?_ga=1.211369494.241714935.1393863612">a BLS report</a> that indicates that men make more than women, cherry picking a single figure where women do better than men (“women who work between 30 and 39 hours a week … see table 4”); this claim is incorrect<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. 
Third, a study from the 80s which is directly contradicted by the AAUW report from 2012; the older study indicates that cohort effects are responsible for the gender gap, but the AAUW report shows a gender gap despite studying only a single cohort.</p> <p>The Smithsonian Mag published a correction in response to criticism about their article, but most of the misinformed articles remain uncorrected.</p> <p>It's clear that the author of the Quartz piece had an agenda in mind, picked out evidence that supported that agenda, and wrote a blog post. A number of bloggers picked up the post and used its thesis as link bait to drive hits to their sites, without reading any of the cited evidence. If this is how “digitally native news” works, I'm opting out.</p> <p><strong>If you liked reading this, you might also enjoy <a href="//danluu.com/tech-discrimination/">this post on the interaction of markets with discrimination</a>, and <a href="//danluu.com/teach-debugging/">this post, which has a very partial explanation of why so many people drop out of science and engineering</a>.</strong></p> <h3 id="updates">Updates</h3> <p><em>Update: A correction! I avoided explicitly linking to the author of the original article, because I find the sort of twitter insults and witch hunts that often pop up to be unconstructive, and this is really about</em> what's <em>right and not</em> who's <em>right. The author obviously disagrees because I saw no end of insults until I blocked the author.</em></p> <p><em>Charlie Clarke was kind enough to wade through the invective and decode the author's one specific claim about my illiteracy, which was that footnote 3 was not rounded from 110.3 to 111. It turns out that instead of rounding from 110.3 to 111, the author of the article cited the wrong source entirely and the other source just happened to have a number that was similar to 111.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B">There's plenty of good news in this study. The gender gap has gotten much smaller over the past forty years. There's room for a nuanced article that explores why things improved, and why certain aspects have improved while others have remained stubbornly stuck in the 70s. I would love to read that article. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:2">The Quartz article claims that “women who work 30 to 39 hours per week make 111% of what men make (see table 4)”. Table 4 is a breakdown of part-time workers. There is no 111% anywhere in the table, unless 110.3% is rounded to 111%; perhaps the author is referring to the racial breakdown in the table, which indicates that among Asian part-time workers, women earn 110.3% of what men do per hour. Note that Table 3, showing a breakdown of full-time workers (who are the vast majority of workers) indicates that women earn much less than men when working full time. To find a figure that supports the author's agenda, the author had to not only look at part time workers, but only look at part-time Asian women, and then round .3% up to 1%. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> That time Oracle tried to have a professor fired for benchmarking their database anon-benchmark/ Wed, 05 Mar 2014 00:00:00 +0000 anon-benchmark/ <p>In 1983, at the University of Wisconsin, Dina Bitton, David DeWitt, and Carolyn Turbyfill created a <a href="http://pages.cs.wisc.edu/~dewitt/includes/benchmarking/vldb83.pdf">database benchmarking framework</a>.
Some of their results included (lower is better):</p> <p>Join without indices</p> <style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <thead> <tr> <th>system</th> <th>joinAselB</th> <th>joinABprime</th> <th>joinCselAselB</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>10.2</td> <td>9.6</td> <td>9.4</td> </tr> <tr> <td>C-INGRES</td> <td>1.8</td> <td>2.6</td> <td>2.1</td> </tr> <tr> <td>ORACLE</td> <td>&gt; 300</td> <td>&gt; 300</td> <td>&gt; 300</td> </tr> <tr> <td>IDMnodac</td> <td>&gt; 300</td> <td>&gt; 300</td> <td>&gt; 300</td> </tr> <tr> <td>IDMdac</td> <td>&gt; 300</td> <td>&gt; 300</td> <td>&gt; 300</td> </tr> <tr> <td>DIRECT</td> <td>10.2</td> <td>9.5</td> <td>5.6</td> </tr> <tr> <td>SQL/DS</td> <td>2.2</td> <td>2.2</td> <td>2.1</td> </tr> </tbody> </table> <p>Join with indices, primary (clustered) index</p> <table> <thead> <tr> <th>system</th> <th>joinAselB</th> <th>joinABprime</th> <th>joinCselAselB</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>2.11</td> <td>1.66</td> <td>9.07</td> </tr> <tr> <td>C-INGRES</td> <td>0.9</td> <td>1.71</td> <td>1.07</td> </tr> <tr> <td>ORACLE</td> <td>7.94</td> <td>7.22</td> <td>13.78</td> </tr> <tr> <td>IDMnodac</td> <td>0.52</td> <td>0.59</td> <td>0.74</td> </tr> <tr> <td>IDMdac</td> <td>0.39</td> <td>0.46</td> <td>0.58</td> </tr> <tr> <td>DIRECT</td> <td>10.21</td> <td>9.47</td> <td>5.62</td> </tr> <tr> <td>SQL/DS</td> <td>0.92</td> <td>1.08</td> <td>1.33</td> </tr> </tbody> </table> <p>Join with indices, secondary (non-clustered) index</p> <table> <thead> <tr> <th>system</th> <th>joinAselB</th> <th>joinABprime</th> <th>joinCselAselB</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>4.49</td> <td>3.24</td> <td>10.55</td> </tr> <tr> <td>C-INGRES</td> <td>1.97</td> <td>1.80</td> <td>2.41</td> </tr> <tr> <td>ORACLE</td> <td>8.52</td> <td>9.39</td> <td>18.85</td> </tr> <tr> <td>IDMnodac</td> <td>1.41</td> <td>0.81</td> <td>1.81</td> </tr> <tr> <td>IDMdac</td> <td>1.19</td> <td>0.59</td> <td>1.47</td> </tr> <tr> <td>DIRECT</td> <td>10.21</td> <td>9.47</td> <td>5.62</td> </tr> <tr> <td>SQL/DS</td> <td>1.62</td> <td>1.4</td> <td>2.66</td> </tr> </tbody> </table> <p>Projection (duplicate tuples removed)</p> <table> <thead> <tr> <th>system</th> <th>100/10000</th> <th>1000/10000</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>64.6</td> <td>236.8</td> </tr> <tr> <td>C-INGRES</td> <td>26.4</td> <td>132.0</td> </tr> <tr> <td>ORACLE</td> <td>828.5</td> <td>199.8</td> </tr> <tr> <td>IDMnodac</td> <td>29.3</td> <td>122.2</td> </tr> <tr> <td>IDMdac</td> <td>22.3</td> <td>68.1</td> </tr> <tr> <td>DIRECT</td> <td>2068.0</td> <td>58.0</td> </tr> <tr> <td>SQL/DS</td> <td>28.8</td> <td>28.0</td> </tr> </tbody> </table> <p>Aggregate without indices</p> <table> <thead> <tr> <th>system</th> <th>MIN scalar</th> <th>MIN agg fn 100 parts</th> <th>SUM agg fn 100 parts</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>40.2</td> <td>176.7</td> <td>174.2</td> </tr> <tr> <td>C-INGRES</td> <td>34.0</td> <td>495.0</td> <td>484.4</td> </tr> <tr> <td>ORACLE</td> <td>145.8</td> <td>1449.2</td> <td>1487.5</td> </tr> <tr> <td>IDMnodac</td> <td>32.0</td> <td>65.0</td> <td>67.5</td> </tr> <tr> <td>IDMdac</td> <td>21.2</td> <td>38.2</td> <td>38.2</td> </tr> <tr> <td>DIRECT</td> <td>41.0</td> <td>227.0</td> <td>229.5</td> </tr> <tr> <td>SQL/DS</td> <td>19.8</td> <td>22.5</td> <td>23.5</td> </tr> </tbody> </table> <p>Aggregate with indices</p> <table> <thead> <tr>
<th>system</th> <th>MIN scalar</th> <th>MIN agg fn 100 parts</th> <th>SUM agg fn 100 parts</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>41.2</td> <td>186.5</td> <td>182.2</td> </tr> <tr> <td>C-INGRES</td> <td>37.2</td> <td>242.2</td> <td>254.0</td> </tr> <tr> <td>ORACLE</td> <td>160.5</td> <td>1470.2</td> <td>1446.5</td> </tr> <tr> <td>IDMnodac</td> <td>27.0</td> <td>65.0</td> <td>66.8</td> </tr> <tr> <td>IDMdac</td> <td>21.2</td> <td>38.0</td> <td>38.0</td> </tr> <tr> <td>DIRECT</td> <td>41.0</td> <td>227.0</td> <td>229.5</td> </tr> <tr> <td>SQL/DS</td> <td>8.5</td> <td>22.8</td> <td>23.8</td> </tr> </tbody> </table> <p>Selection without indices</p> <table> <thead> <tr> <th>system</th> <th>100/10000</th> <th>1000/10000</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>53.2</td> <td>64.4</td> </tr> <tr> <td>C-INGRES</td> <td>38.4</td> <td>53.9</td> </tr> <tr> <td>ORACLE</td> <td>194.2</td> <td>230.6</td> </tr> <tr> <td>IDMnodac</td> <td>31.7</td> <td>33.4</td> </tr> <tr> <td>IDMdac</td> <td>21.6</td> <td>23.6</td> </tr> <tr> <td>DIRECT</td> <td>43.0</td> <td>46.0</td> </tr> <tr> <td>SQL/DS</td> <td>15.1</td> <td>38.1</td> </tr> </tbody> </table> <p>Selection with indices</p> <table> <thead> <tr> <th>system</th> <th>100/10000 clustered</th> <th>1000/10000 clustered</th> <th>100/10000</th> <th>1000/10000</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>7.7</td> <td>27.8</td> <td>59.2</td> <td>78.9</td> </tr> <tr> <td>C-INGRES</td> <td>3.9</td> <td>18.9</td> <td>11.4</td> <td>54.3</td> </tr> <tr> <td>ORACLE</td> <td>16.3</td> <td>130.0</td> <td>17.3</td> <td>129.2</td> </tr> <tr> <td>IDMnodac</td> <td>2.0</td> <td>9.9</td> <td>3.8</td> <td>27.6</td> </tr> <tr> <td>IDMdac</td> <td>1.5</td> <td>8.7</td> <td>3.3</td> <td>23.7</td> </tr> <tr> <td>DIRECT</td> <td>43.0</td> <td>46.0</td> <td>43.0</td> <td>46.0</td> </tr> <tr> <td>SQL/DS</td> <td>3.2</td> <td>27.5</td> <td>12.3</td> <td>39.2</td> </tr> </tbody> </table> <p>In case you're not familiar with the database universe as of 1983: at the time, <code>INGRES</code> was a research project by Stonebraker and Wong at Berkeley that had been commercialized. <code>C-INGRES</code> is the commercial version and <code>U-INGRES</code> is the university version. <code>IDM*</code> are the <code>IDM/500</code> database machine, the first widely used commercial database machine; <code>dac</code> is with a &quot;database accelerator&quot; and <code>nodac</code> is without. <code>DIRECT</code> was a research project in database machines that was started by DeWitt in 1977.</p> <p>In Bitton et al.'s work, Oracle's performance stood out as unusually poor.</p> <p>Larry Ellison wasn't happy with the results and it's said that he <a href="https://starcounter.com/the-story-of-professor-dewitt/">tried to have</a> <a href="http://www.citeulike.org/user/marclijour/article/6883245">DeWitt fired</a>. Given how difficult it is to fire professors even when there's actual misconduct, the probability of Ellison successfully getting someone fired for doing legitimate research in their field was pretty much zero. It's also said that, after DeWitt's non-firing, Larry banned Oracle from hiring Wisconsin grads and Oracle added a term to their EULA forbidding the publication of benchmarks. Over the years, many major commercial database vendors added a license clause that made benchmarking their database illegal.</p> <!---more---> <p>Today, Oracle hires from Wisconsin, but Oracle still forbids benchmarking of their database.
Oracle's shockingly poor performance and Larry Ellison's response have gone down in history; anti-benchmarking clauses are now often known as &quot;DeWitt Clauses&quot;, and they've spread from databases to all software, from <a href="https://news.ycombinator.com/item?id=7306121">compilers</a> to cloud offerings<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>.</p> <p>Meanwhile, Bitcoin users have created <a href="http://rt.com/news/bitcoin-assassination-market-anarchist-983/">anonymous markets for assassinations</a> -- users can put money into a pot that gets paid out to the assassin who kills a particular target.</p> <p>Anonymous assassination markets appear to be a joke, but how about anonymous markets for benchmarks? People who want to know what kind of performance a database offers under a certain workload put money into a pot that gets paid out to whoever runs the benchmark.</p> <p>With things as they are now, you often see comments and blog posts about how someone was using <code>postgres</code> until management made them switch to &quot;some commercial database&quot; which had much worse performance, and it's hard to tell if the terrible database was Oracle, MS SQL Server, or perhaps another database.</p> <p>If we look at major commercial databases today, two out of the three big names in commercial databases forbid publishing benchmarks. Microsoft's SQL Server EULA says:</p> <blockquote> <p>You may not disclose the results of any benchmark test ... without Microsoft’s prior written approval</p> </blockquote> <p>Oracle says:</p> <blockquote> <p>You may not disclose results of any Program benchmark tests without Oracle’s prior consent</p> </blockquote> <p>IBM is notable for actually allowing benchmarks:</p> <blockquote> <p>Licensee may disclose the results of any benchmark test of the Program or its subcomponents to any third party provided that Licensee (A) publicly discloses the complete methodology used in the benchmark test (for example, hardware and software setup, installation procedure and configuration files), (B) performs Licensee's benchmark testing running the Program in its Specified Operating Environment using the latest applicable updates, patches and fixes available for the Program from IBM or third parties that provide IBM products (&quot;Third Parties&quot;), and (C) follows any and all performance tuning and &quot;best practices&quot; guidance available in the Program's documentation and on IBM's support web sites for the Program...</p> </blockquote> <p>This gives people ammunition for a meta-argument that IBM probably delivers better performance than either Oracle or Microsoft, since they're the only company that's not scared of people publishing benchmark results, but it would be nice if we had actual numbers.</p> <p><i>Thanks to Leah Hanson and Nathan Wailes for comments/corrections/discussion.</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>There's at least one cloud service that disallows not only publishing benchmarks, but even &quot;competitive benchmarking&quot;, running benchmarks to see how well the competition does.
As a result, there's a product I'm told I shouldn't use to avoid even the appearance of impropriety because I work in an office with people who work on cloud-related infrastructure.</p> <p>An example of a clause like this is the following term in the Salesforce agreement:</p> <blockquote> <p>You may not access the Services for purposes of monitoring their availability, performance or functionality, or for any other benchmarking or competitive purposes.</p> </blockquote> <p>If you ever wondered why uptime &quot;benchmarking&quot; services like cloudharmony don't include Salesforce, this is probably why. You will sometimes see speculation that Salesforce and other companies with these terms know that their service is so poor that public benchmarks would hurt them more than being known to be afraid of public benchmarks does.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Why don't schools teach debugging? teach-debugging/ Sat, 08 Feb 2014 00:00:00 +0000 teach-debugging/ <p>In the fall of 2000, I took my first engineering class: <a href="http://courses.engr.wisc.edu/ece/ece352.html">ECE 352</a>, an entry-level digital design class for first-year computer engineers. It was standing room only, filled with waitlisted students who would find seats later in the semester as people dropped out. We had been warned in orientation that half of us wouldn't survive the year. In class, we were warned again that half of us were doomed to fail, and that ECE 352 was the weed-out class that would be responsible for much of the damage.</p> <p>The class moved briskly. The first lecture wasted little time on matters of the syllabus, quickly diving into the real course material. Subsequent lectures built on previous lectures; anyone who couldn't grasp one had no chance at the next. Projects began after two weeks, and also built upon their predecessors; anyone who didn't finish one had no hope of doing the next.</p> <p>A friend of mine and I couldn't understand why some people were having so much trouble; the material seemed like common sense. The <a href="http://c2.com/cgi/wiki?FeynmanAlgorithm">Feynman Method</a> was the only tool we needed.</p> <blockquote> <ol> <li>Write down the problem</li> <li>Think real hard</li> <li>Write down the solution</li> </ol> </blockquote> <p>The Feynman Method failed us on the last project: the design of a <a href="http://i.stanford.edu/pub/cstr/reports/csl/tr/95/675/CSL-TR-95-675.pdf">divider</a>, a real-world-scale project an order of magnitude more complex than anything we'd been asked to tackle before. On the day he assigned the project, the professor exhorted us to begin early. Over the next few weeks, we heard rumors that some of our classmates worked day and night without making progress.</p> <p>But until 6pm the night before the project was due, my friend and I ignored all this evidence. It didn't surprise us that people were struggling because half the class had trouble with all of the assignments. We were in the half that breezed through everything. We thought we'd start the evening before the deadline and finish up in time for dinner.</p> <p>We were wrong.</p> <p>An hour after we thought we'd be done, we'd barely started; neither of us had a working design. Our failures were different enough that we couldn't productively compare notes.
The lab, packed with people who had been laboring for weeks alongside those of us who waited until the last minute, was full of bad news: a handful of people had managed to produce a working division unit on the first try, but no one had figured out how to convert an incorrect design into something that could do third-grade arithmetic.</p> <p>I proceeded to apply the only tool I had: thinking really hard. That method, previously infallible, now yielded nothing but confusion because the project was too complex to visualize in its entirety. I tried thinking about the parts of the design separately, but that only revealed that the problem was in some interaction between the parts; I could see nothing wrong with each individual component. Thinking about the relationship between pieces was an exercise in frustration, a continual feeling that the solution was just out of reach, as concentrating on one part would push some other critical piece of knowledge out of my head. The following semester I would acquire enough experience in managing complexity and thinking about collections of components as black-box abstractions that I could reason about a design another order of magnitude more complicated without problems — but that was three long winter months of practice away, and this night I was at a loss for how to proceed.</p> <p>By 10pm, I was starving and out of ideas. I rounded up people for dinner, hoping to get a break from thinking about the project, but all we could talk about was how hopeless it was. How were we supposed to finish when the only approach was to flawlessly assemble thousands of parts without a single misstep? It was a tedious version of a deranged Atari game with no lives and no continues. Any mistake was fatal.</p> <p>A number of people resolved to restart from scratch; they decided to work in pairs to check each other's work. I was too stubborn to start over and too inexperienced to know what else to try. After getting back to the lab, now half empty because so many people had given up, I resumed staring at my design, as if thinking about it for a third hour would reveal some additional insight.</p> <p>It didn't. Nor did the fourth hour.</p> <p>And then, just after midnight, a number of our newfound buddies from dinner reported successes. Half of those who started from scratch had working designs. Others were despondent, because their design was still broken in some subtle, non-obvious way. As I talked with one of those students, I began poring over his design. And after a few minutes, I realized that the Feynman method wasn't the only way forward: it should be possible to systematically apply a mechanical technique repeatedly to find the source of our problems. Beneath all the abstractions, our projects consisted purely of NAND gates (woe to those who dug around our toolbox enough to uncover dynamic logic), which output a 0 only when both inputs are 1. If the correct output is 0, both inputs should be 1. If the output is, incorrectly, 1, then at least one of the inputs must incorrectly be 0. The same logic can then be applied with the opposite polarity. We did this recursively, finding the source of all the problems in both our designs in under half an hour.</p> <p>We excitedly explained our newly discovered technique to those around us, walking them through a couple steps. No one had trouble; not even people who'd struggled with every previous assignment.
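The procedure is mechanical enough to fit in a few lines of code. Here's a sketch of the idea on a made-up NAND netlist: walk backwards from a wrong output until you reach the first node whose observed value can't be explained by its inputs:</p> <pre><code class="language-c">/* Back-tracing sketch: every node is either a primary input or the
   NAND of two other nodes. Given the values a (buggy) design
   actually produced and the values we expected, walk backwards from
   a wrong output to the first node that disagrees with its own
   inputs -- that's where the design is broken. The tiny netlist
   below is invented for illustration. */
#include &lt;stdio.h&gt;

typedef struct {
    const char *name;
    int in0, in1;   /* input node indices, or -1 for a primary input */
    int expected;   /* value the node should have                    */
    int observed;   /* value the buggy design actually produced      */
} node_t;

/* c = NAND(a, b), e = NAND(c, d), f = NAND(e, e). Node e's observed
   value is wrong even though its inputs are right, modeling a
   mistake in the logic that drives e. */
static node_t nodes[] = {
    {&quot;a&quot;, -1, -1, 1, 1},
    {&quot;b&quot;, -1, -1, 1, 1},
    {&quot;c&quot;,  0,  1, 0, 0},
    {&quot;d&quot;, -1, -1, 1, 1},
    {&quot;e&quot;,  2,  3, 1, 0},
    {&quot;f&quot;,  4,  4, 0, 1},
};

static void trace(int i) {
    if (nodes[i].observed == nodes[i].expected) return;
    if (nodes[i].in0 &lt; 0) {
        printf(&quot;primary input %s is wrong: check how it's driven\n&quot;,
               nodes[i].name);
        return;
    }
    int i0 = nodes[i].in0, i1 = nodes[i].in1;
    /* What a correct NAND of the observed inputs would produce. */
    int implied = !(nodes[i0].observed &amp;&amp; nodes[i1].observed);
    int in0_ok = nodes[i0].observed == nodes[i0].expected;
    int in1_ok = nodes[i1].observed == nodes[i1].expected;
    if (nodes[i].observed != implied || (in0_ok &amp;&amp; in1_ok)) {
        printf(&quot;node %s disagrees with its inputs: the bug is here\n&quot;,
               nodes[i].name);
        return;
    }
    /* Otherwise at least one input is itself wrong; follow it. */
    if (!in0_ok) trace(i0); else trace(i1);
}

int main(void) {
    trace(5);   /* start from the wrong top-level output, f */
    return 0;
}
</code></pre> <p>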
Within an hour, the group of folks within earshot of us had finished, and we went home.</p> <p>I understand now why half the class struggled with the earlier assignments. Without an explanation of how to systematically approach problems, anyone who didn't intuitively grasp the correct solution was in for a semester of frustration. People who were, like me, above average but not great, skated through most of the class and either got lucky or wasted a huge chunk of time on the final project. I've even seen people talented enough to breeze through the entire degree without ever running into a problem too big to intuitively understand; those people have a very bad time when they run into a 10 million line codebase in the real world. <a href="http://lwn.net/2000/0824/a/esr-sharing.php3">The more talented the engineer, the more likely they are to hit a debugging wall outside of school</a>.</p> <p>What I don't understand is why schools don't teach systematic debugging. It's one of the most fundamental skills in engineering: start at the symptom of a problem and trace backwards to find the source. It takes, at most, half an hour to teach the absolute basics – and even that little bit would be enough to save a significant fraction of those who wash out and switch to non-STEM majors. Using the standard engineering class sequence of progressively more complex problems, a focus on debugging could expand to fill up to a semester, which would be enough to cover an obnoxious real-world bug: perhaps there's a system that crashes once a day when a Blu-ray DVD is repeatedly played using hardware acceleration with a specific video card while two webcams record something with significant motion, as long as an obscure benchmark from 1994 is running<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">1</a></sup>.</p> <p>This dynamic isn't unique to ECE 352, or even Wisconsin – I saw the same thing when I TA'ed <a href="https://engineering.purdue.edu/~ee202/pastexams/F13FE.pdf">EE 202</a>, a second year class on signals and systems at Purdue. The problems were FFTs and Laplace transforms instead of dividers and Boolean<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>, but the avoidance of teaching fundamental skills was the same. It was clear, from the questions students asked me in office hours, that those who were underperforming weren't struggling with the fundamental concepts in the class, but with algebra: the problems were caused by not having an intuitive understanding of, for example, the difference between <code>f(x+a)</code> and <code>f(x)+a</code>.</p> <p>When I suggested to the professor<sup class="footnote-ref" id="fnref:Z"><a rel="footnote" href="#fn:Z">3</a></sup> that he spend half an hour reviewing algebra for those students who never had the material covered cogently in high school, I was told in no uncertain terms that it would be a waste of time because some people just can't hack it in engineering. I was told that I wouldn't be so naive once the semester was done, because some people just can't hack it in engineering. I was told that helping students with remedial material was doing them no favors; they wouldn't be able to handle advanced courses anyway because some students just can't hack it in engineering.
I was told that Purdue has a loose admissions policy and that I should expect a high failure rate, because some students just can't hack it in engineering.</p> <p>I agreed that a few students might take an inordinately large amount of help, but it would be strange if people who were capable of the staggering amount of memorization required to pass first year engineering classes plus calculus without deeply understanding algebra couldn't then learn to understand the algebra they had memorized. I'm no great teacher, but I was able to get all but one of the office hour regulars up to speed over the course of the semester. An experienced teacher, even one who doesn't care much for teaching, could have easily taught the material to everyone.</p> <p>Why do we leave material out of classes and then fail students who can't figure out that material for themselves? Why do we make the first couple years of an engineering major some kind of hazing ritual, instead of simply teaching people what they need to know to be good engineers? For all the high-level talk about how we need to plug the <a href="http://www.vtcite.com/~vtcite/system/files/blickenstaff-1.pdf">leaks</a> <a href="http://www.itif.org/files/2010-refueling-innovation-exec-summ.pdf">in</a> <a href="http://digitalcommons.ilr.cornell.edu/cgi/viewcontent.cgi?article=1137&amp;context=workingpapers&amp;sei-redir=1&amp;referer=http%3A%2F%2Fscholar.google.com%2Fscholar_url%3Fhl%3Den%26q%3Dhttp%3A%2F%2Fdigitalcommons.ilr.cornell.edu%2Fcgi%2Fviewcontent.cgi%253Farticle%253D1137%2526context%253Dworkingpapers%26sa%3DX%26scisig%3DAAGBfm1oWNg44e2TO8gFdV0XAK2yRRHpfQ%26oi%3Dscholarr#search=%22http%3A%2F%2Fdigitalcommons.ilr.cornell.edu%2Fcgi%2Fviewcontent.cgi%3Farticle%3D1137%26context%3Dworkingpapers%22">our</a> <a href="http://www.tandfonline.com/doi/abs/10.1080/02783190903386553#.UvapAHddU_0">STEM</a> <a href="http://eric.ed.gov/?id=EJ789574">education</a> <a href="http://books.google.com/books?hl=en&amp;lr=&amp;id=zgsvXQ4GbfkC&amp;oi=fnd&amp;pg=PA3&amp;dq=leaky+stem+education&amp;ots=24p6ZeYtp4&amp;sig=t2HNOrp2P9BGwq44w3k7mwZjteE#v=onepage&amp;q=leaky%20stem%20education&amp;f=false">pipeline</a>, not only are we not plugging the holes, we're proud of how fast the pipeline is leaking.</p> <p><small>Thanks to Kelley Eskridge, @brcpo9, and others for comments/corrections.</p> <h3 id="elsewhere">Elsewhere</h3> <ul> <li><a href="https://blog.regehr.org/archives/849">John Regehr with four debugging book recommendations</a></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:2"><p>This is an actual CPU bug I saw that took about a month to track down. And this is the easy form of the bug, with a set of ingredients that causes the fail to be reproduced about once a day - the original form of the bug only failed once every few days. I'm not picking this example because it's particularly hard, either: I can think of plenty of bugs that took longer to track down and had stranger symptoms, including a disastrous bug that took six months for our best debugger to understand.</p> <p>For ASIC post-silicon debug folks out there, this chip didn't have anything close to full scan, and our only method of dumping state out of the chip perturbed the state of the chip enough to make some bugs disappear. Good times. On the bright side, after dealing with non-deterministic hardware bugs with poor state visibility, software bugs seem easy. 
At worst, they're boring and tedious because debugging them is a matter of tracing things backwards to the source of the issue.</p> <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:1">A co-worker of mine told me about a time at <a href="http://en.wikipedia.org/wiki/Cray">Cray</a> when a high-level PM referred to the lack of engineering resources by saying that the project “needed more Boolean.” Ever since, I've thought of digital designers as people who consume caffeine and produce Boolean. I'm still not sure what analog magicians produce. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:Z">When I TA'd EE 202, there were two separate sections taught by two different professors. The professor who told me that students who fail just can't hack it was the professor who was more liked by students. He's affable and charismatic and people like him. Grades in his section were also lower than grades under the professor who people didn't like because he was thought to be mean. TA'ing this class taught me quite a bit: that people have no idea who's doing a good job and who's helping them, and also basic signals and systems (I took signals and systems I as an undergrad to fulfill a requirement and showed up to exams and passed them without learning any of the material, so to walk students through signals and systems II, I had to actually learn the material from both signals and systems I and II; before TA'ing the course, I told the department I hadn't taken the class and should probably TA a different class, but they didn't care, which taught me another good life lesson). <a class="footnote-return" href="#fnref:Z"><sup>[return]</sup></a></li> </ol> </div> Do programmers need math? math-bias/ Thu, 09 Jan 2014 00:00:00 +0000 math-bias/ <p>Dear <a href="http://dpb.bitbucket.org/">David</a>,</p> <p>I'm afraid my off the cuff response the other day wasn't too well thought out; when you talked about taking calc III and linear algebra, and getting resistance from one of your friends because &quot;wolfram alpha can do all of that now,&quot; my first reaction was horror-- which is why I replied that while I've often regretted not taking a class seriously because I've later found myself in a situation where I could have put the skills to good use, I've never said to myself &quot;<a href="http://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html">what a waste of time it was to learn that fundamental mathematical concept and use it enough that I truly understand it</a>.&quot;</p> <p></p> <p>But could this be selection bias? It's easier to recall the math that I use than the math I don't. To check, let's look at the nine math classes I took as an undergrad. If I exclude the jobs I've had that are obviously math oriented (pure math and CS theory, plus femtosecond optics), and consider only whether I've used math skills in non-math-oriented work, here's what I find: three classes whose material I've used daily for months or years on end (Calc I/II, Linear Algebra, and Calc III); three classes that have been invaluable for short bursts (Combinatorics, Error Correcting Codes, and Computational Learning Theory); one course I would have had use for had I retained any of the relevant information when I needed it (Graduate Level Matrix Analysis); one class whose material I've only relied on once (Mathematical Economics); and only one class I can't recall directly applying to any non-math-y work (Real Analysis).
Here's how I ended up using these:</p> <p><strong>Calculus I/II<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup></strong>: critical for dealing with real physical things as well as physically inspired algorithms. Moreover, one of my most effective tricks is substituting a Taylor or Remez series (or some other approximation function) for a complicated function, where the error bounds aren't too high and great speed is required.</p> <p><strong>Linear Algebra</strong>: although I've gone years without, it's hard to imagine being able to dodge linear algebra for the rest of my career because of <a href="http://walpurgisriot.github.io/blog/2014/01/03/matrix-meditation.html">how general matrices are</a>.</p> <p><strong>Calculus III</strong>: same as Calc I/II.</p> <p><strong>Combinatorics</strong>: useful for impressing people in interviews, if nothing else. Most of my non-interview use of combinatorics comes from seeing simplifications of seemingly complicated problems; combines well with probability and <a href="randomize-hn">randomized algorithms</a>.</p> <p><strong>Error Correcting Codes</strong>: there's no substitute when you need <a href="http://en.wikipedia.org/wiki/Error-correcting_code">ECC</a>. More generally, information theory is invaluable.</p> <p><strong>Graduate Level Matrix Analysis</strong>: had a decade long gap between learning this and working on something where the knowledge would be applicable. Still worthwhile, though, for the same reason Linear Algebra is important.</p> <p><strong>Real Analysis</strong>: can't recall any direct applications, although this material is useful for understanding topology and measure theory.</p> <p><strong>Computational Learning Theory</strong>: useful for making the parts of machine learning people think are scary quite easy, and for providing an intuition for areas of ML that are more alchemy than engineering.</p> <p><strong>Mathematical Economics</strong>: <a href="http://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a> have come in handy sometimes, but more for engineering than programming.</p> <p>Seven out of nine. Not bad. So I'm not sure how to reconcile my experience with the common sentiment that, outside of a handful of esoteric areas like computer graphics and machine learning, there is <a href="https://news.ycombinator.com/item?id=6949849">no</a> <a href="https://news.ycombinator.com/item?id=519928">need</a> <a href="http://www.catb.org/~esr/faqs/hacker-howto.html#mathematics">to</a> <a href="https://news.ycombinator.com/item?id=3040286">understand</a> <a href="https://news.ycombinator.com/item?id=571090">textbook</a> <a href="http://www.reddit.com/r/programming/comments/1250eg/how_to_crack_the_toughest_coding_interviews_by/c6sb39l">algorithms</a>, let alone more abstract concepts like math.</p> <p>Part of it is selection bias in the jobs I've landed; companies that do math-y work are more likely to talk to me. A couple weeks ago, I had a long discussion with a group of our old <a href="https://www.hackerschool.com/">Hacker School</a> friends, who now do a lot of recruiting at career fairs; a couple of them, whose companies don't operate at the intersection of research and engineering, mentioned that they politely try to end the discussion when they run into someone like me because they know that I won't take a job with them<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">2</a></sup>.</p> <p>But it can't all be selection bias. 
I've gotten a lot of mileage out of math even in jobs that are not at all mathematical in nature. Even in low-level systems work that's as far removed from math as you can get, it's not uncommon to find a simple combinatorial proof to show that a solution that seems too stupid to be correct is actually optimal, or correct with high probability; even when doing work that's far outside the realm of numerical methods, it sometimes happens that the bottleneck is a function that can be more quickly computed using some freshman level approximation technique like a Taylor expansion or Newton's method.</p> <p>Looking back at my career, I've gotten more bang for the buck from understanding algorithms and computer architecture than from understanding math, but I really enjoy math and I'm glad that knowing a bit of it has biased my career towards more mathematical jobs, and handed me some mathematical interludes in profoundly non-mathematical jobs.</p> <p>All things considered, my real position is a bit more relaxed than I thought: if you enjoy math, taking more classes for the pure joy of solving problems is worthwhile, but math classes aren't the best use of your time if your main goal is to transition from an academic career to programming.</p> <p><br><br> Cheers,<br> Dan</p> <p><a href="http://itmozg.ru/news/1232/">Russian translation available here</a></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">A brilliant but mad lecturer crammed both semesters of the theorem/proof-oriented <a href="http://www.amazon.com/gp/product/0471000051/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0471000051&amp;linkCode=as2&amp;tag=abroaview-20">Apostol text</a> into two months and then started lecturing about complex analysis when we ran out of book. I didn't realize that math is fun until I took this class. This footnote really ought to be on the class name, but rdiscount doesn't let you put a footnote on or in bolded text. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:3">This is totally untrue, by the way. It would be super neat to see what a product oriented role is like. As it is now, I'm five teams removed from any actual customer. Oh well. I'm one step closer than I was in my last job. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> </ol> </div> Data alignment and caches 3c-conflict/ Thu, 02 Jan 2014 00:00:00 +0000 3c-conflict/ <p>Here's the graph of <a href="https://github.com/danluu/setjmp-longjmp-ucontext-snippets/tree/master/benchmark/cache-conflict">a toy benchmark</a><sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup> of page-aligned vs. mis-aligned accesses; it shows a ratio of performance between the two at different working set sizes.
If this benchmark seems contrived, it actually comes from a real world example of the <a href="https://groups.google.com/forum/#!msg/comp.arch/_uecSnSEQc4/mvfRnOvIyzUJ">disastrous performance implications of using nice power of 2 alignment, or page alignment in an actual system</a><sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>.</p> <p><img src="images/3c-conflict/sandy.png" alt="Graph of Sandy Bridge Performance" width="504" height="216"> <img src="images/3c-conflict/westmere.png" alt="Graph of Westmere Performance" width="504" height="504"></p> <p>Except for very small working sets (1-8), the unaligned version is noticeably faster than the page-aligned version, and there's a large region up to a working set size of 512 where the ratio in performance is somewhat stable, but more so on our Sandy Bridge chip than our Westmere chip.</p> <p>To understand what's going on here, we have to look at how caches organize data. By way of analogy, consider a 1,000 car parking garage that has 10,000 permits. With a direct mapped scheme (which you could call 1-way associative<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>), each of the ten permits that has the same 3 least significant digits would be assigned the same spot, i.e., permits 0618, 1618, 2618, and so on, are only allowed to park in spot 618. If you show up at your spot and someone else is in it, you kick them out and they have to drive back home. The next time they get called in to work, they have to drive all the way back to the parking garage.</p> <p>Instead, if each car's permit allows it to park in a set that has ten possible spaces, we'll call that a 10-way set associative scheme, which gives us 100 sets of ten spots. Each set is now defined by the last 2 significant digits instead of the last 3. For example, with permit 2618, you can park in any spot from the set {018, 118, 218, …, 918}. If all of them are full, you kick out one unlucky occupant and take their spot, as before.</p> <p>Let's move out of analogy land and back to our benchmark. The main differences are that there isn't just one garage-cache, but a hierarchy of them, from the L1<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">4</a></sup>, which is the smallest (and hence, fastest), to the L2 and L3. Each seat in a car corresponds to an address. On x86, each address points to a particular byte. In the Sandy Bridge chip we're running on, we've got a 32kB L1 cache with a 64-byte line size, 64 sets, and 8-way set associativity. In our analogy, a line size of 64 would correspond to a car with 64 seats. We always transfer things in 64-byte chunks and the bottom log₂(64) = 6 bits of an address refer to a particular byte offset in a cache line. The next log₂(64) = 6 bits determine which set an address falls into<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">5</a></sup>. Each of those sets can contain 8 different things, so we have 64 sets * 8 lines/set * 64 bytes/line = 32kB. If we use the cache optimally, we can store 32,768 items. But, since we're accessing things that are page (4k) aligned, we effectively lose the bottom log₂(4k) = 12 bits, which means that every access falls into the same set, and we can only loop through 8 things before our working set is too large to fit in the L1!
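The bit arithmetic is easy to check directly; here's a sketch using the L1 parameters above (64-byte lines, 64 sets -- these numbers are specific to this chip, so adjust them for yours):</p> <pre><code class="language-c">/* Which L1 set does an address map to? With 64-byte lines we drop
   the bottom 6 bits, and with 64 sets the next 6 bits pick the set.
   Addresses that are 4kB apart differ only in bits 12 and up, so
   page-aligned objects all collide in one set; offsetting each
   object by an extra cache line spreads them across sets. */
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define LINE_BITS 6   /* log2(64-byte line) */
#define SET_BITS  6   /* log2(64 sets)      */

static unsigned l1_set(uintptr_t addr) {
    return (unsigned)((addr &gt;&gt; LINE_BITS) &amp; ((1u &lt;&lt; SET_BITS) - 1));
}

int main(void) {
    for (int i = 0; i &lt; 4; i++) {
        uintptr_t aligned = (uintptr_t)i * 4096;         /* page aligned     */
        uintptr_t skewed  = (uintptr_t)i * (4096 + 64);  /* +1 line per item */
        printf(&quot;object %d: page-aligned set %2u, skewed set %2u\n&quot;,
               i, l1_set(aligned), l1_set(skewed));
    }
    return 0;
}
</code></pre> <p>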
<p>But if we'd misaligned our data to different cache lines, we'd be able to use 8 * 64 = 512 locations effectively.</p> <p>Similarly, our chip has a 512 set L2 cache, of which 8 sets are useful for our page aligned accesses, and a 12288 set L3 cache, of which 192 sets are useful for page aligned accesses, giving us 8 sets * 8 lines / set = 64 and 192 sets * 8 lines / set = 1536 useful cache lines, respectively. For data that's misaligned by a cache line, we have an extra 6 bits of useful address, which means that our L2 cache now has 32,768 useful locations.</p> <p>In the Sandy Bridge graph above, there's a region of stable relative performance between 64 and 512, as the page-aligned version is running out of the L3 cache and the unaligned version is running out of the L1. When we pass a working set of 512, the relative ratio gets better for the aligned version because it's now an L2 access vs. an L3 access. Our graph for Westmere looks a bit different because its L3 is only 3072 sets, which means that the aligned version can only stay in the L3 up to a working set size of 384. After that, we can see the terrible performance we get from spilling into main memory, which explains why the two graphs differ in shape above 384.</p> <p>For a visualization of this, you can think of a 32 bit pointer looking like this to our L1 and L2 caches:</p> <p><code>TTTT TTTT TTTT TTTT TTTT SSSS SSXX XXXX</code></p> <p><code>TTTT TTTT TTTT TTTT TSSS SSSS SSXX XXXX</code></p> <p>The bottom 6 bits are ignored, the next bits determine which set we fall into, and the top bits are a tag that lets us know what's actually in that set. Note that page aligning things, i.e., setting the address to</p> <p><code>???? ???? ???? ???? ???? 0000 0000 0000</code></p> <p>was just done for convenience in our benchmark. Not only will aligning to any large power of 2 cause a problem, but generating addresses with a power of 2 offset from each other will cause the same problem.</p> <p>Nowadays, the importance of caches is well understood enough that, when I'm asked to look at a cache related performance bug, it's usually due to the kind of thing we just talked about: conflict misses that prevent us from using our full cache effectively<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">6</a></sup>. This isn't the only way for that to happen -- bank conflicts and false dependencies are also common problems, but I'll leave those for another blog post.</p> <h4 id="resources">Resources</h4> <p>For more on caches and memory, see <a href="http://www.akkadia.org/drepper/cpumemory.pdf">What Every Programmer Should Know About Memory</a>. For something with more breadth, see <a href="//danluu.com/new-cpu-features/">this blog post for something &quot;short&quot;</a>, or <a href="http://www.amazon.com/gp/product/1478607831/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1478607831&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=HXBNFFJ3CIXMWUZP">Modern Processor Design</a> for something book length.
For even more breadth (those two links above focus on CPUs and memory), see <a href="http://www.amazon.com/gp/product/012383872X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=012383872X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=ZHRJAO77SUQ6V7GL">Computer Architecture: A Quantitative Approach</a>, which talks about the whole system up to the datacenter level.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">The Sandy Bridge is an i7 3930K and the Westmere is a mobile i3 330M. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:M">Or anyone who aligned their data too nicely on a calculation with two source arrays and one destination when running on a chip with a 2-way associative or direct mapped cache. This is surprisingly common when you set up your arrays in some nice way in order to do cache blocking, if you're not careful. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:1">Don't call it that. People will look at you funny the same way they would if you pronounced SQL as squeal or squll. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:B">In this post, L1 refers to the l1d. Since we're only concerned with data, the l1i isn't relevant. Apologies for the sloppy use of terminology. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:L">If it seems odd that the least significant available address bits are used for the set index, that's because of the cardinal rule of computer architecture: make the common case fast -- Google Instant completes “make the common” to “make the common case fast”, “make the common case fast mips”, and “make the common case fast computer architecture”. The vast majority of accesses are close together, so moving the set index bits upwards would cause more conflict misses. You might be able to get away with a hash function that isn't simply the least significant bits, but most proposed schemes hurt about as much as they help while adding extra complexity. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:4">Cache misses are often described using the 3C model: conflict misses, which are caused by the type of aliasing we just talked about; compulsory misses, which are caused by the first access to a memory location; and capacity misses, which are caused by having a working set that's too large for a cache, even without conflict misses. Page-aligned accesses like these also make compulsory misses worse, because prefetchers won't prefetch beyond a page boundary. But if you have enough data that you're aligning things to page boundaries, you probably can't do much about that anyway. <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> </ol> </div> PCA is not a panacea linear-hammer/ Fri, 13 Dec 2013 00:00:00 +0000 linear-hammer/ <p>Earlier this year, I interviewed with a well-known tech startup, one of the hundreds of companies that claim to have harder interviews, more challenging work, and smarter employees than Google<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. My first interviewer, John, gave me the standard tour: micro-kitchen stocked with a combination of healthy snacks and candy; white male 20-somethings gathered around a foosball table; bright spaces with cutesy themes; a giant TV set up for video games; and the restroom.
Finally, he showed me a closet-sized conference room and we got down to business.</p> <p>After the usual data structures and algorithms song and dance, we moved on to the main question: how would you design a classification system for foo<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>? We had a discussion about design tradeoffs, but the key disagreement was about the algorithm. I said that, if I had to code something up in an interview, I'd use a naive matrix factorization algorithm, but that I didn't expect to get great results because not everything can be decomposed easily. John disagreed – he was adamant that PCA was the solution for any classification problem.</p> <p></p> <p>We discussed the mathematical underpinnings for twenty-five minutes – half the time allocated for the interview – and it became clear that neither of us was going to convince the other with theory. I switched gears and tried the empirical approach, referring to an old result on classifying text with LSA (which can only capture pairwise correlations between words)<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">3</a></sup> vs. deep learning<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">4</a></sup>. Here's what you get with LSA:</p> <p><img src="images/linear-hammer/PCA.png" alt="2-d LSA" width="321" height="329"></p> <p>Each color represents a different type of text, projected down to two dimensions; you might not want to reduce the dimensionality that much, but it's a good way to visualize what's going on. There's some separation between the different categories; the green dots tend to be towards the bottom right, the black dots are a lot denser in the top half of the diagram, etc. But any classification based on that is simply not going to be very good when documents are similar and the differences between them are nuanced.</p> <p>Here's what we get with a deep autoencoder:</p> <p><img src="images/linear-hammer/deep_autoencoder.png" alt="2-d deep autoencoder" width="446" height="344"></p> <p>It's not perfect, but the results are a lot better.</p> <p>Even after the example, it was clear that I wasn't going to come to an agreement with my interviewer, so I asked if we could agree to disagree and move on to the next topic. No big deal, since it was just an interview. But I see this sort of misapplication of bog standard methods outside of interviews at least once a month, usually with the conviction that all you need to do is apply this linear technique for any problem you might see.</p> <p>Engineers are the first to complain when consultants with generic business knowledge come in, charge $500/hr and dispense common sense advice while making a mess of the details. But data science is new and hot enough that people get a pass when they call themselves data scientists instead of technology consultants. I don't mean to knock data science (whatever that means), or even linear methods<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">5</a></sup>. They're useful.
But I keep seeing people try to apply the same four linear methods to every problem in sight.</p> <p>In fact, as I was writing this, my girlfriend was in the other room taking a phone interview with the data science group of a big company, where they're attempting to use multivariate regression to predict the performance of their systems and decomposing resource utilization down to the application and query level from the regression coefficient, giving you results like 4000 QPS of foobar uses 18% of the CPU. The question they posed to her, which they're currently working on, was how do you speed up the regression so that you can push their test system to web scale?</p> <p>The real question is, why would you want to? There's a reason pretty much every intro grad level computer architecture course involves either writing or modifying a simulator; real system performance is full of non-linear cliffs, the sort of thing where you can't just apply a queuing theory model, let alone a linear regression model. But when all you have are linear hammers, non-linear screws look a lot like nails.</p> <p><small><em>In response to this, John Myles White made the good point that linear vs. non-linear isn't really the right framing, and that there really isn't a good vocabulary for talking about this sort of thing. Sorry for being sloppy with terminology. If you want to be more precise, you can replace each mention of &quot;linear&quot; with &quot;mumble mumble objective function&quot; or maybe &quot;simple&quot;.</em></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">When I was in college, the benchmark was MS. I wonder who's going to be next. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">I'm not disclosing the exact problem because they asked to keep the interview problems a secret, so I'm describing a similar problem where matrix decomposition has the same fundamental problems. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:C">If you're familiar with PCA and not LSA, <a href="http://stats.stackexchange.com/questions/65699/lsa-vs-pca-document-clustering">you can think of LSA as something PCA-like</a> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B"><a href="http://www.sciencemag.org/content/313/5786/504.abstract">http://www.sciencemag.org/content/313/5786/504.abstract</a>, <a href="http://www.cs.toronto.edu/~amnih/cifar/talks/salakhut_talk.pdf">http://www.cs.toronto.edu/~amnih/cifar/talks/salakhut_talk.pdf</a>. In a strict sense, this work was obsoleted by a slew of papers from 2011 which showed that you can <a href="http://www.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf">achieve</a> <a href="http://www.stanford.edu/~acoates/papers/coatesng_icml_2011.pdf">similar</a> <a href="http://cs.stanford.edu/~jngiam/papers/NgiamKohChenBhaskarNg2011.pdf">results</a> to this 2006 result with &quot;simple&quot; algorithms, but it's still true that current deep learning methods are better than the best &quot;simple&quot; feature learning schemes, and this paper was the first example that came to mind. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:D">It's funny that I'm writing this blog post because I'm a huge fan of using the simplest thing possible for the job. That's often a linear method. Heck, one of my most common tricks is to replace a complex function with a first order Taylor expansion. 
<a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> </ol> </div> Why hardware development is hard hardware-unforgiving/ Sun, 10 Nov 2013 00:00:00 +0000 hardware-unforgiving/ <p>In CPU design, most successful teams have a fairly long lineage and rely heavily on experienced engineers. When we look at CPU startups, teams that have a successful exist often have a core team that's been together for decades. For example, PA Semi's acquisition by Apple was a moderately successful exit, but where did that team come from? They were the SiByte team, which left after SiByte was acquired by Broadcom, and SiByte was composed of many people from DEC who had been working together for over a decade. My old company was similar: an IBM fellow collected <a href="http://conservancy.umn.edu/handle/120173">the best people he worked with at IBM</a> who was a very early Dell employee and then exec (back when Dell still did interesting design work), then split off to create a chip startup. There have been quite a few CPU startups that have raised tens to hundreds of millions and leaned heavily on inexperienced labor; fresh PhDs and hardware engineers with only a few years of experience. Every single such startup I know of failed<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">1</a></sup>.</p> <p>This is in stark contrast to software startups, where it's common to see successful startups founded by people who are just out of school (or who dropped out of school). Why should microprocessors be any different? It's unheard of for a new, young, team to succeed at making a high-performance microprocessor, although this hasn't stopped people from funding these efforts.</p> <p>In software, it's common to hear about disdain for experience, such as Zuckerberg's comment, &quot;I want to stress the importance of being young and technical, Young people are just smarter.&quot;. Even when people don't explicitly devalue experience, they often don't value it either. As of this writing, Joel Spolsky's ”<a href="http://www.joelonsoftware.com/articles/GuerrillaInterviewing3.html">Smart and gets things done</a>” is probably the most influential piece of writing on software hiring. Note that it doesn't say &quot;smart, experienced, and gets things done.&quot;. Just &quot;smart and gets things done&quot; appears to be enough, no experience required. If you lean more towards the Paul Graham camp than the Joel Spolsky camp, there will be a lot of differences in how you hire, but Paul's advice is the same in that <a href="http://www.paulgraham.com/founders.html">experience doesn't rank as one of his most important criteria</a>, <a href="https://twitter.com/danluu/status/1356462960172883969">except as a diss</a>.</p> <p>Let's say you wanted to hire a plumber or a carptener, what would you choose? &quot;Smart and gets things done&quot; or &quot;experienced and effective&quot;? Ceteris paribus, I'll go for &quot;experienced and effective&quot;, doubly so if it's an emergency.</p> <p>Physical work isn't the kind of thing you can derive from first principles, no matter how smart you are. Consider South Korea after WWII. <a href="http://www.nationmaster.com/graph/eco_gdp_per_cap_in_195-economy-gdp-per-capita-1950">Its GDP per capita was lower than Ghana, Kenya, and just barely above the Congo</a>. 
For various reasons, the new regime didn't have to deal with legacy institutions, and they wanted Korea to become a first-world nation.</p> <p>The story <a href="http://delong.typepad.com/aeh/">I've heard</a> is that the government started by subsidizing concrete. After many years making concrete, they wanted to move up the chain and start more complex manufacturing. They eventually got to building ships, because shipping was a critical part of the export economy they wanted to create.</p> <p>They pulled some of their best business people who had learned skills like management and operations in other manufacturing. Those people knew they didn't have the expertise to build ships themselves, so they contracted it out. They made the choice to work with Scottish firms, because Scotland has a long history of shipbuilding. Makes sense, right?</p> <p>It didn't work. For historical and geographic reasons, Scotland's shipyards weren't full-sized; they built their ships in two halves and then assembled them. Worked fine for them, because they'd been doing it at scale since the 1800s, and had <a href="http://www.educationscotland.gov.uk/scotlandshistory/makingindustrialurban/shipbuilding/index.asp">world renowned expertise</a> by the 1900s. But when the unpracticed Koreans tried to build ships using Scottish plans and detailed step-by-step directions, the result was two ship halves that didn't quite fit together and sank when assembled.</p> <p>The Koreans eventually managed to start a shipbuilding industry by hiring foreign companies to come and build ships locally, showing people how it's done. And it took decades to get what we would consider basic manufacturing working smoothly, even though one might think that all of the requisite knowledge existed in books, was taught in university courses, and could be had from experts for a small fee. Now, their manufacturing industries are world class, e.g., according to Consumer Reports, Hyundai and Kia produce reliable cars. Going from producing unreliable econoboxes to reliable cars you can buy took over a decade, like it did for Toyota when they did it decades earlier. If there's a shortcut to quality other than hiring a lot of people who've done it before, no one's discovered it yet.</p> <p>Today, any programmer can take Geoffrey Hinton's <a href="https://www.coursera.org/course/neuralnets">course on neural networks and deep learning</a>, and start applying state of the art machine learning techniques. In software land, you can fix minor bugs in real time. If it takes a whole day to run your regression test suite, you consider yourself lucky because it means you're in one of the few environments that takes testing seriously. If the architecture is fundamentally flawed, you pull out your copy of Feathers' “<a href="http://www.amazon.com/gp/product/B005OYHF0A/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=B005OYHF0A&amp;linkCode=as2&amp;tag=abroaview-20">Working Effectively with Legacy Code</a>” and repeatedly apply fixes.</p> <p>This isn't to say that software isn't hard, but there are a lot of valuable problems that don't need a decade of hard-won experience to attack. But if you want to build a ship, and you &quot;only&quot; have a decade of experience with carpentry, milling, metalworking, etc., well, good luck. You're going to need it. With a large ship, “minor” fixes can take days or weeks, and a fundamental flaw means that your ship sinks and you've lost half a year of work and tens of millions of dollars.
By the time you get to something with the complexity of a modern high-performance microprocessor, a minor bug discovered in production costs three months and millions of dollars. A fundamental flaw in the architecture will cost you five years and hundreds of millions of dollars<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>.</p> <p>Physical mistakes are costly. There's no undo and editing isn't simply a matter of pressing some keys; changes consume real, physical resources. You need enough wisdom and experience to avoid common mistakes entirely – especially the ones that can't be fixed.</p> <p><i>Thanks to Sophia Wisdom for comments/corrections/discussion.</i></p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <h3 id="2021-comments">2021 comments</h3> <p>In retrospect, I think that I was too optimistic about software in this post. If we're talking about product-market fit and success, I don't think the attitude in the post is wrong and people with little to no experience often do create hits. But now that I've been in the industry for a while and talked to numerous people about infra at various startups as well as large companies, I think creating high quality software infra requires no less experience than creating high quality physical items. Companies that decided this wasn't the case and hired a bunch of smart folks from top schools to build their infra have ended up with low quality, unreliable, expensive, and difficult to operate infrastructure. It just turns out that, if you have very good product-market fit, you don't need your infra to work. Your company can survive and even <a href="programmer-moneyball/">thrive while having infra that has 2 9s of uptime and costs an order of magnitude more than your competitor's infra</a> or <a href="nothing-works/">if your product's architecture means that it can't possibly work correctly</a>. You'll make less money than you would've otherwise, but the high order bits are all on the product side. If you contrast that with chip companies with inexperienced engineers that didn't produce a working product, well, you can't really sell a product that doesn't work even if you try. If you get very lucky, like if you happened to start a deep learning chip company at the right time, you might get a big company to acquire your non-working product. But, it's much harder to get an exit like that for a microprocessor.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:2">Comparing <a href="//danluu.com/glenn-henry-interview/">my old company</a> to another x86 startup founded within the year is instructive. Both started at around the same time. Both had great teams of smart people. Our competitor even had famous software and business people on their side. But it's notable that their hardware implementers weren't a core team of multi-decade industry veterans who had worked together before.
It took us about two years to get a working x86 chip, on top of $15M in funding. Our goal was to produce a low-cost chip and we nailed it. It took them five years, with over $250M in funding. Their original goal was to produce a high performance low-power processor, but they missed their performance target so badly that they were forced into the low-cost space. They ended up with worse performance than us, with a chip that was 50% bigger (and hence, cost more than 50% more to produce) using a team four times our size. They eventually went under, because there's no way they could survive with 4x our burn rate and weaker performance. But, not before burning through $969M in funding (including $230M from patent lawsuits). <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:1"><p>A funny side effect of the importance of experience is that age discrimination doesn't affect the areas I've worked in. At 30, I'm bizarrely young for someone who's done microprocessor design. The core folks at my old place were in their 60s. They'd picked up some younger folks along the way, but 30? Freakishly young. People are much younger at the new gig: I'm surrounded by ex-supercomputer folks from <a href="http://en.wikipedia.org/wiki/Cray#Cray_Research_Inc._and_Cray_Computer_Corporation:_1972_to_1996">Cray and SGI</a>, who are barely pushing 50, along with a couple kids from <a href="http://en.wikipedia.org/wiki/Synplicity">Synplicity</a> and <a href="http://www.deshawresearch.com/">DESRES</a> who, at 40, are unusually young. Not all hardware folks are that old. In another arm of the company, there are folks who grew up in the FPGA world, which is a lot more forgiving. In that group, I think I met someone who's only a few years older than me. Kidding aside, you'll see younger folks doing RTL design on complex projects at large companies that are willing to spend a decade mentoring folks. But, at startups and on small hardware teams that move fast, it's rare to hire someone into design who doesn't have a decade of experience.</p> <p>There's a crowd that's even younger than the FPGA folks, even younger than me, working on Arduinos and microcontrollers, doing hobbyist electronics and consumer products. I'm genuinely curious how many of those folks will decide to work on large-scale systems design. In one sense, it's inevitable as the area matures and solutions become more complex. The other sense is what I'm curious about: will the hardware renaissance spark an interest in supercomputers, microprocessors, and warehouse-scale computers?</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> How to discourage open source contributions discourage-oss/ Sun, 27 Oct 2013 00:00:00 +0000 discourage-oss/ <p>What's the first thing you do when you find a bug or see a missing feature in an open source project? Check out the project page and submit a patch!</p> <p><img src="images/discourage-oss/send-us-116-pull-requests.png" alt="Send us a pull request! (116 open pull requests)" width="1012" height="152"></p> <p>Oh. Maybe their message is so encouraging that they get hundreds of pull requests a week, and the backlog isn't that bad.</p> <p><img src="images/discourage-oss/why-is-this-ignored.png" alt="Multiple people ask why this bug fix is being ignored. No response." width="745" height="472"></p> <p>Maybe not. Giant sucker that I am, I submitted a pull request even after seeing that.
All things considered, I should consider myself lucky that it's possible to <a href="http://www.jwz.org/blog/2013/02/wow-remember-when-people-tracked-bugs-welcome-to-the-next-level/">submit pull requests at all</a>. If I'm really lucky, maybe they'll get around to <a href="http://www.jwz.org/doc/cadt.html">looking at it one day</a>.</p> <p></p> <p>I don't mean to pick on this particular project. I can understand how this happens. You're a dev who can merge pull requests, but you're not in charge of triaging bugs and pull requests; you have a day job, projects that you own, and a life outside of coding. Maybe you take a look at the repo every once in a while, merge in good pull requests, and make comments on the ones that need more work, but you don't look at all 116 open pull requests; who has that kind of time?</p> <p>This behavior, eminently reasonable on the part of any individual, results in a systemic failure, a tax on new open source contributors. I often get asked how to get started with open source. It's easy for me to forget that getting started can be hard because the first projects I contributed to have a response time measured in hours for issues and pull requests<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. But a lot of people have experiences which aren't so nice. They contribute a few patches to a couple projects that get ignored, and have no idea where to go from there. It doesn't take <a href="http://lostechies.com/derickbailey/2012/12/14/dear-open-source-project-leader-quit-being-a-jerk/">egregious individual behavior</a> to create a hostile environment.</p> <p>That's kept me from contributing to some projects. At my last job, I worked on making a well-known open source project production quality, fixing hundreds of bugs over the course of a couple months. When I had some time, I looked into pushing the changes back to the open source community. But when I looked at the mailing list for the project, I saw a wasteland of good patches that were completely ignored, where the submitter would ping the list a couple times and then give up. Did it seem worth spending a week to disentangle our IP from the project in order to submit a set of patches that would, in all likelihood, get ignored? No.</p> <p>If you have commit access to a project that has this problem, please <a href="http://en.wikipedia.org/wiki/Diffusion_of_responsibility">own the process</a> for incoming pull requests (or don't ask for pull requests in your repo description). It doesn't have to be permanent; just until you have a system in place<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. Not only will you get more contributors to your project, you'll help break down one barrier to becoming an open source contributor.</p> <p><img src="images/discourage-oss/plz-review.png" alt="joewiz replies to a month old comment. Asks for review months later. No reply." width="747" height="522"></p> <p><small></p> <p>For an update on the repo featured in this post, <a href="//danluu.com/everything-is-broken/#github">check out this response to a breaking change</a>.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Props to <a href="https://github.com/xianyi/OpenBLAS">OpenBlas</a>, <a href="https://github.com/mozilla/rust">Rust</a>, <a href="https://github.com/levskaya/jslinux-deobfuscated">jslinux-deobfuscated</a>, and <a href="https://github.com/softprops/np">np</a> for being incredibly friendly to new contributors.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">I don't mean to imply that this is trivial. It can be hard, if your project doesn't have an accepting culture, but there are <a href="http://paulmillr.com/posts/github-pull-request-stats/">popular, high traffic projects that manage to do it</a>. If all else fails, you can always try <a href="http://felixge.de/2013/03/11/the-pull-request-hack.html">the pull request hack</a>. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> Randomize HN randomize-hn/ Fri, 04 Oct 2013 00:00:00 +0000 randomize-hn/ <p>You ever notice that there's this funny threshold for getting to the front page on sites like HN? The exact threshold varies depending on how much traffic there is, but, for articles that aren't wildly popular, there's this moment when the article is at N-1 votes. There is, perhaps, a 60% chance that the vote will come and the article will get pushed to the front page, where it will receive a slew of votes. There is, maybe, a 40% chance it will never get the vote that pushes it to the front page, causing it to languish in obscurity forever.</p> <p>It's non-optimal that an article that will receive 50 votes <a href="https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/expected-value--e-x">in expectation</a> has a 60% chance of getting 100+ votes, and a 40% chance of getting 2 votes. Ideally, each article would always get its expected number of votes and stay on the front page for the expected number of time, giving readers exposure to the article in proportion to its popularity. Instead, by random happenstance, plenty of interesting content never makes it the front page, and as a result, the content that does make it gets a higher than optimal level of exposure.</p> <p></p> <p>You also see the same problem, with the sign bit flipped, on low traffic sites that push things to the front page the moment they're posted, like <a href="https://lobste.rs/">lobste.rs</a> and <a href="http://www.reddit.com/r/programming/">the smaller sub-reddits</a>: they displace links that most people would be interested in by putting links that almost no one cares about on the front page just so that the few things people do care about get enough exposure to be upvoted. On reddit, users &quot;fix&quot; this problem by heavily downvoting most submissions, pushing them off the front page, resulting in a problem that's fundamentally the same as the problem HN has.</p> <p>Instead of implementing some simple and easy to optimize, sites pile on ad hoc rules. Reddit <a href="https://news.ycombinator.com/item?id=6499144">implemented the rising page</a>, but it fails to solve the problem. On low-traffic subreddits, like <a href="http://www.reddit.com/r/programming/rising">r/programming</a> the threshold is so high that it's almost always empty. On high-traffic sub-reddits, anything that's upvoted enough to make it to the rising page is already wildly successful, and whether or not an article becomes successful is heavily dependent on whether or not the first couple voters happen to be people who upvote the post instead of downvoting it, i.e., the problem of getting onto the rising page is no different than the problem of getting to the top normally.</p> <p>HN tries to solve the problem by manually penalizing certain domains and keywords. That doesn't solve the problem for the 95% of posts that aren't penalized. 
<p>Adding noise to smooth out a discontinuity is a common trick when you can settle for an approximate result. I recently employed it to work around the classic floating point problem, where adding a tiny number to a large number results in no change, which is a problem when adding many small numbers to some large numbers<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">3</a></sup>. For a simple example of applying this, consider keeping a reduced precision counter that uses <code>loglog(n)</code> bits to store the value. Let <code>countVal(x) = 2^x</code> and <code>inc(x) = if (rand(2^x) == 0) x++</code><sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">4</a></sup>. Like understanding when to apply Taylor series, this is a simple trick that people are often impressed by if they haven't seen it before<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">5</a></sup>.</p>
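<p>Here's a minimal Python sketch of that counter, following the <code>countVal</code>/<code>inc</code> definitions above (the loop and variable names are just for illustration):</p> <pre><code>import random

# Reduced-precision counter: store x, report 2^x, and increment x with probability 1/2^x.
def count_val(x):
    return 2 ** x

def inc(x):
    # rand(2^x) == 0 happens with probability 1/2^x
    return x + 1 if random.randrange(2 ** x) == 0 else x

x = 0
for _ in range(100000):
    x = inc(x)
print(count_val(x))  # a rough estimate of 100000, kept in loglog-sized state
</code></pre>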
<p><strong>Update</strong>: HN tried this! Dan Gackle tells me that it didn't work very well (it resulted in a lot of low quality junk briefly hitting the front page and then disappearing). I think that might be fixable by tweaking some parameters, but the solution that HN settled on, having a human (or multiple humans) put submissions that are deemed to be good or interesting into a &quot;second chance queue&quot; that boosts the submission onto the front page, works better than a simple randomized algorithm with no direct human input could with any amount of parameter tweaking. I think this is also true of moderation, where the &quot;new&quot; dang/sctb moderation regime has resulted in a marked increase in comment quality, probably better than anything that could be done with an automated ML-based solution today — Google and FB have some of the most advanced automated systems in the world, and the quality of the result is much worse than what we see on HN.</p> <p>Also, at the time this post was written (2013), the threshold to get onto the front page was often 2-3 votes, making the marginal impact of a random passerby who happens to like a submission checking the new page very large. Even during off peak times now (in 2019), the threshold seems to be much higher, reducing the amount of randomness. Additionally, the rise in the popularity of HN increased the sheer volume of low quality content that languishes on the new page, which would reduce the exposure that any particular &quot;good&quot; submission would get if it were among the 30 items on the new page that would randomly get boosted onto the front page. That doesn't mean there aren't still problems with the current system: most people seem to upvote and comment based on the title of the article and not the content (to check this, read the comments of articles that are mistitled before someone calls this out for a particular post — it's generally quite clear that most commenters haven't even skimmed the article, let alone read it), but that's a topic for a different post.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">Another way to look at it is that it's A/B testing for upvotes (though, to be pedantic, it's actually closer to a multi-armed bandit). Another is that the distribution of people reading the front page and the new page aren't the same, and randomizing the front page prevents the clique that reads the new page from having undue influence. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:1">If you want to do the exercise yourself, pg once said the formula for HN is: (votes - 1) / (time + 2)^1.5. It's possible the power of the denominator has been tweaked, but as long as it's greater than 1.0, you'll get a reasonable result. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2"><a href="http://en.wikipedia.org/wiki/Kahan_summation_algorithm">Kahan summation</a> wasn't sufficient, for fundamentally the same reason it won't work for the simplified example I gave above. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3">Assume we use a rand function that returns a non-negative integer between 0 and n-1, inclusive. With x = 0, we start counting from 1, <a href="http://julialang.org/">as God intended</a>. inc(0) will definitely increment, so we'll increment and correctly count to countVal(1) = 2^1 = 2. Next, we'll increment with probability ½; we'll have to increment twice in expectation to increase x. That works out perfectly because countVal(2) = 2^2 = 4, so we want to increment twice before increasing x. Then we'll increment with probability ¼, and so on and so forth. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4">See <a href="http://www.amazon.com/gp/product/0521835402/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0521835402&amp;linkCode=as2&amp;tag=abroaview-20">Mitzenmacher</a> for a good introduction to randomized algorithms that also has an explanation of all the math you need to know. If you already apply Chernoff bounds in your sleep, and want something more in-depth, <a href="http://www.amazon.com/gp/product/0521474655/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0521474655&amp;linkCode=as2&amp;tag=abroaview-20">Motwani &amp; Raghavan</a> is awesome. <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> </ol> </div> Writing safe Verilog pl-troll/ Sun, 15 Sep 2013 00:00:00 +0000 pl-troll/ <p><img src="images/pl-troll/davidalbert-pl-troll.png" alt="PL troll: a statically typed language with no type declarations. Types are determined entirely using Hungarian notation" width="1160" height="448"></p> <p>Troll?
That's how people write Verilog<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. At my old company, we had a team of formal methods PhDs who wrote a linter that typechecked our code, based on our naming convention. For our chip (which was small for a CPU), building a model (compiling) took about five minutes, running a single short test took ten to fifteen minutes, and long tests took CPU months. The value of a linter that can run in seconds should be obvious, not even considering the fact that it can take hours of tracing through waveforms to find out why a test failed<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>.</p> <p>Let's look at some of the most commonly used naming conventions.</p> <p></p> <h3 id="pipeline-stage">Pipeline stage</h3> <p>When you <a href="http://en.wikipedia.org/wiki/Pipeline_(computing)">pipeline</a> hardware, you end up with many versions of the same signal, one for each stage of the pipeline the signal traverses. Even without static checks, you'll want some simple way to differentiate between these, so you might name them <code>foo_s1</code>, <code>foo_s2</code>, and <code>foo_s3</code>, indicating that they originate in the first, second, and third stages, respectively. In any particular stage, a signal is most likely to interact with other signals in the same stage; it's often a mistake when logic from other stages is accessed. There are reasons to access signals from other stages, like bypass paths and control logic that looks at multiple stages, but logic that stays contained within a stage is common enough that it's not too tedious to either “cast” or add a comment that disables the check, when looking at signals from other stages.</p>
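<p>As a rough illustration of how little machinery a check like this needs, here's a toy Python sketch that flags expressions mixing pipeline-stage suffixes; the signal names and the waiver comment are made up, and a real linter would do a bit more parsing:</p> <pre><code>import re

# Flag any line whose signals carry more than one pipeline-stage suffix (_s1, _s2, ...),
# unless the line has an explicit waiver comment (for bypass paths, control logic, etc.).
STAGE = re.compile(r'\b\w+_s(\d+)\b')

def check_line(line):
    if 'lint_waive_stage' in line:            # the "cast"/waiver escape hatch
        return None
    stages = set(STAGE.findall(line))
    if len(stages) not in (0, 1):
        return 'mixed pipeline stages %s: %s' % (sorted(stages), line.strip())
    return None

example = [
    'assign valid_s2 = valid_s1 ^ stall_s2;           // bug: s1 term in s2 logic',
    'assign result_s3 = a_s3 + b_s3;                  // fine',
    'assign bypass_s3 = alu_s2;  // lint_waive_stage  // deliberately waived',
]
for line in example:
    message = check_line(line)
    if message:
        print(message)
</code></pre>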
<h3 id="clock-domain">Clock domain</h3> <p>Accessing a signal in a different clock domain without synchronization is like accessing a data structure from multiple threads without synchronization. Sort of. But worse. Much worse. Driving combinational logic from a metastable state (where the signal is sitting between a 0 and 1) can burn a massive amount of power<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup>. Here, I'm not just talking about being inefficient. If you took a high-power chip from the late 90s and removed the heat sink, it would melt itself into the socket, even under normal operation. Modern chips have such a high maximum possible power consumption that the chips would self destruct if you disabled the thermal regulation, even with the heat sink. Logic that's floating at an intermediate value not only uses a lot of power, it bypasses a chip's usual ability to reduce power by slowing down the clock<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup>. Using cross clock domain signals without synchronization is a bad idea, unless you like random errors, high power dissipation, and the occasional literal meltdown.</p> <h3 id="module-region">Module / Region</h3> <p>In high speed designs, it's an error to use a signal that's sourced from another module without registering it first. This will insidiously sneak through simulation; you'll only notice when you look at the timing report. On the last chip I worked on, it took about two days to generate a timing report (and you couldn't concurrently generate a new one for each build due to the license costs). If you accidentally reference a signal from a distant module, not only will you not meet your timing budget for that path, but the synthesis tool will allocate resources to try to make that path faster, which will slow down everything else<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">5</a></sup>, making the entire timing report worthless<sup class="footnote-ref" id="fnref:9"><a rel="footnote" href="#fn:9">6</a></sup>.</p> <h3 id="pl-trolling">PL Trolling</h3> <p>I'd been feeling naked at my new gig, coding Verilog without any sort of static checking. I put off writing my own checker, because static analysis is one of those scary things you need a PhD to do, right? And writing a parser for SystemVerilog is a ridiculously large task<sup class="footnote-ref" id="fnref:10"><a rel="footnote" href="#fn:10">7</a></sup>. But, it turns out that you don't need much of a parser, and all the things I've talked about are simple enough that half an hour after starting, I had a tool that found seven bugs, with only two false positives. I expect we'll have 4x as much code by the time we're done, so that's 28 bugs from half an hour of work, not even considering the fact that two of the bugs were in heavily used macros.</p> <p>I think I'm done for the day, but there are plenty of other easy things to check that will certainly find bugs (e.g., checking for regs/logic that are declared or assigned, but not used). Whenever I feel like tackling a self-contained challenge, there are plenty of not-so-easy things, too (e.g., checking if things aren't clock gated or power gated when they should be, which isn't hard to do statistically, but is non-trivial statically).</p> <p>Huh. That wasn't so bad. I've now graduated to junior PL troll.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Well, people usually use suffixes as well as prefixes. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">You should, of course, write your own tool to script interaction with your waveform view because waveform viewers have such poor interfaces, but that's a whole ‘nother blog post. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3">In <a href="http://en.wikipedia.org/wiki/CMOS">static CMOS</a> there's a network of transistors between power and output, and a dual network between ground and output. As a first-order approximation, only one of the two networks should be on at a time, except when switching, which is why switching logic gates use more power than unchanging gates -- in addition to the power used to discharge the capacitance that the output is driving, there is, briefly, a direct connection from power to ground. If you get stuck in a half-on state, there's a constant connection from power to ground. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4">In theory, power gating could help, but you can't just power gate some arbitrary part of the chip that's too hot. <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> <li id="fn:7">There are a number of reasons that this completely destroys the timing report.
First, for any high-speed design, there's not enough fast (wide) interconnect to go around. Gates are at the bottom, and wires sit above them. Wires get wider and faster in higher layers, but there's congestion getting to and from the fast wires, and relatively few of them. There are so few of them that people pre-plan where modules should be placed in order to have enough fast interconnect to meet timing demands. If you steal some fast wires to make some slow path fast, anything relying on having a fast path through that region is hosed. Second, the synthesis tool tries to place sources near sinks, to reduce both congestion and delay. If you place a sink on a net that's very far from the rest of the sinks, the source will migrate halfway in between, to try to match the demands of all the sinks. This is recursively bad, and will pull all the second order sources away from their optimal location, and so on and so forth. <a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li> <li id="fn:9">With some tools, you can have them avoid optimizing paths that fail timing by more than a certain margin, but there's still always some window where a bad path will destroy your entire timing report, and it's often the case that there are real critical paths that need all the resources the synthesis tool can throw at it to make it across the chip in time. <a class="footnote-return" href="#fnref:9"><sup>[return]</sup></a></li> <li id="fn:10">The SV standard is 1300 pages long, vs 800 for C++, 500 for C, 300 for Java, and 30 for Erlang. <a class="footnote-return" href="#fnref:10"><sup>[return]</sup></a></li> </ol> </div> Verilog is weird why-hardware-development-is-hard/ Sat, 07 Sep 2013 00:00:00 +0000 why-hardware-development-is-hard/ <p>Verilog is the most commonly used language for hardware design in America (VHDL is more common in Europe). Too bad it's so baroque. If you ever browse the <a href="http://stackoverflow.com/questions/tagged/verilog">Verilog questions on Stack Overflow</a>, you'll find a large number of questions, usually downvoted, asking “why doesn't my code work?”, with code that's not just a little off, but completely wrong.</p> <p><img src="images/so_verilog.png" alt="6 questions, all but one with negative score"></p> <p></p> <p>Lets look at <a href="http://stackoverflow.com/questions/18678426/verilog-store-counter-value-when-reset-is-asserted">an example</a>: “Idea is to store value of counter at the time of reset . . . I get DRC violations and the memory, bufreadaddr, bufreadval are all optimized out.”</p> <pre><code>always @(negedge reset or posedge clk) begin if (reset == 0) begin d_out &lt;= 16'h0000; d_out_mem[resetcount] &lt;= d_out; laststoredvalue &lt;= d_out; end else begin d_out &lt;= d_out + 1'b1; end end always @(bufreadaddr) bufreadval = d_out_mem[bufreadaddr]; </code></pre> <p>We want a counter that keeps track of how many cycles it's been since reset, and we want to store that value in an array-like structure that's indexed by resetcount. If you've read a bit on semantics of Verilog, this is a perfectly natural way to solve the problem. Our poster knows enough about Verilog to use ‘&lt;=' in state elements, so that all of the elements are updated at the same time. Every time there's a clock edge, we'll increment d_out. When reset is 0, we'll store that value and reset d_out. 
What could possibly go wrong?</p> <p>The problem is that Verilog <a href="http://en.wikipedia.org/wiki/Verilog#History">was originally designed as a language to describe simulations</a>, so it has constructs to describe arbitrary interactions between events. When X transitions from 0 to 1, do Y. Great! Sounds easy enough. But then someone had the bright idea of using Verilog to represent hardware. The vast majority of statements you could write down don't translate into any meaningful hardware. Your synthesis tool, which translates from Verilog to hardware, will helpfully pattern match to the closest available thing, or produce nothing, if you write down something untranslatable. If you're lucky, you might get some warnings.</p> <p>Looking at the code above, the synthesis tool will see that there's something called d_out which should be a clocked element that's set to something when it shouldn't be reset, and is otherwise asynchronously reset. That's a legit hardware construct, so it will produce an N-bit flip-flop and some logic to make it a counter that gets reset to 0. BTW, this paragraph used to contain a link to <a href="http://en.wikipedia.org/wiki/Flip-flop">http://en.wikipedia.org/wiki/Flip-flop</a>_(electronics), but ever since I switched to Hugo, my links to URLs with parens in them are broken, so maybe try copy+pasting that URL into your browser window if you want to know what a flip-flop is.</p> <p>Now, what about the value we're supposed to store on reset? Well, the synthesis tool will see that it's inside a block that's clocked. But it's not supposed to do anything when the clock is active; only when reset is asserted. That's pretty unusual. What's going to happen? Well, that depends on which version of which synthesis tool you're using, and how the programmers of that tool decided to implement undefined behavior.</p> <p>And then there's the block that's supposed to read out the stored value. It looks like the intent is to create a 64:1 <a href="http://en.wikipedia.org/wiki/Multiplexer">MUX</a>. Putting aside the cycle time issues you'll get with such a wide MUX, the block isn't clocked, so the synthesis tool will have to infer some sort of combinational logic. But, the output is only supposed to change if bufreadaddr changes, and not if d_out_mem changes. It's quite easy to describe that in our simulation language, but the synthesis tool is going to produce something that is definitely not what the user wants here. Not to mention that laststoredvalue isn't meaningfully connected to bufreadval.</p> <p>How is it possible that a reasonable description of something in Verilog turns into something completely wrong in hardware? You can think of hardware as some state, with pure functions connecting the state elements. This makes it natural to think about modeling hardware in a functional programming language. Another natural way to think about it would be with OO. Classes describe how the hardware works. Instances of the class are actual hardware that will get put onto the chip. Yet another natural way to describe things would be declaratively, where you write down constraints the hardware must obey, and the synthesis tool outputs something that meets those constraints.</p> <p>Verilog does none of these things. To write Verilog that will produce correct hardware, you have to first picture the hardware you want to produce. Then, you have to figure out how to describe that in this weird C-like simulation language. That will then get synthesized into something like what you were imagining in the first step.</p>
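<p>To make the &quot;state plus pure functions&quot; view concrete, here's a tiny Python stand-in (not an HDL) for the counter-capture example above; the structure and the input trace are made up for illustration:</p> <pre><code># All state is explicit, and one pure function computes the next state from the
# current state and the inputs -- the mental model described above, not real hardware.
def step(state, reset):
    d_out, mem, resetcount = state
    if reset:
        new_mem = dict(mem)
        new_mem[resetcount] = d_out           # capture the counter value on reset
        return (0, new_mem, resetcount + 1)
    return (d_out + 1, mem, resetcount)

state = (0, {}, 0)
for reset in [0, 0, 0, 1, 0, 0, 1]:           # a made-up input trace
    state = step(state, reset)
print(state[1])                               # {0: 3, 1: 2}: captured counts, per reset
</code></pre>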
<p>As a software engineer, how would you feel if 99% of valid Java code ended up being translated to something that produced random results, even though tests pass on the untranslated Java code? And, by the way, to run tests on the translated Java code you have to go through a multi-day long compilation process, after which your tests will run 200 million times slower than code runs in production. If you're thinking of testing on some sandboxed production machines, sure, go ahead, but it costs 8 figures to push something to any number of your production machines, and it takes 3 months. But, don't worry, you can run the untranslated code only 2 million times slower than in production <sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. People used to statically typed languages often complain that you get run-time errors about things that would be trivial to statically check in a language with stronger types. We hardware folks are so used to the vast majority of legal Verilog constructs producing unsynthesizable garbage that we don't find it the least bit surprising that not only do you not get compile-time errors, you don't even get run-time errors, from writing naive Verilog code.</p> <p>Old school hardware engineers will tell you that it's fine. It's fine that the language is so counter-intuitive that almost all people who initially approach Verilog write code that's not just wrong but nonsensical. &quot;All you have to do is figure out the design and then translate it to Verilog&quot;. They'll tell you that it's totally fine that the mental model you have of what's going on is basically unrelated to the constructs the language provides, and that they never make errors now that they're experienced, much like some experienced C programmers will erroneously tell you that they never have security related buffer overflows or double frees or memory leaks now that they're experienced. It reminds me of talking to assembly programmers who tell me that assembly is as productive as a high level language once you get your functions written. Programmers who haven't talked to old school assembly programmers will think I'm making that up, but I know a number of people who still maintain that assembly is as productive as any high level language out there. But people like that are rare and becoming rarer. With hardware, we train up a new generation of people who think that Verilog is as productive as any language could be every few years!</p> <p>I won't even get into how Verilog is so inexpressive that many companies use an ad hoc tool to embed a scripting language in Verilog or generate Verilog from a scripting language.</p> <p>There have been a number of attempts to do better than jamming an ad hoc scripting language into Verilog, but they've all fizzled out. As a functional language that's easy to add syntax to, Haskell is a natural choice for Verilog code generation; it spawned ForSyDe, Hydra, Lava, HHDL, and Bluespec. But adoption of ForSyDe, Hydra, Lava, and HHDL is pretty much zero, not because of deficiencies in the language, but because it's politically difficult to get people to use a Haskell based language. Bluespec has done better, but they've done it by making their language look C-like, scrapping the original Haskell syntax and introducing Bluespec SystemVerilog and Bluespec SystemC.
The aversion to Haskell is so severe that when we discussed a hardware style at my new gig, one person suggested banning any Haskell based solution, even though Bluespec has been used to good effect in a couple projects within the company.</p> <p>Scala based solutions look more promising, not for any technical reason, but because Scala is less scary. Scala has managed to bring the modern world (in terms of type systems) to more programmers than ML, Ocaml, Haskell, Agda, etc., combined. Perhaps the same will be true in the hardware world. <a href="https://chisel.eecs.berkeley.edu">Chisel</a> is interesting. Like Bluespec, it simulates much more quickly than Verilog, and unsynthesizable representations are syntax errors. It's not as high level, but it's the only hardware description language with a modern type system that I've been able to discuss with hardware folks without people objecting that Haskell is a bad idea.</p> <p>Commercial vendors are mostly moving in the other direction because C-like languages make people feel all warm and fuzzy. A number of them are pushing high-level hardware synthesis from SystemC, or even straight C or C++. These solutions are also politically difficult to sell, but this time it's the history of the industry, and not the language. Vendors pushing high-level synthesis have a decades-long track record of overpromising and underdelivering. I've lost track of the number of times I've heard people dismiss modern offerings with “Why should we believe that they're for real this time?”</p> <p>What's the future? Locally, I've managed to convince a couple of people on my team that Chisel is worth looking at. At the moment, none of the Haskell based solutions are even on the table. I'm open to suggestions.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><small>P.S. Dear hardware folks, sorry for oversimplifying so much. I started writing footnotes explaining everything I was glossing over until I realized that my footnotes were longer than the post. The culled footnotes may make it into their own blog posts some day. A very long footnote that I'll briefly summarize is that semantically correct Verilog simulation is inherently slower than something like Bluespec or Chisel because of the complications involved with the event model. EDA vendors have managed to get decent performance out of Verilog, but only by hiring large teams of the best simulation people in the world to hammer at the problem, the same way JavaScript is fast not because of any property of the language, but because there are amazing people working on the VM. It should tell you something when a tiny team working on a shoestring grant-funded budget can produce a language and simulation infrastructure that smokes existing tools.</p> <p>You may wonder why I didn't mention linters.
They're a great idea and, for reasons I don't understand, two of the three companies I've done hardware development for haven't used linters. If you ask around, everyone will agree that they're a good idea, but even though a linter will run in the thousands to tens of thousands of dollars range, and engineers run in the hundreds of thousands of dollars range, it hasn't been politically possible to get a linter even on multi-person teams that have access to tools that cost tens or hundreds of thousands of dollars per license per year. Even though linters are a no-brainer, companies that spend millions to tens of millions a year on hardware development often don't use them, and good SystemVerilog linters are all out of the price range of the people who are asking StackOverflow questions that get downvoted to oblivion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Approximate numbers from the last chip I worked on. We had licenses for both major commercial simulators, and we were lucky to get 500Hz, pre-synthesis, on the faster of the two, for a chip that ran at 2GHz in silicon. Don't even get me started on open source simulators. The speed is at least 10x better for most ASIC work. Also, you can probably do synthesis much faster if you don't have timing / parasitic extraction baked into the process. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> About danluu.com about/ Sun, 01 Sep 2013 00:00:00 +0000 about/ <h3 id="about-the-blog">About The Blog</h3> <p>This started out as a way to jot down thoughts on areas that seem interesting but underappreciated. Since then, this site has grown to the point where it gets <a href="blog-ads/#update">millions of hits a month</a> and I see that it's commonly cited by professors in their courses and on stackoverflow.</p> <p>That's flattering, but more than anything else, I view that as a sign there's a desperate shortage of <a href="//danluu.com/teach-debugging/">understandable explanations of technical topics</a>. There's nothing here that most of my co-workers don't know (with the exception of maybe three or four posts where I propose novel ideas). It's just that they don't blog and I do. I'm not going to try to <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">convince you to start writing a blog</a>, since that has to be something you want to do, but I will point out that there's a large gap that's waiting to be filled by your knowledge. When I started writing this blog, I figured almost no one would ever read it; sure, Joel Spolsky and Steve Yegge created widely read blogs, but that was back when almost no one was blogging. Now that there are millions of blogs, there's just no way to start a new blog and get noticed. Turns out that's not true.</p> <p>This site also archives a few things that have fallen off the internet, like <a href="//danluu.com/subspace-history/">this history of subspace, the 90s video game</a>, the <a href="//danluu.com/su3su2u1/physics/">su3su2u1 introduction to physics</a>, <a href="//danluu.com/su3su2u1/hpmor/">the su3su2u1 review of hpmor</a>, <a href="//danluu.com/symbolics-lisp-machines/">Dan Weinreb's history of Symbolics and Lisp machines</a>, <a href="//danluu.com/open-social-networks/">this discussion of open vs.
closed social networks</a>, <a href="//danluu.com/mit-stanford/">this discussion about the differences between SV and Boston, and Stanford and MIT</a>, <a href="//danluu.com/threads-faq/">the comp.programming.threads FAQ</a>, and <a href="//danluu.com/microsoft-culture/">this presentation about Microsoft culture from 2000</a><a href="//danluu.com/mcilroy-unix">.</a></p> <p>P.S. If you enjoy this blog, you'd probably enjoy <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, which I've heard called &quot;nerd camp for programmers&quot;.</p> Latency mitigation strategies (by John Carmack) latency-mitigation/ Tue, 05 Mar 2013 00:00:00 +0000 latency-mitigation/ <p><i>this is an archive of an old article by John Carmack which seems to have disappeared off of the internet</i></p> <h4 id="abstract">Abstract</h4> <p>Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint. The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.</p> <p>Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible. Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.</p> <p>A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.</p> <h4 id="introduction">Introduction</h4> <p>Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions. With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.</p> <p>Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.</p> <p>If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world. Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness. Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.</p> <p>A total system latency of 50 milliseconds will feel responsive, but still subtly lagging. One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge. Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion. 
When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.</p> <p>Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed. It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.</p> <h4 id="data-collection">Data collection</h4> <p>It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency. An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update. The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.</p> <p>In most cases there will be a significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to display total latencies in the dozens of 240 fps video frames.</p> <p>On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds. Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering. The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering. The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.</p> <p>In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.</p> <h4 id="sensors">Sensors</h4> <p>Detecting a physical action can be as simple as watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.</p> <p>In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues. Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.</p> <p>Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it. Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.</p> <p>Communication bandwidth delay on older serial ports or wireless links can be significant in some cases. If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message.
Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.</p> <p>Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter” as the sensor data is used for a video frame rate that differs from the sensor frame rate. This latency ranges from close to zero if the sensor packet arrived just before it was queried, up to the repeat time for sensor messages. Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware. The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.</p> <h4 id="displays">Displays</h4> <p>On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.</p> <p>Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades. The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.</p> <p>Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements. A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels. 24 fps movie footage will be converted to 60 fps refresh rates. Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays. Content protection may be applied. Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.</p> <p>Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay. Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.</p> <p>Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.</p> <p>A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. 
This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.</p> <h4 id="host-processing">Host processing</h4> <p>The classic processing model for a game or VR application is:</p> <pre><code>Read user input -&gt; run simulation -&gt; issue rendering commands -&gt; graphics drawing -&gt; wait for vsync -&gt; scanout I = Input sampling and dependent calculation S = simulation / game execution R = rendering engine G = GPU drawing time V = video scanout time </code></pre> <p>All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.</p> <p>If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values. However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast. This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.</p> <pre><code>Ample performance, vsync: ISRG------------|VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds </code></pre> <p>Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines. The impact of the tear lines are related to the disparity between the two frames that are being torn between, and the amount of time that the tear lines are visible. Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display. Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to the faster displays.</p> <p>If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous. Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.</p> <pre><code>Ample performance, unsynchronized: ISRG VVVVV ..... latency 5 – 8 milliseconds at ~200 frames per second </code></pre> <p>In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially. Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.</p> <pre><code>CPU:ISSSSSRRRRRR----| GPU: |GGGGGGGGGGG----| VID: | |VVVVVVVVVVVVVVVV| .................................. latency 32 – 48 milliseconds </code></pre> <p>When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames. 
It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation. Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.</p> <pre><code>CPU1:ISSSSSSSS-------| CPU2: |RRRRRRRRR-------| GPU : | |GGGGGGGGGG------| VID : | | |VVVVVVVVVVVVVVVV| .................................................... latency 48 – 64 milliseconds </code></pre> <p>Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.</p> <h4 id="latency-reduction-strategies">Latency Reduction Strategies</h4> <h4 id="prevent-gpu-buffering">Prevent GPU buffering</h4> <p>The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”. Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise. A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:</p> <pre><code>SwapBuffers(); DrawTinyPrimitive(); InsertGPUFence(); BlockUntilFenceIsReached(); </code></pre> <p>While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.</p> <p>To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed. The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls. If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.</p> <pre><code>Run GPU with minimal buffering: CPU1:ISSSSSSSS-------| CPU2: |RRRRRRRRR-------| GPU : |-GGGGGGGGGG-----| VID : | |VVVVVVVVVVVVVVVV| ................................... latency 32 – 48 milliseconds </code></pre> <p>Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap. 
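</p> <p>As a concrete illustration of the swap-then-fence idiom above, here is a minimal sketch using standard OpenGL 3.2 sync objects (glFenceSync / glClientWaitSync). The SwapWindow() and IssueTinyDraw() helpers are hypothetical stand-ins for the platform buffer swap and a trivial draw call, and a GL function loader is assumed to already be initialized:</p> <pre><code>// Sketch: drain the GPU command queue right after the buffer swap so the
// next frame starts with minimal buffering. Assumes a current OpenGL 3.2+
// context; SwapWindow() and IssueTinyDraw() are hypothetical placeholders.
#include &lt;glad/glad.h&gt;   // or any other GL loader

extern void SwapWindow();     // platform buffer swap (placeholder)
extern void IssueTinyDraw();  // tiny draw so the fence follows real work

void SwapAndIdleGpu() {
    SwapWindow();
    IssueTinyDraw();
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    // Flush the command stream and block until the GPU has executed
    // everything up to the fence (timeout given in nanoseconds).
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 20u * 1000u * 1000u);
    glDeleteSync(fence);
}
</code></pre> <p>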
In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.</p> <p>High end, multiple GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames</p> <pre><code>Alternate Frame Rendering dual GPU: CPU1:IOSSSSSSS-------|IOSSSSSSS-------| CPU2: |RRRRRRRRR-------|RRRRRRRRR-------| GPU1: | GGGGGGGGGGGGGGGGGGGGGGGG--------| GPU2: | | GGGGGGGGGGGGGGGGGGGGGGG---------| VID : | | |VVVVVVVVVVVVVVVV| .................................................... latency 48 – 64 milliseconds </code></pre> <p>Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.</p> <p>An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.</p> <p>The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.</p> <h4 id="late-frame-scheduling">Late frame scheduling</h4> <p>Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.</p> <p>It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic. If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.</p> <pre><code>Late frame scheduling: CPU1:SSSSSSSSS------I| CPU2: |RRRRRRRRR-------| GPU : |-GGGGGGGGGG-----| VID : | |VVVVVVVVVVVVVVVV| .................... latency 18 – 34 milliseconds </code></pre> <p>Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.</p> <p>The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power. 
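</p> <p>A rough sketch of what late frame scheduling might look like on the CPU, assuming hypothetical SampleHeadTracker(), ComputeViewFromPose(), and SubmitView() calls and a predicted vsync time supplied by the caller; the spin loop at the end is exactly the busy waiting mentioned above:</p> <pre><code>// Sketch of late frame scheduling: defer the latency-sensitive view update
// until a deadline just before the predicted vsync, and only sample the
// sensor at that point. Device-specific calls are hypothetical placeholders.
#include &lt;chrono&gt;

using Clock = std::chrono::steady_clock;

struct HeadPose   { float qx, qy, qz, qw; };
struct ViewMatrix { float m[16]; };

extern HeadPose   SampleHeadTracker();                   // read the sensor
extern ViewMatrix ComputeViewFromPose(const HeadPose&amp;);  // cheap, deterministic
extern void       SubmitView(const ViewMatrix&amp;);         // hand off to rendering

void FinishFrame(Clock::time_point predicted_vsync) {
    // ... general simulation and rendering work for this frame happens earlier ...

    // Leave just enough margin for the (deterministic) view computation.
    auto view_deadline = predicted_vsync - std::chrono::milliseconds(2);

    // Busy-wait to hit the deadline accurately; sleeping is usually too
    // coarse, which is where the wasted power comes from.
    while (Clock::now() &lt; view_deadline) { /* spin */ }

    SubmitView(ComputeViewFromPose(SampleHeadTracker()));
}
</code></pre> <p>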
If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.</p> <h4 id="view-bypass">View bypass</h4> <p>An alternate way of accomplishing a similar, or slightly greater latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.</p> <p>At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.</p> <p>Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in a HMD is disorienting and unpleasant, but conventional game design has many such cases.</p> <p>A binary flag could be provided to disable the bypass calculation, but it is useful to generalize such that the game provides an object or function with embedded state that produces rendering parameters from sensor input data instead of having the game provide the view parameters themselves. In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information such as a head/neck positioning model that modified position based on orientation, or lists of other models to be positioned relative to the updated view.</p> <p>If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.</p> <pre><code>View bypass: CPU1:ISSSSSSSSS------| CPU2: |IRRRRRRRRR------| GPU : |--GGGGGGGGGG----| VID : | |VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds </code></pre> <p>The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.</p> <p>The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.</p> <p>A modified form of view bypass could allow tile based GPUs to achieve similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles by the following steps:</p> <p>Inhibit the execution of GPU commands, forcing them to be buffered.
OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.</p> <p>All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.</p> <p>After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.</p> <p>Kick off the draw command execution.</p> <pre><code>Tiler optimized view bypass: CPU1:ISSSSSSSSS------| CPU2: |IRRRRRRRRRR-----|I GPU : | |-GGGGGGGGGG-----| VID : | | |VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds </code></pre> <p>Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task. This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary. Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.</p> <h4 id="time-warping">Time warping</h4> <p>If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.</p> <pre><code>Late frame input sampled view bypass: CPU1:ISSSSSSSSS------| CPU2: |----IRRRRRRRRR--| GPU : |------GGGGGGGGGG| VID : | |VVVVVVVVVVVVVVVV| .............. latency 12 – 28 milliseconds </code></pre> <p>However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily. Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.</p> <p>After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters. Using that transform, warp the rendered image into an updated form on screen that reflects the new input. If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.</p> <pre><code>Late frame scheduled time warp: CPU1:ISSSSSSSSS------| CPU2: |RRRRRRRRRR----IR| GPU : |-GGGGGGGGGG----G| VID : | |VVVVVVVVVVVVVVVV| .... latency 2 – 18 milliseconds </code></pre> <p>If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. 
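</p> <p>One way the reprojection described above might be expressed, as a sketch that assumes GLM for the matrix math (in practice this transform would live in a shader): a pixel plus its depth value is lifted back through the inverse of the view-projection matrix used for rendering, then projected again with the newer view parameters:</p> <pre><code>// Sketch: reproject one pixel from the rendered view to the latest view.
// GLM is an assumed dependency; renderViewProj and latestViewProj are the
// view-projection matrices at render time and at warp time.
#include &lt;glm/glm.hpp&gt;

glm::vec2 WarpPixel(glm::vec2 ndcXY,          // pixel position in NDC, [-1, 1]
                    float depthNdc,           // depth buffer value remapped to [-1, 1]
                    const glm::mat4&amp; renderViewProj,
                    const glm::mat4&amp; latestViewProj) {
    // Back-project to a homogeneous world-space position.
    glm::vec4 world = glm::inverse(renderViewProj) * glm::vec4(ndcXY, depthNdc, 1.0f);

    // Re-project with the newer view parameters and divide by w.
    glm::vec4 reprojected = latestViewProj * world;
    return glm::vec2(reprojected) / reprojected.w;
}
</code></pre> <p>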
Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.</p> <p>If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges. In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case. This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.</p> <p>If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to their locations. This can be mitigated by rendering a larger field of view than the destination requires; but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.</p> <p>A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations. At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost. Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.</p> <p>Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader. It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.</p> <p>If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.</p> <p>Reverse mapping also allows the possibility of modifying the warp through the video scanout. The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix. The warp to be applied can be interpolated between the two of them based on the pixel row being processed. 
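</p> <p>The per-row blend can be sketched as follows (again assuming GLM; warpTopRow and warpBottomRow would be built from view parameters predicted for the scanout times of the first and last rows, and on real hardware the blend would normally be done in the warp shader):</p> <pre><code>// Sketch: interpolate between two warp matrices by the row being warped.
// Prediction of the two sets of view parameters is not shown.
#include &lt;glm/glm.hpp&gt;

glm::mat4 WarpForRow(int row, int screenHeight,
                     const glm::mat4&amp; warpTopRow,
                     const glm::mat4&amp; warpBottomRow) {
    float t = static_cast&lt;float&gt;(row) / static_cast&lt;float&gt;(screenHeight - 1);
    // Component-wise blend; adequate for the small per-frame changes involved.
    return warpTopRow + (warpBottomRow - warpTopRow) * t;
}
</code></pre> <p>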
This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.</p> <h4 id="continuously-updated-time-warping">Continuously updated time warping</h4> <p>If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.</p> <pre><code>Continuous time warp: CPU1:ISSSSSSSSS------| CPU2: |RRRRRRRRRRR-----| GPU : |-GGGGGGGGGGGG---| WARP: | W| W W W W W W W W| VID : | |VVVVVVVVVVVVVVVV| ... latency 2 – 3 milliseconds for 500hz sensor updates </code></pre> <p>The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.</p> <p>Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.</p> <p>A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.</p> <p>GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.</p> <h4 id="discussion">Discussion</h4> <p>View bypass and time warping are complementary techniques that can be applied independently or together. Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.</p> <p>Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with low latency sensors and displays, this will generally be perceived as immediate. 
Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.</p> <p>Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.</p> <p>A particularly interesting application is in “cloud gaming”, where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game. This offers significant convenience benefits for users, but the inherent network and compression latencies makes it a lower quality experience for action oriented titles. View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network. If the cloud gaming client was made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Zenimax for allowing me to publish this openly.</p> <p>Hillcrest Labs for inertial sensors and experimental firmware.</p> <p>Emagin for access to OLED displays.</p> <p>Oculus for a prototype Rift HMD.</p> <p>Nvidia for an experimental driver with access to the current scan line number.</p> Kara Swisher interview of Jack Dorsey karajack/ Tue, 12 Feb 2013 00:00:00 +0000 karajack/ <p><i>This is a transcript of the Kara Swisher / Jack Dorsey interview from 2/12/2019, made by parsing the original Tweets because I wanted to be able to read this linearly. There's a &quot;moment&quot; that tries to track this, but since it doesn't distinguish between sub-threads in any way, you can't tell the difference between end of a thread and a normal reply. This linearization of the interview marks each thread break with a page break and provides some context from upthread where relevant (in <font color="#808080">grey text</font>).</i></p> <p><a href="https://twitter.com/karaswisher/status/1095440667373899776"><strong>Kara</strong></a>: Here in my sweatiest @soulcycle outfit for my Twitterview with @jack with @Laur_Katz at the ready @voxmediainc HQ. Also @cheezit acquired. #karajack</p> <p><a href="https://twitter.com/karaswisher/status/1095442716346011649"><strong>Kara</strong></a>: Oh hai @jack. Let’s set me set the table. First, I am uninterested in beard amulets or weird food Mark Zuckerberg served you (though WTF with both for my personal self). Second, I would appreciate really specific answers.</p> <p><a href="https://twitter.com/jack/status/1095442976497496064">Jack</a>: Got you. Here’s my setup. I work from home Tuesdays. In my kitchen. Tweetdeck. No one here with me, and no one connected to my tweetdeck. 
Just me focused on your questions!</p> <p><a href="https://twitter.com/karaswisher/status/1095443173260681217"><strong>Kara</strong></a>: Great, let's go</p> <p><a href="https://twitter.com/jack/status/1095443608340115457">Jack</a>: Ready</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095442857987661825"><strong>Kara</strong></a>: As @ashleyfeinberg wrote: “press him for a clear, unambiguous example of nearly anything, and Dorsey shuts down.” That is not unfair characterization IMHO. Third, I will thread in questions from audience, but to keep this non chaotic, let’s stay in one reply thread.</p> <p><a href="https://twitter.com/jack/status/1095443050271145984">Jack</a>: Deal</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095442999616724993"><strong>Kara</strong></a>: To be clear with audience, there is not a new event product, a glass house, if you will, where people can see us but not comment. I will ask questions and then respond to @jack answers. So it could be CHAOS.</p> <p><a href="https://twitter.com/jack/status/1095443402131271680">Jack</a>: To be clear, we’re interested in an experience like this. Nothing built yet. This gives us a sense of what it would be like, and what we’d need to focus on. If there’s something here at all!</p> <p><a href="https://twitter.com/karaswisher/status/1095443579147902977"><strong>Kara</strong></a>: Well an event product WOULD BE NICE. See my why aren't you moving faster trope.</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095443416148787202"><strong>Kara</strong></a>: Overall here is my mood and I think a lot of people when it comes to fixing what is broke about social media and tech: Why aren’t you moving faster? Why aren’t you moving faster? Why aren’t you moving faster?</p> <p><a href="https://twitter.com/jack/status/1095444282742173696">Jack</a>: A question we ask ourselves all the time. In the past I think we were trying to do too much. We’re better at prioritizing by impact now. Believe the #1 thing we should focus on is someone’s physical safety first. That one statement leads to a lot of ramifications.</p> <p><a href="https://twitter.com/karaswisher/status/1095444602545418246"><strong>Kara</strong></a>: It seems twitter has been stuck in a stagnant phase of considering/thinking about the health of the conversation, which plays into safety, for about 18-24 months. How have you made actual progress? Can you point me to it SPECIFICALLY?</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095443704448397312"><strong>Kara</strong></a>: You know my jam these days is tech responsibility. What grade do you gave Silicon Valley? Yourself?</p> <p><a href="https://twitter.com/jack/status/1095444570060419072">Jack</a>: Myself? C. We’ve made progress, but it has been scattered and not felt enough. Changing the experience hasn’t been meaningful enough. And we’ve put most of the burden on the victims of abuse (that’s a huge fail).</p> <p><a href="https://twitter.com/karaswisher/status/1095444895156789253"><strong>Kara</strong></a>: Well that is like telling me I am sick and am responsible for fixing it. YOU made the product, YOU run the platform. Saying it is a huge fail is a cop out to many. It is to me</p> <p><a href="https://twitter.com/jack/status/1095445502131195904">Jack</a>: Putting the burden on victims? Yes. It’s recognizing that we have to be proactive in enforcement and promotion of healthy conversation. This is our first priority in #health. 
We have to change a lot of the fundamentals of product to fix.</p> <p><a href="https://twitter.com/karaswisher/status/1095445998288207872"><strong>Kara</strong></a>: please be specific. I see a lot of beard-stroking on this (no insult to your Lincoln jam, but it works). WHAT are you changing? SPECIFICALLY.</p> <p><a href="https://twitter.com/jack/status/1095446428065837056">Jack</a>: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.</p> <p><a href="https://twitter.com/karaswisher/status/1095446571905441793"><strong>Kara</strong></a>: Ok name three initiatives.</p> <hr> <p><p style="color:#808080">Jack: Myself? C. We’ve made progress, but it has been scattered and not felt enough. Changing the experience hasn’t been meaningful enough. And we’ve put most of the burden on the victims of abuse (that’s a huge fail). </color></p> <p><a href="https://twitter.com/karaswisher/status/1095445631278243840"><strong>Kara</strong></a>: Also my son gets a C in coding and that is NO tragedy. You getting one matters a lot.</p> <p><a href="https://twitter.com/jack/status/1095446128844189696">Jack</a>: Agree it matters a lot. And it’s the most important thing we need to address and fix. I’m stating that it’s a fail of ours to put the majority of burden on victims. That’s how the service works today.</p> <p><a href="https://twitter.com/karaswisher/status/1095446310247907328"><strong>Kara</strong></a>: Ok but I really want to drill down on HOW. How much downside are you willing to tolerate to balance the good that Twitter can provide? Be specific</p> <p><a href="https://twitter.com/jack/status/1095447340809220097">Jack</a>: This is exactly the balance we have to think deeply about. But in doing so, we have to look at how the product works. And where abuse happens the most: replies, mentions, search, and trends. Those are the shared spaces people take advantage of</p> <p><a href="https://twitter.com/karaswisher/status/1095447844595662852"><strong>Kara</strong></a>: Well, WHERE does abuse happen most</p> <p><a href="https://twitter.com/jack/status/1095448481215336448">Jack</a>: Within the service? Likely within replies. That’s why we’ve been more aggressive about proactively downranking behind interstitials, for example.</p> <p><a href="https://twitter.com/karaswisher/status/1095448920757600258"><strong>Kara</strong></a>: Why not just be more stringent on kicking off offenders? It seems like you tolerate a lot. If Twitter ran my house, my kids would be eating ramen, playing Red Dead Redemption 2 and wearing filthy socks</p> <p><a href="https://twitter.com/jack/status/1095449592710148096">Jack</a>: We action all we can against our policies. Most of our system today works reactively to someone reporting it. If they don’t report, we don’t see it. Doesn’t scale. Hence the need to focus on proactive</p> <p><a href="https://twitter.com/karaswisher/status/1095450001101209600"><strong>Kara</strong></a>: But why did you NOT see it? It seems pretty basic to run your platform with some semblance of paying mind to what people are doing on it? Can you give me some insight into why that was not done?</p> <p><a href="https://twitter.com/jack/status/1095450592200151040">Jack</a>: I think we tried to do too much in the past, and that leads to diluted answers and nothing impactful. There’s a lot we need to address globally. We have to prioritize our resources according to impact. 
Otherwise we won’t make much progress.</p> <p><a href="https://twitter.com/karaswisher/status/1095451260088668161"><strong>Kara</strong></a>: Got it. But do you think the fact that you all could not conceive of what it is to feel unsafe (women, POC, LGBTQ, other marginalized people) could be one of the issues? (new topic soon)</p> <p><a href="https://twitter.com/jack/status/1095451822049783808">Jack</a>: I think it’s fair and real. No question. Our org has to be reflective of the people we’re trying to serve. One of the reason we established the Trust and Safety council years ago, to get feedback and check ourselves.</p> <p><a href="https://twitter.com/karaswisher/status/1095451952253677569"><strong>Kara</strong></a>: Yes but i want THREE concrete examples.</p> <hr> <p><p style="color:#808080">Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.</color></p> <p><a href="https://twitter.com/karaswisher/status/1095447228615983106"><strong>Kara</strong></a>: Or maybe, tell me what you think the illness is you are treating? I think you cannot solve a disease without knowing that. Or did you create the virus?</p> <p><a href="https://twitter.com/jack/status/1095447781571932160">Jack</a>: Good question. This is why we’re focused on understanding what conversational health means. We see a ton of threats to health in digital conversation. We’re focuse first on off-platform ramifications (physical safety). That clarifies priorities of policy and enforcement.</p> <p><a href="https://twitter.com/karaswisher/status/1095448255301861376"><strong>Kara</strong></a>: I am still confused. What the heck is &quot;off-platform ramifications&quot;? You are not going to have a police force, right? Are you 911?</p> <p><a href="https://twitter.com/jack/status/1095448881054150656">Jack</a>: No, not a police force. I mean we have to consider first and foremost what online activity does to impact physical safety, as a way to prioritize our efforts. I don’t think companies like ours have admitted or focused on that enough.</p> <p><a href="https://twitter.com/karaswisher/status/1095449262639513600"><strong>Kara</strong></a>: So you do see the link between what you do and real life danger to people? Can you say that explicitly? I could not be @finkd to even address the fact that he made something that resulted in real tragedy.</p> <p><a href="https://twitter.com/jack/status/1095450020193525761">Jack</a>: I see the link, and that’s why we need to put physical safety above all else. That’s what we’re figuring out how to do now. We don’t have all the answers just yet. But that’s the focus. I think it clarifies a lot of the work we need to do. Not all of it of course.</p> <p><a href="https://twitter.com/karaswisher/status/1095450868902703106"><strong>Kara</strong></a>: I grade you all an F on this and that's being kind. I'm not trying to be a jackass, but it's been a very slow roll by all of you in tech to pay attention to this. Why do you think that is? I think it is because many of the people who made Twitter never ever felt unsafe.</p> <p><a href="https://twitter.com/jack/status/1095451280443506689">Jack</a>: Likely a reason. 
I’m certain lack of diversity didn’t help with empathy of what people experience on Twitter every day, especially women.</p> <p><a href="https://twitter.com/karaswisher/status/1095451773983227904"><strong>Kara</strong></a>: And so to end this topic, I will try again. Please give me three concrete things you have done to fix this. SPECIFIC.</p> <p><a href="https://twitter.com/jack/status/1095452345423429632">Jack</a>: 1. We have evolved our polices. 2. We have prioritized proactive enforcement to remove burden from victims 3. We have given more control in product (like mute of accounts without profile pics or associated phone/emails) 4. Much more aggressive on coordinated behavior/gaming</p> <p><a href="https://twitter.com/karaswisher/status/1095452756331106307"><strong>Kara</strong></a>: 1. WHICH? 2. HOW? 3. OK, MUTE BUT THAT WAS A WHILE AGO 4. WHAT MORE? I think people are dying for specifics.</p> <p><a href="https://twitter.com/jack/status/1095453539650830337">Jack</a>: 1. Misgendering policy as example. 2. Using ML to downrank bad actors behind interstitials 3. Not too long ago, but most of our work going forward will have to be product features. 4. Not sure the question. We put an entire model in place to minimize gaming of system.</p> <p><a href="https://twitter.com/karaswisher/status/1095453806442266624"><strong>Kara</strong></a>: thx. I meant even more specifics on 4. But see the Twitter purge one.</p> <p><a href="https://twitter.com/jack/status/1095454064047845376">Jack</a>: Just resonded to that. Don’t see the twitter purge one</p> <p><a href="https://twitter.com/karaswisher/status/1095454371494719493"><strong>Kara</strong></a>: I wanted to get off thread with Mark added! Like he needs more of me.</p> <p><a href="https://twitter.com/jack/status/1095454523311644672">Jack</a>: Does he check this much?</p> <p><a href="https://twitter.com/karaswisher/status/1095454681034342400"><strong>Kara</strong></a>: No, he is busy fixing Facebook. NOT! (he makes you look good)</p> <p><a href="https://twitter.com/karaswisher/status/1095455420020375552"><strong>Kara</strong></a>: I am going to start a NEW thread to make it easy for people to follow (@waltmossberg just texted me that it is a &quot;chaotic hellpit&quot;). Stay in that one. OK?</p> <p><a href="https://twitter.com/jack/status/1095455732458110976">Jack</a>: Ok. Definitely not easy to follow the conversation. Exactly why we are doing this. Fixing stuff like this will help I believe.</p> <p><a href="https://twitter.com/karaswisher/status/1095455941133324288"><strong>Kara</strong></a>: Yeah, it's Chinatown, Jake.</p> <hr> <p><p style="color:#808080">Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.</color></p> <p><a href="https://twitter.com/jack/status/1095446924855898112">Jack</a>: Second, we’re constantly evolving our policies to address the issues we see today. We’re rooting them in fundamental human rights (UN) and putting physical safety as our top priority. Privacy next.</p> <p><a href="https://twitter.com/karaswisher/status/1095447726794424320"><strong>Kara</strong></a>: When you say physical safety, I am confused. What do you mean specifically? You are not a police force. In fact, social media companies have built cities without police, fire departments, garbage pickup or street signs. 
IMHO What do you think of that metaphor?</p> <p><a href="https://twitter.com/jack/status/1095448163479998464">Jack</a>: I mean off platform, offline ramifications. What people do offline with what they see online. Doxxing is a good example which threatens physical safety. So does coordinate harassment campaigns.</p> <p><a href="https://twitter.com/karaswisher/status/1095448510319726598"><strong>Kara</strong></a>: So how do you stop THAT? I mean regular police forces cannot stop that. It seems your job is not to let it get that far in the first place.</p> <p><a href="https://twitter.com/jack/status/1095449174873563136">Jack</a>: Exactly. What can we do within the product and policy to lower probability. Again, don’t think we or others have worked against that enough.</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095456464855019520"><strong>Kara</strong></a>: Ok, new one @jack</p> <p>What do you think about twitter breaks and purges. Why do you think that is? I can’t say I’ve heard many people say they feel “good” after not being on twitter for a while: <a href="https://twitter.com/TaylorLorenz/status/1095039347596898305">https://twitter.com/TaylorLorenz/status/1095039347596898305</a></p> <p><a href="https://twitter.com/jack/status/1095457041844334593">Jack</a>: Feels terrible. I want people to walk away from Twitter feeling like they learned something and feeling empowered to some degree. It depresses me when that’s not the general vibe, and inspires me to figure it out. That’s my desire</p> <p><a href="https://twitter.com/karaswisher/status/1095457517033963521"><strong>Kara</strong></a>: But why do they feel that way? You made it.</p> <p><a href="https://twitter.com/jack/status/1095458365516374016">Jack</a>: We made something with one intent. The world showed us how it wanted to use it. A lot has been great. A lot has been unexpected. A lot has been negative. We weren’t fast enough to observe, learn, and improve</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: Ok, new one @jack</color></p> <p><a href="https://twitter.com/karaswisher/status/1095456686977019905"><strong>Kara</strong></a>: In that vein, how does it affect YOU?</p> <p><a href="https://twitter.com/jack/status/1095457906848231424">Jack</a>: I also don’t feel good about how Twitter tends to incentivize outrage, fast takes, short term thinking, echo chambers, and fragmented conversation and consideration. Are they fixable? I believe we can do a lot to address. And likely have to change more fundamentals to do so.</p> <p><a href="https://twitter.com/karaswisher/status/1095458590679289857"><strong>Kara</strong></a>: But you invented it. You can control it. Slowness is not really a good excuse.</p> <p><a href="https://twitter.com/jack/status/1095459084785004544">Jack</a>: It’s the reality. We tried to do too much at once and were not focused on what matters most. That contributes to slowness. As does our technology stack and how quickly we can ship things. That’s improved a lot recently</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095458179285172225"><strong>Kara</strong></a>: Ok trying AGAIN @jack in another new thread! This one about @realDonaldTrump:</p> <p>We know a lot more about what Donald Trump thinks because of Twitter, and we all have mixed feelings about that.</p> <p><a href="https://twitter.com/karaswisher/status/1095459010751410176"><strong>Kara</strong></a>: Have you ever considered suspending Donald Trump? 
His tweets are somewhat protected because he’s a public figure, but would he have been suspended in the past if he were a “regular” user?</p> <p><a href="https://twitter.com/jack/status/1095459760567083008">Jack</a>: We hold all accounts to the same terms of service. The most controversial aspect of our TOS is the newsworthy/public interest clause, the “protection” you mention. That doesn’t extend to all public figures by default, but does speak to global leaders and seeing how they think.</p> <p><a href="https://twitter.com/karaswisher/status/1095460546831417345"><strong>Kara</strong></a>: That seems questionable to a lot of people. Let me try it a different way: What historic newsworthy figure would you ban? Is someone bad enough to ban. Be specific. A name.</p> <p><a href="https://twitter.com/jack/status/1095461409624711168">Jack</a>: We have to enforce based on our policy and what people do on our service. And evolve it with the current times. No way I can answer that based on people. Has to be focused on patterns of how people use the technology.</p> <p><a href="https://twitter.com/karaswisher/status/1095461706044768256"><strong>Kara</strong></a>: Not one name? Ok, but it is a copout imho. I have a long list.</p> <p><a href="https://twitter.com/jack/status/1095462135717482502">Jack</a>: I think it’s more durable to focus on use cases because that allows us to act broader. Likely that these aren’t isolated cases but things that spread</p> <p><a href="https://twitter.com/karaswisher/status/1095462374612582401"><strong>Kara</strong></a>: it would be really great to get specific examples as a lot of what you are doing appears incomprehensible to many.</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: Ok trying AGAIN @jack in another new thread! This one about @realDonaldTrump:</color></p> <p><a href="https://twitter.com/karaswisher/status/1095459296492560384"><strong>Kara</strong></a>: And will Twitter’s business/engagement suffer when @realDonaldTrump is no longer President?</p> <p><a href="https://twitter.com/jack/status/1095460252852482048">Jack</a>: I don’t believe our service or business is dependent on any one account or person. I will say the number of politics conversations has significantly increased because of it, but that’s just one experience on Twitter. There are multiple Twitters, all based on who you follow.</p> <p><a href="https://twitter.com/karaswisher/status/1095461276686471168"><strong>Kara</strong></a>: Ok new question (answer the newsworthy historical figure you MIGHT ban pls): Single biggest improvement at Twitter since 2016 that signals you’re ready for the 2020 elections?</p> <p><a href="https://twitter.com/jack/status/1095461785610534912">Jack</a>: Our work against automations and coordinated campaigns. Partnering with government agencies to improve communication around threats</p> <p><a href="https://twitter.com/karaswisher/status/1095461973926518784"><strong>Kara</strong></a>: Can you give a more detailed example of that that worked?</p> <p><a href="https://twitter.com/jack/status/1095462610462433280">Jack</a>: We shared a retro on 2018 within this country, and tested a lot with the Mexican elections too. Indian elections coming up. 
In mid-terms we were able to monitor efforts to disrupt both online and offline and able to stop those actions on Twitter.</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: Ok new question (answer the newsworthy historical figure you MIGHT ban pls): Single biggest improvement at Twitter since 2016 that signals you’re ready for the 2020 elections?</color></p> <p><a href="https://twitter.com/karaswisher/status/1095461526935429120"><strong>Kara</strong></a>: What confidence should we have that Russia or other state-sponsored actors won’t be able to wreak havoc on next year’s elections?</p> <p><a href="https://twitter.com/jack/status/1095463103561490432">Jack</a>: We should expect a lot more coordination between governments and platforms to address. That would give me confidence. And have some skepticism too. That’s healthy. The more we can do this work in public and share what we find, the better</p> <p><a href="https://twitter.com/karaswisher/status/1095463591099224064"><strong>Kara</strong></a>: I still am dying for specifics here. [meme image: Give me some specifics. I love specifics, the specifics were the best part!]</p> <hr> <p><p style="color:#808080">Jack: I think it’s more durable to focus on use cases because that allows us to act broader. Likely that these aren’t isolated cases but things that spread</color></p> <p><a href="https://twitter.com/karaswisher/status/1095462715332665351"><strong>Kara</strong></a>: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?</p> <p><a href="https://twitter.com/jack/status/1095463620190789632">Jack</a>: We want to be valuable to people daily. Not monthly. It’s a higher bar for ourselves. Sure, it looks like a smaller absolute number, but the folks we have using Twitter are some of the most influential in the world. They drive conversation. We belevie we can best grow this.</p> <p><a href="https://twitter.com/karaswisher/status/1095463921748721664"><strong>Kara</strong></a>: Ok, then WHO is the most exciting influential on Twitter right now? BE SPECIFIC</p> <p><a href="https://twitter.com/jack/status/1095465039438393344">Jack</a>: To me personally? I like how @elonmusk uses Twitter. He’s focused on solving existential problems and sharing his thinking openly. I respect that a lot, and all the ups and downs that come with it</p> <p><a href="https://twitter.com/karaswisher/status/1095465160167301121"><strong>Kara</strong></a>: What about @AOC</p> <p><a href="https://twitter.com/jack/status/1095465322071486464">Jack</a>: Totally. She’s mastering the medium</p> <p><a href="https://twitter.com/karaswisher/status/1095465494407270400"><strong>Kara</strong></a>: She is well beyond mastering it. She speaks fluent Twitter.</p> <p><a href="https://twitter.com/jack/status/1095465722862395392">Jack</a>: True</p> <p><a href="https://twitter.com/karaswisher/status/1095465850423951360"><strong>Kara</strong></a>: Also are you ever going to hire someone to effectively be your number 2?</p> <p><a href="https://twitter.com/jack/status/1095466504689115136">Jack</a>: I think it’s better to spread that responsibility across multiple people. 
It creates less dependencies and the company gets more options around future leadership</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?</color></p> <p><a href="https://twitter.com/karaswisher/status/1095463105021308929"><strong>Kara</strong></a>: Also: How close were you to selling Twitter in 2016? What happened?</p> <p>What about giving the company to a public trust per your NYT discussion.</p> <p><a href="https://twitter.com/jack/status/1095464676094205952">Jack</a>: We ultimately decided we were better off independent. And I’m happy we did. We’ve made a lot of progress since that point. And we got a lot more focused. Definitely love the idea of opening more to 3rd parties. Not sure what that looks like yet. Twitter is close to a protocol.</p> <p><a href="https://twitter.com/karaswisher/status/1095464894705684481"><strong>Kara</strong></a>: Chop chop on the other answers! I have more questions! If you want to use this method, quicker!</p> <p><a href="https://twitter.com/jack/status/1095465159202467842">Jack</a>: I’m moving as fast as I can Kara</p> <p><a href="https://twitter.com/karaswisher/status/1095465222951895040"><strong>Kara</strong></a>: Clip clop!</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?</color></p> <p><a href="https://twitter.com/karaswisher/status/1095462909990309888"><strong>Kara</strong></a>: also: Is twitter still considering a subscription service? Like “Twitter Premium” or something?</p> <p><a href="https://twitter.com/jack/status/1095464073771114496">Jack</a>: Always going to experiment with new models. Periscope has super hearts, which allows us to learn about direct contribution. We’d need to figure out the value exchange on subscription. Has to be really high for us to charge directly</p> <hr> <p><p style="color:#808080">Jack: Totally. She’s mastering the medium</color></p> <p><a href="https://twitter.com/karaswisher/status/1095465790483116032"><strong>Kara</strong></a>: Ok, last ones are about you and we need to go long because your system here it confusing says the people of Twitter:</p> <ol> <li>What has been Twitter’s biggest missed opportunity since you came back as CEO?</li> </ol> <p><a href="https://twitter.com/jack/status/1095466230427742208">Jack</a>: Focus on conversation earlier. We took too long to get there. Too distracted.</p> <p><a href="https://twitter.com/karaswisher/status/1095466622956064770"><strong>Kara</strong></a>: By what? What is the #1 thing that distracted you and others and made this obvious mess via social media?</p> <p><a href="https://twitter.com/jack/status/1095467434440417282">Jack</a>: Tried to do too much at once. Wasn’t focused on what our one core strength was: conversation. That lead to really diluted strategy and approach. 
And a ton of reactiveness.</p> <p><a href="https://twitter.com/karaswisher/status/1095467977208668160"><strong>Kara</strong></a>: Speaking of that (CONVERSATION), let's do one with sounds soon, like this</p> <p><a href="https://www.youtube.com/watch?v=oiJkANps0Qw">https://www.youtube.com/watch?v=oiJkANps0Qw</a></p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: She is well beyond mastering it. She speaks fluent Twitter.</color></p> <p><p style="color:#808080">Jack: True</color></p> <p><a href="https://twitter.com/karaswisher/status/1095465976626380801"><strong>Kara</strong></a>: Why are you still saying you’re the CEO of two publicly traded companies? What’s the point in insisting you can do two jobs that both require maximum effort at the same time?</p> <p><a href="https://twitter.com/jack/status/1095467046949642240">Jack</a>: I’m focused on building leadership in both. Not my desire or ambition to be CEO of multiple companies just for the sake of that. I’m doing everything I can to help both. Effort doesn’t come down to one person. It’s a team</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095467352395825152"><strong>Kara</strong></a>: LAST Q: For the love of God, please do Recode Decode podcast with me soon, because analog talking seems to be a better way of asking questions and giving answers. I think Twitter agrees and this has shown how hard this thread is to do. That said, thx for trying. Really.</p> <p><a href="https://twitter.com/jack/status/1095468327978205184">Jack</a>: This thread was hard. But we got to learn a ton to fix it. Need to make this feel a lot more cohesive and easier to follow. Was extremely challenging. Thank you for trying it with me. Know it wasn’t easy. Will consider different formats!</p> <p><a href="https://twitter.com/karaswisher/status/1095468672397832192"><strong>Kara</strong></a>: Make a glass house for events and people can watch and not throw stones. Pro tip: Twitter convos are wack</p> <p><a href="https://twitter.com/jack/status/1095469020508082177">Jack</a>: Yep. And they don’t have to be wack. Need to figure this out. This whole experience is a problem statement for what we need to fix</p> <hr> <p><p style="color:#808080">Jack: This thread was hard. But we got to learn a ton to fix it. Need to make this feel a lot more cohesive and easier to follow. Was extremely challenging. Thank you for trying it with me. Know it wasn’t easy. Will consider different formats!</color></p> <p><a href="https://twitter.com/karaswisher/status/1095469018310422529"><strong>Kara</strong></a>: My kid is hungry and says that you should do a real interview with me even if I am mean. Just saying.</p> <p><a href="https://twitter.com/jack/status/1095469279514750977">Jack</a>: I don’t think you’re mean. Always good to experiment.</p> <p><a href="https://twitter.com/karaswisher/status/1095469622130888705"><strong>Kara</strong></a>: Neither does my kid. He just wants to go get dinner</p> <p><a href="https://twitter.com/jack/status/1095469788439035905">Jack</a>: Go eat! Thanks, Kara</p> Jonathan Shapiro's Retrospective Thoughts on BitC bitc-retrospective/ Fri, 23 Mar 2012 00:00:00 +0000 bitc-retrospective/ <p><em>This is an archive of the Jonathan Shapiro's &quot;Retrospective Thoughts on BitC&quot; that seems to have disappeared from the internet; at the time, BitC was aimed at the same niche as Rust</em></p> <pre><code>Jonathan S. 
Shapiro shap at eros-os.org Fri Mar 23 15:06:41 PDT 2012 </code></pre> <p>By now it will be obvious to everyone that I have stopped work on BitC. An explanation of <em>why</em> seems long overdue.</p> <p>One answer is that work on Coyotos stopped when I joined Microsoft, and the work that I am focused on <em>now</em> doesn't really require (or seem to benefit from) BitC. As we all know, there is only so much time to go around in our lives. But that alone wouldn't have stopped me entirely.</p> <p>A second answer is that BitC isn't going to work in its current form. I had hit a short list of issues that required a complete re-design of the language and type system followed by a ground-up new implementation. Experience with the first implementation suggested that this would take quite a while, and it was simply more than I could afford to take on without external support and funding. Programming language work is not easy to fund.</p> <p>But the third answer may be of greatest interest, which is that I no longer believe that type classes &quot;work&quot; in their current form from the standpoint of language design. That's the only important science lesson here.</p> <p>In the large, there were four sticking points for the current design:</p> <ol> <li>The compilation model.</li> <li>The insufficiency of the current type system w.r.t. by-reference and reference types.</li> <li>The absence of some form of inheritance.</li> <li>The instance coherence problem.</li> </ol> <p>The first two issues are in my opinion solvable, though the second requires a nearly complete re-implementation of the compiler. The last (instance coherence) does not appear to admit any general solution, and it raises conceptual concerns about the use of type classes for method overload in my mind. It's sufficiently important that I'm going to deal with the first three topics here and take up the last as a separate note.</p> <p>Inheritance is something that people on the BitC list might (and sometimes have) argue about strongly. So a few brief words on the subject may be relevant.</p> <h3 id="prefacing-comments-on-objects-inheritance-and-purity">Prefacing Comments on Objects, Inheritance, and Purity</h3> <p>BitC was initially designed as an [imperative] functional language because of our focus on software verification. Specification of the typing and semantics of functional languages is an area that has a <em>lot</em> of people working on it. We (as a field) <em>kind</em> of know how to do it, and it was an area where our group at Hopkins didn't know very much when we started. Software verification is a known-hard problem, doing it over an imperative language was already a challenge, and this didn't seem like a good place for a group of novice language researchers to buck the current trends in the field. Better, it seemed, to choose our battles. We knew that there were interactions between inheritance and inference, and it <em>appeared</em> that type classes with clever compilation could achieve much of the same operational results. I therefore decided early <em>not</em> to include inheritance in the language.</p> <p>To me, as a programmer, the removal of inheritance and objects was a very reluctant decision, because it sacrificed any possibility of transcoding the large body of existing C++ code into a safer language. And as it turns out, you can't really remove the underlying semantic challenges from a successful systems language. A systems language <em>requires</em> some mechanism for existential encapsulation. 
The <em>mechanism</em> which embodies that encapsulation isn't really the issue; once you introduce that sort of encapsulation, you bring into play most of the verification issues that objects with subtyping bring into play, and once you do that, you might as well gain the benefit of objects. The remaining issue, in essence, is the modeling of the Self type, and for a range of reasons it's fairly essential to have a Self type in a systems language once you introduce encapsulation. So you end up pushed in to an object type system at some point in <em>any</em> case. With the benefit of eight years of hindsight, I can now say that this is perfectly obvious!</p> <p>I'm strongly of the opinion that multiple inheritance is a mess. The argument pro or con about single inheritance still seems to me to be largely a matter of religion. Inheritance and virtual methods certainly aren't the <em>only</em> way to do encapsulation, and they may or may not be the best primitive mechanism. I have always been more interested in getting a large body of software into a safe, high-performance language than I am in innovating in this area of language design. If transcoding current code is any sort of goal, we need something very similar to inheritance.</p> <p>The last reason we left objects out of BitC initially was purity. I wanted to preserve a powerful, pure subset language - again to ease verification. The object languages that I knew about at the time were heavily stateful, and I couldn't envision how to do a non-imperative object-oriented language. Actually, I'm <em>still</em> not sure I can see how to do that practically for the kinds of applications that are of interest for BitC. But as our faith in the value of verification declined, my personal willingness to remain restricted by purity for the sake of verification decayed quickly.</p> <p>The <em>other</em> argument for a pure subset language has to do with advancing concurrency, but as I really started to dig in to concurrency support in BitC, I came increasingly to the view that this approach to concurrency isn't a good match for the type of concurrent problems that people are actually trying to solve, and that the needs and uses for non-mutable state in practice are a lot more nuanced than the pure programming approach can address. Pure subprograms clearly play an important role, but they aren't enough.</p> <p>And I <em>still</em> don't believe in monads. :-)</p> <h3 id="compilation-model">Compilation Model</h3> <p>One of the objectives for BitC was to obtain acceptable performance under a conventional, static separate compilation scheme. It may be short-sighted on my part, but complex optimizations at run-time make me very nervous from the standpoint of robustness and assurance. I understand that bytecode virtual machines today do very aggressive optimizations with considerable success, but there are a number of concerns with this:</p> <ul> <li>For a robust language, we want to <em>minimize</em> the size and complexity of the code that is exempted from type checking and [eventual] verification. Run-time code is excepted in this fashion. The garbage collector taken alone is already large enough to justify assurance concerns. Adding a large and complex optimizer to the pile drags the credibility of the assurance story down immeasurably.</li> <li>Run-time optimization has very negative consequences for startup times, especially in the context of transaction processing. Lots of hard data on this from IBM (in DB/2) and others. 
It is one of the reasons that &quot;Java in the database&quot; never took hold. As the frequency of process and component instantiation in a system rises, startup delays become more and more of a concern. Robust systems don't recycle subsystems.</li> <li>Run-time optimization adds a <em>huge</em> amount of space overhead to the run-time environment of the application. While the <em>code</em> of the run-time compiler can be shared, the <em>state</em> of the run-time compiler cannot, and there is quite a lot of that state.</li> <li>Run-time optimization - especially when it is done &quot;on demand&quot; - introduces both variance and unpredictability into performance numbers. For some of the applications that are of interest to me, I need &quot;steady state&quot; performance. If the code is getting optimized on the fly such that it improves by even a modest constant factor, real-time scheduling starts to be a very puzzling challenge.</li> <li>Code that is <em>produced</em> by a run-time optimizer is difficult to share across address spaces, though this probably isn't solved very well by <em>other</em> compilation approaches either.</li> <li>If run-time optimization is present, applications will come to rely on it for performance. That is: for social reasons, &quot;optional&quot; run-time optimization tends to quickly become required.</li> </ul> <p>To be clear, I'm <em>not</em> opposed to continuous compilation. I actually think it's a good idea, and I think that there are some fairly compelling use-cases. I <em>do</em> think that the run-time optimizer should be implemented in a strongly typed, safe language. I <em>also</em> think that it took an awfully long time for the hotspot technology to stabilize, and that needs to be taken as a cautionary tale. It's also likely that many of the problems/concerns that I have enumerated can be solved - but probably not <em>soon</em>. For the applications that are most important to me, the concerns about assurance are primary. So from a language design standpoint, I'm delighted to exploit continuous compilation, but I don't want to design a language that <em>requires</em> continuous compilation in order to achieve reasonable baseline performance.</p> <p>The optimizer complexity issue, of course, can be raised just as seriously for conventional compilers. You are going to optimize <em>somewhere</em>. But my experience with dynamic translation tells me that it's a lot easier to do (and to reason about) one thing at a time. Once we have a high-confidence optimizer in a safe language, <em>then</em> it may make sense to talk about integrating it into the run-time in a high-confidence system. Until then, separation of concerns should be the watch-word of the day.</p> <p>Now strictly speaking, it should be said that run-time compilation actually isn't necessary for BitC, or for any other bytecode language. Run-time compilation doesn't become necessary until you combine run-time loading with compiler-abstracted representations (see below) and allow types having abstracted representation to appear in the signatures of run-time loaded libraries. Until then it is possible to maintain a proper phase separation between code generation and execution. Read on - I'll explain some of that below.</p> <p>In any case, I knew going in that strongly abstracted types would raise concerns on this issue, and I initially adopted the following view:</p> <ul> <li>Things like kernels can be whole-program compiled. 
This effectively eliminates the run-time optimizer requirement.</li> <li>Things like critical system components want to be statically linked anyway, so they can <em>also</em> be dealt with as whole-program compilation problems.</li> <li>For everything else, I hoped to adopt a kind of &quot;template expansion&quot; approach to run-time compilation. This wouldn't undertake the full complexity of an optimizer; it would merely extend run-time linking and loading to incorporate span and offset resolution. It's still a lot of code, but it's not horribly <em>complex</em> code, and it's the kind of thing that lends itself to rigorous - or even formal - specification.</li> </ul> <p>It took several years for me to realize that the template expansion idea wasn't going to produce acceptable baseline performance. The problem lies in the interaction between abstract types, operator overloading, and inlining.</p> <h3 id="compiler-abstracted-representations-vs-optimization">Compiler-Abstracted Representations vs. Optimization</h3> <p>Types have representations. This sometimes seems to make certain members of the PL community a bit uncomfortable. A thing to be held at arms length. Very much like a zip-lock bag full of dog poo (insert cartoon here). From the perspective of a systems person, I regret to report that where the bits are placed, how big they are, and their assemblage actually <em>does</em> matter. If you happen to be a dog owner, you'll note that the &quot;bits as dog poo&quot; analogy is holding up well here. It seems to be the lot of us systems people to wade daily through the plumbing of computational systems, so perhaps that shouldn't be a surprise. Ahem.</p> <p>In any case, the PL community set representation issues aside in order to study type issues first. I don't think that pragmatics was forgotten, but I think it's fair to say that representation issues are not a focus in current, mainstream PL research. There is even a school of thought that views representation as a fairly yucky matter that should be handled in the compiler &quot;by magic&quot;, and that imperative operations should be handled that way too. For systems code that approach doesn't work, because a lot of the representations and layouts we need to deal with are dictated to us by the hardware.</p> <p>In any case, types <em>do</em> have representations, and knowledge of those representations is utterly essential for even the simplest compiler optimizations. So we need to be a bit careful not to abstract types <em>too</em> successfully, lest we manage to break the compilation model.</p> <p>In C, the &quot;+&quot; operator is primitive, and the compiler can always select the appropriate opcode directly. Similarly for other &quot;core&quot; arithmetic operations. Now try a thought experiment: suppose we take every use of such core operations in a program and replace each one with a functionally equivalent procedure call to a runtime-implemented intrinsic. You only have to do this for <em>user</em> operations - addition introduced by the compiler to perform things like address arithmetic is always done on concrete types, so those can still be generated efficiently. But even though it is only done for user operations, this would clearly harm the performance of the program quite a lot. You <em>can</em> recover that performance with a run-time optimizer, but it's complicated.</p> <p>In C++, the &quot;+&quot; operator can be overloaded. 
But (1) the bindings for primitive types cannot be replaced, (2) we know, statically, what the bindings and representations <em>are</em> for the other types, and (3) we can control, by means of inlining, which of those operations entail a procedure call at run time. I'm not trying to suggest that we want to be forced to control that manually. The key point is that the compiler has enough visibility into the implementation of the operation that it is possible to inline the primitive operators (and many others) at static compile time.</p> <p>Why is this possible in C++, but not in BitC?</p> <p>In C++, the instantiation of an abstract type (a template) occurs in an environment where complete knowledge of the representations involved is visible to the compiler. That information may not all be in scope to the programmer, but the compiler can chase across the scopes, find all of the pieces, assemble them together, and understand their shapes. This is what induces the &quot;explicit instantiation&quot; model of C++. It also causes a lot of &quot;internal&quot; type declarations and implementation code to migrate into header files, which tends to constrain the use of templates and increase the number of header file lines processed for each compilation unit - we measured this at one point on a very early (pre templates) C++ product and found that we processed more than 150 header lines for each &quot;source&quot; line. The ratio has grown since then by at least a factor of ten, and (because of templates) quite likely 20.</p> <p>It's all rather a pain in the ass, but it's what makes static-compile-time template expansion possible. From the <em>compiler</em> perspective, the types involved (and more importantly, the representations) aren't abstracted at all. In BitC, <em>both</em> of these things <em>are</em> abstracted at static compile time. It isn't until link time that all of the representations are in hand.</p> <p>Now as I said above, we can imagine extending the linkage model to deal with this. All of that header file information is supplied to deal with * representation* issues, not type checking. Representation, in the end, comes down to sizes, alignments, and offsets. Even if we don't know the concrete values, we <em>do</em> know that all of those are compile-time constants, and that the results we need to compute at compile time are entirely formed by sums and multiples of these constants. We could imagine dealing with these as <em>opaque</em> constants at static compile time, and filling in the blanks at link time. Which is more or less what I had in mind by link-time template expansion. Conceptually: leave all the offsets and sizes &quot;blank&quot;, and rely on the linker to fill them in, much in the way that it handles relocation.</p> <p>The problem with this approach is that it removes key information that is needed for optimization and registerization, and it doesn't support inlining. In BitC, we can <em>and do</em> extend this kind of instantiation all the way down to the primitive operators! And perhaps more importantly, to primitive accessors and mutators. The reason is that we want to be able to write expressions like &quot;a + b&quot; and say &quot;that expression is well-typed provided there is an appropriate resolution for +:('a,'a)-&gt;'a&quot;. Which is a fine way to <em>type</em> the operation, but it leaves the representation of 'a fully abstracted. Which means that we cannot see when they are primitive types. 
Which means that we are <em>exactly</em> (or all too often, in any case) left in the position of generating <em>all</em> user-originated &quot;+&quot; operations as procedure calls. Now surprisingly, that's actually not the end of the world. We can imagine inventing some form of &quot;high-level assembler&quot; that our static code generator knows how to translate into machine code. If the static code generator does this, the run-time loader can be handed responsibility for emitting procedure calls, and can substitute intrinsic calls at appropriate points. Which would cause us to lose code sharing, but that might be tolerable on non-embedded targets.</p> <p>Unfortunately, this kind of high-level assembler has some fairly nasty implications for optimization: First, we no longer have any idea what the * cost* of the &quot;+&quot; operator is for optimization purposes. We don't know how many cycles that particular use of + will take, but more importantly, we don't know how many bytes of code it will emit. And without that information there is a very long list of optimization decisions that we can no longer make at static compile time. Second, we no longer have enough information at static code generation time to perform a long list of <em>basic</em> register and storage optimizations, because we don't know which procedure calls are actually going to use registers.</p> <p>That creaking and groaning noise that you are hearing is the run-time code generator gaining weight and losing reliability as it grows. While the impact of this mechanism actually wouldn't be as bad as I am sketching - because a lot of user types <em>aren't</em> abstract - the <em>complexity</em> of the mechanism really is as bad as I am proposing. In effect we end up deferring code generation and optimization to link time. That's an idea that goes back (at least) to David Wall's work on link time register optimization in the mid-1980s. It's been explored in many variants since then. It's a compelling idea, but it has pros and cons.</p> <p>What is going on here is that types in BitC are <em>too</em> successfully abstracted for static compilation. The result is a rather <em>large</em> bag of poo, so perhaps the PL people are on to something.:-)</p> <h3 id="two-solutions">Two Solutions</h3> <ul> <li>The most obvious solution - adopted by C++ - is to redesign the language so that representation issues are not hidden from the compiler. That's actually a solution that is worth considering. The problem in C++ isn't so much the number of header file lines per source line as it is the fact that the C preprocessor requires us to process those lines <em>de novo</em> for each compilation unit. BitC lacks (intentionally) anything comparable to the C preprocessor.</li> <li>The other possibility is to shift to what might be labeled &quot;install time compilation&quot;. Ship some form of byte code, and do a static compilation at install time. This gets you back all of the code sharing and optimization that you might reasonably have expected from the classical compilation approach, it opens up some interesting design point options from a systems perspective, and (with care) it can be retrofitted to existing systems. There are platforms today (notably cell phones) where we basically do this already.</li> </ul> <p>The design point that you don't want to cross here is dynamic <em>loading</em> where the loaded interface carries a type with an abstracted representation. 
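<p>C++ runs into a milder version of the same wall: a template shipped only as compiled code cannot be instantiated at types its author never saw, because the template body and the client's representations are never visible in the same place. A small sketch of the distinction (illustrative C++, not BitC code):</p> <pre><code>// If the template body is shipped as source (e.g. in a header), any client
// can instantiate sum&lt;T&gt; at its own types: the compiler sees T's
// representation and can resolve and inline &quot;+&quot; statically.
template &lt;typename T&gt;
T sum(T a, T b) { return a + b; }

// If only a compiled library is shipped, the usable instantiations are
// frozen at whatever the author generated explicitly:
template int sum&lt;int&gt;(int, int);
template double sum&lt;double&gt;(double, double);

// A client holding just the binary can call those two, but sum over a
// client-defined type cannot be generated by anyone: the client lacks the
// body, the library lacks the representation.
</code></pre> <p>Dynamic loading of an interface that carries an abstracted type is that second situation, with no header to fall back on.</p>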
At that point you are effectively committing yourself to run-time code generation<em>,</em> though I do have some ideas on how to mitigate that.</p> <h3 id="conclusion-concerning-compilation-model">Conclusion Concerning Compilation Model</h3> <p><strong>If static, separate compilation is a requirement, it becomes necessary for the compiler to see into the source code across module boundaries whenever an abstract type is used. That is: any procedure having abstract type must have an exposed source-level implementation.</strong></p> <p><strong>The practical alternative is a high-level intermediate form coupled with install-time or run-time code generation. That is certainly feasible, but it's more that I felt I could undertake.</strong></p> <p>That's all manageable and doable. Unfortunately, it isn't the path we had taken on, so it basically meant starting over.</p> <h3 id="insufficiency-of-the-type-system">Insufficiency of the Type System</h3> <p>At a certain point we had enough of BitC working to start building library code. It may not surprise you that the first thing we set out to do in the library was IO. We found that we couldn't handle typed input within the type system. Why not?</p> <p>Even if you are prepared to do dynamic allocation within the IO library, there is a level of abstraction at which you need to implement an operation that amounts to &quot;inputStream.read(someObject: ByRef mutable 'a)&quot; There are a couple of variations on this, but the point is that you want the ability at some point to move the incoming bytes into previously allocated storage. So far so good.</p> <p>Unfortunately, in an effort to limit creeping featurism in the type system, I had declared (unwisely, as it turned out) that the only place we needed to deal with ByRef types was at parameters. Swaroop took this statement a bit more literally than I intended. He noticed that if this is <em>really</em> the only place where ByRef needs to be handled, then you can internally treat &quot;ByRef 'a&quot; as 'a, merely keeping a marker on the parameter's identifier record to indicate that an extra dereference is required at code generation time. Which is actually quite clever, except that it doesn't extend well to signature matching between type classes and their instances. Since the argument type for <em>read</em> is <em>ByRef 'a</em>, InputStream is such a type class.</p> <p>So now we were faced with a couple of issues. The first was that we needed to make ByRef 'a a first-class type within the compiler so that we could unify it, and the second was that we needed to deal with the implicit coercion issues that this would entail. That is: conversion back and forth between ByRef 'a and 'a at copy boundaries. The coercion part wasn't so bad; ByRef is never inferred, and the type coercions associated with ByRef happen in exactly the same places that const/mutable coercions happen. We already had a cleanly isolated place in the type checker to deal with that.</p> <p>But even if ByRef isn't inferred, it can propagate through the code by unification. And <em>that</em> causes safety violations! The fact that ByRef was syntactically restricted to appear only at parameters had the (intentional) consequence of ensuring that safety restrictions associated with the lifespan of references into the stack were honored - that was why I had originally imposed the restriction that ByRef could appear only at parameters. 
Once the ByRef type can unify, the syntactic restriction no longer guarantees the enforcement of the lifespan restriction. To see why, consider what happens in:</p> <pre><code> define byrefID(x:ByRef 'a) { return x; } </code></pre> <p>Something that is <em>supposed</em> to be a downward-only reference ends up getting returned up the stack. Swaroop's solution was clever, in part, because it silently prevented this propagation problem. In some sense, his implementation doesn't really treat ByRef as a type, so it can't propagate. But *because *he didn't treat it as a type, we also couldn't do the necessary matching check between instances and type classes.</p> <p>It turns out that being able to do this is <em>useful</em>. The essential requirement of an abstract mutable &quot;property&quot; (in the C# sense) is that we have the ability within the language to construct a function that returns the <em>location</em> of the thing to be mutated. That location will often be on the stack, so returning the location is <em>exactly</em> like the example above. The &quot;ByRef only at parameters&quot; restriction is actually very conservative, and we knew that it was preventing certain kinds of things that we eventually wanted to do. We had a vague notion that we would come back and fix that at a later time by introducing region types.</p> <p>As it turned out, &quot;later&quot; had to be &quot;now&quot;, because region types are the right way to re-instate lifetime safety when ByRef types become first class. But <em>adding</em> region types presented two problems (which is why we had hoped to defer them):</p> <ul> <li>Adding region types meant rewriting the type checker and re-verifying the soundness and completeness of the inference algorithm, <em>and</em></li> <li>It wasn't just a re-write. Regions introduce subtyping. Subtyping and polymorphism don't get along, so we would need to go back and do a lot of study.</li> </ul> <p>Region polymorphism with region subtyping had certainly been done before, but we were looking at subtyping in another case too (below). That was pushing us toward a kinding system and a different type system.</p> <p>So to fix the ByRef problem, we very nearly needed to re-design both the type system and the compiler from scratch. Given the accumulation of cruft in the compiler, that might have been a good thing in any case, but Swaroop was now full-time at Microsoft, and I didn't have the time or the resources to tackle this by myself.</p> <h3 id="conclusion-concerning-the-type-system">Conclusion Concerning the Type System</h3> <p><strong>In retrospect, it's hard to imagine a strongly typed imperative language that doesn't type locations in a first-class way. If the language simultaneously supports explicit unboxing, it is effectively forced to deal with location lifespan and escape issues, which makes memory region typing of some form almost unavoidable.</strong></p> <p><strong>For this reason alone, even if for no other, the type system of an imperative language with unboxing must incorporate some form of subtyping. To ensure termination, this places some constraints on the use of type inference. On the bright side, once you introduce subtyping you are able to do quite a number of useful things in the language that are hard to do without it.</strong></p> <h3 id="inheritance-and-encapsulation">Inheritance and Encapsulation</h3> <p>Our first run-in with inheritance actually showed up in the compiler itself. 
In spite of our best efforts, the C++ implementation of the BitC compiler had not entirely avoided inheritance, so it didn't have a direct translation into BitC. And even if we changed the code of the compiler, there are a large number of third-party libraries that we would like to be able to transcode. A good many of those rely on [single] inheritance. Without having at least some form of interface (type) inheritance, we can't really even do a good job interfacing to those libraries as foreign objects.</p> <p>The compiler aside, we also needed a mechanism for encapsulation. I had been playing with &quot;capsules&quot;, but it soon became clear that capsules were really a degenerate form of subclassing, and that trying to duck that issue wasn't going to get me anywhere.</p> <p>I could nearly imagine getting what I needed by adding &quot;ThisType&quot; and inherited <em>interfaces</em>. But the combination of those two features introduces subtyping. In fact, the combination is equivalent (from a type system perspective) to single-inheritance subclassing.</p> <p>And the more I stared at interfaces, the more I started to ask myself why an interface wasn't just a type class. <em>That</em> brought me up against the instance coherence problem from a new direction, which was already making my head hurt. It also brought me to the realization that Interfaces work, in part, because they are always parameterized over a single type (the ThisType) - once you know that one, the bindings for all of the others are determined by type constructors or by explicit specification.</p> <p>And introducing SelfType was an even bigger issue than introducing subtypes. It means moving out of System F&lt;: entirely, and into the object type system of Cardelli <em>et al</em>. That wasn't just a matter of re-implementing the type checker to support a variant of the type system we already had. It meant re-formalizing the type system entirely, and learning how to think in a different model.</p> <p>Doable, but not within the time frame or the compiler framework that we had built. At this point, I decided that I needed to start over. We had learned a lot from the various parts of the BitC effort, but sometimes you have to take a step back before you can take more steps forward.</p> <h3 id="instance-coherence-and-operator-overloading">Instance Coherence and Operator Overloading</h3> <p>BitC largely borrows its type classes from Haskell. Type classes aren't just a basis for type qualifiers; they provide the mechanism for <em>ad hoc</em> polymorphism. A feature which, language purists notwithstanding, real languages actually do need.</p> <p>The problem is that there can be multiple type class instances for a given type class at a given type. So it is possible to end up with a function like:</p> <pre><code>define f(x : 'x) {
  ...
  a:int32 + b    // typing fully resolved at static compile time
  return x + x   // typing not resolvable until instantiation
}
</code></pre> <p>Problem: we don't know which instance of &quot;+&quot; to use when 'x instantiates to <em>int32</em>. In order for &quot;+&quot; to be meaningful in a+b, we need a static-compile-time resolution for +:(int32, int32)-&gt;int32. And we get that from Arith(int32). So far so good. But if 'x is instantiated to <em>int32</em>, we will get a type class instance supplied by the caller. 
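<p>To see the hazard concretely, imagine the instances written out as the dictionaries a type-class compiler passes around implicitly (an illustrative C++-flavored sketch with invented names, not BitC): one dictionary is resolved where f is compiled, the other is whatever the caller hands in, and nothing forces the two to agree.</p> <pre><code>// Illustrative only: Arith(int32) instances modeled as explicit dictionaries.
#include &lt;cstdio&gt;
#include &lt;limits&gt;

struct ArithInt32 { int (*add)(int, int); };

int add_plain(int a, int b) { return a + b; }
int add_wrapping(int a, int b) { return (int)((unsigned)a + (unsigned)b); }

// Two reasonable-looking instances for the same type.
const ArithInt32 arith_a = { add_plain };
const ArithInt32 arith_b = { add_wrapping };

int f(int x, const ArithInt32&amp; inst) {
    int a = 1, b = 2;
    int here  = arith_a.add(a, b);  // &quot;a + b&quot;: resolved where f was compiled
    int there = inst.add(x, x);     // &quot;x + x&quot;: resolved by f's caller
    return here + there;
}

int main() {
    std::printf(&quot;%d\n&quot;, f(20, arith_a));                              // both resolutions agree: 43
    std::printf(&quot;%d\n&quot;, f(std::numeric_limits&lt;int&gt;::max(), arith_b)); // they disagree: 1
}
</code></pre>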
The problem is that there is no way to guarantee that this is the <em>same</em> instance of Arith(int32) that we saw before.</p> <p>The solution in Haskell is to impose the <em>ad hoc</em> rule that you can only instantiate a type class once for each unique type tuple in a given application. This is similar to what is done in C++: you can only have one overload of a given global operator at a particular type. If there is more than one overload at that type, you get a link-time failure. This restriction is tolerable in C++ largely because operator overloading is so limited:</p> <ol> <li>The set of overloadable operators is small and non-extensible.</li> <li>Most of them can be handled satisfactorily as methods, which makes their resolution unambiguous.</li> <li>Most of the ones that <em>can't</em> be handled as methods are arithmetic operations, and there are practical limits to how much people want to extend those.</li> <li>The remaining highly overloaded global operators are associated with I/O. These <em>could</em> be methods in a suitably polymorphic language.</li> </ol> <p>In languages (like BitC) that enable richer use of operator overloading, it seems unlikely that these properties would suffice.</p> <p>But in Haskell and BitC, overloading is extended to <em>type properties</em> as well. For example, there is a type class &quot;Ord 'a&quot;, which states whether a type 'a admits an ordering. Problem: most types that admit ordering admit more than one! The fact that we know an ordering <em>exists</em> really isn't enough to tell us which ordering to <em>use</em>. And we can't introduce <em>two</em> orderings for 'a in Haskell or BitC without creating an instance coherence problem. And in the end, the instance coherence problem exists because the language design performs method resolution in what amounts to a non-scoped way.</p> <p>But if nothing else, you can hopefully see that the heavier use of overloading in BitC and Haskell places much higher pressure on the &quot;single instance&quot; rule. Enough so, in my opinion, to make that rule untenable. And coming from the capability world, I have a strong allergy to things that smell like ambient authority.</p> <p>Now we can get past this issue, up to a point, by imposing an arbitrary restriction on where (which compilation unit) an instance can legally be defined. But as with the &quot;excessively abstract types&quot; issue, we seemed to keep tripping on type class issues. There are other problems as well when multi-variable type classes get into the picture.</p> <p>At the end of the day, type classes just don't seem to work out very well as a mechanism for overload resolution without some other form of support.</p> <p>A second problem with type classes is that you can't resolve operators at static compile time. And if instances are explicitly named, references to instances have a way of turning into first-class values. At that point the operator reference can no longer be statically resolved at all, and we have effectively re-invented operator methods!</p> <h3 id="conclusion-about-type-classes-and-overloading">Conclusion about Type Classes and Overloading:</h3> <p><strong>The type class notion (more precisely: qualified types) is seductive, but absent a reasonable approach for instance coherence and lexical resolution it provides an unsatisfactory basis for operator overloading. There is a disturbingly close relationship between type class instances and object instances that needs further exploration by the PL community. 
The important distinction may be pragmatic rather than conceptual: type class instances are compile-time constants while object instances are run-time values. This has no major consequences for typing, but it leads to significant differences w.r.t. naming, binding, and [human] conceptualization.</strong></p> <p><strong>There are unresolved formal issues that remain with multi-parameter type classes. Many of these appear to have natural practical solutions in a polymorphic object type system, but concerns of implementation motivate kinding distinctions between boxed and unboxed types that are fairly unsatisfactory.</strong></p> <h3 id="wrapping-up">Wrapping Up</h3> <p>The current outcome is <em>extremely</em> frustrating. While the blind spots here were real, we were driven by the requirements of the academic research community to spend nearly three years finding a way to do complete inference over mutability. That was an enormous effort, and it delayed our recognition that we were sitting on the wrong kind of underlying type system entirely. While I continue to think that there is some value in mutability inference, I think it's a shame that a fairly insignificant wart in the original inference mechanism managed to prevent larger-scale success in the overall project for what amount to <em>political</em> reasons. If not for that distraction, I think we would probably have learned enough about the I/O and the instance coherency issues to have moved to a different type system while we still had a group to do it with, and we would have a working and useful language today.</p> <p>The distractions of academia aside, it is fair to ask why we weren't building small &quot;concept test&quot; programs as a sanity check of our design. There are a number of answers, none very satisfactory:</p> <ul> <li>Research languages can adopt simplifications on primitive types (notably integers) that systems languages cannot. That's what pushed us into type classes in the first place; we knew that polymorphism over unboxed types hadn't seen a lot of attention in the literature, and we knew that mutability inference had never been done. We had limited manpower, so we chose to focus on those issues first.</li> <li>We knew that parametric polymorphism and subtyping didn't get along, so we wanted to avoid that combination. Unfortunately, we avoided subtypes too well for too long, and they turned out to be something unavoidable.</li> <li>For the first several years, we were very concerned with software verification, which <em>also</em> drove us strongly away from object-based languages and subtyping. That blinded us.</li> <li>Coming to language design as &quot;systems&quot; people, working in a department that lacked deep expertise and interest in type systems, there was an enormous amount of subject matter that we needed to learn. Some of the reasons for our failure are &quot;obvious&quot; to people in the PL community, but others are not. Our desire for a &quot;systems&quot; language drove us to explore the space in a different way and with different priorities than are common in the PL community.</li> </ul> <p>I think we did make some interesting contributions. We now know how to do (that is: to implement) polymorphism over unboxed types with significant code sharing, and we understand how to deal with inferred mutability. Both of those are going to be very useful down the road. 
We have also learned a great deal about advanced type systems.</p> <p>In any case, BitC in its current form clearly needs to be set aside and re-worked. I have a fairly clear notion about how I would approach <em>continuing</em> this work, but that's going to have to wait until someone is willing to pay for all this.</p> Are closed social networks inevitable? open-social-networks/ Fri, 01 Jan 2010 00:00:00 +0000 open-social-networks/ <p><em>This is an archive of an old Google Buzz conversation (circa 2010?) on a variety of topics, including whether or not it's inevitable that a closed platform will dominate social.</em></p> <p><strong>Piaw</strong>: Social networks will be dominated primarily by network effects. That means in the long run, Facebook will dominate all the others.</p> <p><strong>Rebecca</strong>: ... which is also why no one company should dominate it. &quot;The Social Graph&quot; and its associated apps should be like the internet, distributed and not confined to one company's servers. I wish the narrative surrounding this battle was centered around this idea, and not around the whole Silicon Valley &quot;who is the most genius innovator&quot; self-aggrandizing unreality field. Thank god Tim Berners Lee wasn't from Silicon Valley, or we wouldn't have the Internet the way we know it in the first place.</p> <p>I suppose I shouldn't be being so snarky, revealing how much I hate your narratives sometimes. But I think for once this isn't, as it usually is, merely harmlessly cute and endearing - you all collectively screwing up something actually important, and I'm annoyed.</p> <p><strong>Piaw</strong>: The way network effects work, one company will control it. It's inevitable.</p> <p><strong>Rebecca</strong>: No it is not inevitable! What is inevitable is either that one company controls it or that no company controls it. If you guys had been writing the narrative of the invention of the internet you would have been arguing that it was inevitable that the entire internet live on one companies servers, brokered by hidden proprietary protocols. And obviously that's just nuts.</p> <p><strong>Piaw</strong>: I see, the social graph would be collectively owned. That makes sense, but I don't see why Facebook would have an incentive to make that happen.</p> <p><strong>Rebecca</strong>: Of course not! That's why I'm biting my fingernails hoping for some other company to be the white knight and ride up and save the day, liberating the social graph (or more precisely, the APIs of the apps that live on top of them) from any hope of control by a single company. Of course, there isn't a huge incentive for any other company to do it either --- the other companies are likely to just gaze enviously at Facebook and wish they got there first. Tim Berners Lee may have done great stuff for the world, but he didn't get rich or return massive value to shareholders, so the narrative of the value he created isn't included in the standard corporate hype machine or incentives.</p> <p>Google is the only company with the right position, somewhat appropriate incentives, and possibly the right internal culture to be the &quot;Tim Berners Lee&quot; of the new social internet. 
That's what I was hoping for, and I'm am more than a bit bummed they don't seem to be stepping up to the plate in an effective way in this case, especially since they are doing such a fabulous job in an analogous role with Android.</p> <p><strong>Rebecca</strong>: There is a worldview lurking behind these comments, which perhaps I should try to explain. I'm been nervous about this because it contains some strange ideas, but I'm wondering what you think.</p> <p>Here's a very strange assertion: Mark Zuckerberg is not a capitalist, and therefore should not be judged by capitalist logic. Before you dismiss me as nuts, stop and think for a minute. What is the essential property that makes someone a capitalist?</p> <p>For instance, when Nike goes to Indonesia and sets up sweatshops there, and if communists, unhappy with low pay &amp; terrible conditions, threaten to rebel, they are told &quot;this is capitalism, and however noxious it is, capitalism will make us rich, so shut up and hold your peace!&quot; What are they really saying? Nike brings poor people sewing machines and distribution networks so they can make and sell things they could not make and sell otherwise, so they become more productive and therefore richer. The productive capacity is scarce, and Nike is bringing to Indonesia a piece of this scarce resource (and taking it away from other people, like American workers.) So Indonesia gets richer, even if sweatshop workers suffer for a while.</p> <p>So is Mark Zuckerberg bringing to American workers a piece of scarce productive capacity, and therefore making American workers richer? It is true he is creating for people productive capacity they did not have before --- the possibility of writing social apps, like social games. This is &quot;innovation&quot; and it does make us richer.</p> <p>But it is not wealth that follows the rules of capitalist logic. In particular, this kind of wealth of productive capacity, unlike the wealth created by setting up sewing machines, does not have the kind of inherent scarcity that fundamentally drives capitalist logic. Nike can set up its sewing machines for Americans, or Indonesians, but not for everyone at once. But Tim Berners Lee is not forced to make such choices -- he can design protocols that allow everyone everywhere to produce new things, and he need not restrict how they choose to do it.</p> <p>But -- here's the key point -- though there is no natural scarcity, there may well be &quot;artificial&quot; scarcity. Microsoft can obfuscate Windows API's, and bind IE to Windows. Facebook can close the social graph, and force all apps to live on its servers. &quot;Capitalists&quot; like these can then extract rents from this artificial scarcity. They can use the emotional appeal of capitalist rhetoric to justify their rent-seeking. People are very conditioned to believe that when local companies get rich American workers in general will also get rich -- it works for Indonesia so why won't it work for us? And Facebook and Microsoft employees are getting richer. QED.</p> <p>But they aren't getting richer in the same way that sweatshop employees are getting richer. The sweatshop employees are becoming more productive than they would otherwise be, in spite of the noxious behavior of the capitalists. 
But if Zuckerberg or Gates behaves noxiously, by creating a walled garden, this may make his employees richer, in the sense of giving them more money, but &quot;more money&quot; is not the same as &quot;more wealth.&quot; More wealth means more productive capacity for workers, not more payout to individual employees. In a manufacturing economy those are linked, so people forget they are not the same.</p> <p>And in fact, shenanigans like these reduce rather than increase the productive capacity available to everyone, by creating an artificial scarcity of a kind of productive tool that need not be scarce at all, just for the purpose of extracting rents from them. No real wealth comes from this extraction. In aggregate it makes us poorer rather than richer.</p> <p>Here's where the kind of stunt that Google pulled with Android, that broke the iPhone's lock, even if it made Google no money, should be seen as the real generator of wealth, even if it is unclear whether it made any money for Google's shareholders. Wealth means I can build something I couldn't build before -- if I want I can install a Scheme or Haskell interpreter on an Android phone, which I am forbidden to put on the iPhone. It means a lot to me! Google's support of Firefox &amp; Chrome, which sped the adoption of open web standards and HTML5, also meant a huge amount to me. I'm an American worker, and I am made richer by Google, in the sense of having more productive capacity available to me, even if Google wasn't that great for my personal wealth in the sense of directly increasing my salary.</p> <p><strong>Rebecca</strong>: (That idea turned out to be sortof shockingly long comment by itself, and on review the last two paragraphs of the original comment were a slightly different thought, so I'm breaking them into a different section.)</p> <p>I'm upset that Google is getting a lot of anti-trust type flak, when I think the framework of anti-trust is just the wrong way to think. This battle isn't analogous to Roosevelt's big trust busting battles; it is much more like the much earlier battles at the beginning of the industrial revolution of the Yankee merchants against the old agricultural, aristocratic interests, which would have squelched industrialization. And Google is the company that has been most consistently on the side of really creating wealth, by not artificially limiting the productivity they make available for developers everywhere. Other companies, like Microsoft or Facebook, though they are &quot;new economy,&quot; though they are &quot;innovative,&quot; though they seem to generate a lot of &quot;wealth&quot; in the form of lots of money, really are much more like the old aristocrats rather than the scrappy new Yankees. In many ways they are slowing down the real revolution, not speeding it up.</p> <p>I've been reluctant to talk too much about these ideas, because I'm anxious about being called a raving commie. But I'm upset that Google is the target of misguided anti-trust logic, and it might be sufficiently weakened that it can't continue to be the bulwark of defense against the real &quot;new economy&quot; abuses that it has been for the last half-decade. That defense has meant a lot to independent developers, and I would hate to see it go away.</p> <p><strong>Phil</strong>: +100, Rebecca. 
It is striking how little traction the rhetoric of software freedom has here in Silicon Valley relative to pretty much everywhere else in the world.</p> <p><strong>Rebecca</strong>: Thanks - I worry whether my ultra-long comments are spam and it's good to hear if someone appreciates them. I have difficulty making my ideas short, but I'm using this Buzz conversation to practice.</p> <p>I'm not entirely happy with the way the &quot;software freedom&quot; crowd is pitching their message. I had an office down the hall from Richard Stallman for a while, and I was often harangued by him. However, I thought his message was too narrow and radicalized. But on the other hand, when I thought about it hard, I also realized that in many ways it was not radical enough...</p> <p>Why are we talking about freedom? To motivate this, I sometimes tell a little story about the past. When I was young my father read to me &quot;20,000 Leagues Under the Sea,&quot; advertising it as a great classic of futuristic science fiction. Unfortunately, I was unimpressed. It didn't seem &quot;futuristic&quot; at all: it seemed like an archaic fantasy. Why? Certainly it was impressive that an author in 1869 correctly predicted that people would ride in submarines under the sea. But it didn't seem like an image of the future, or even the past, at all. Why not? Because the person riding around on the submarine under the sea was a Victorian gentleman surrounded by appropriately deferential Victorian servants.</p> <p>Futurists consistently get their stories wrong in a particular way: when they say that technology changes the world, they tell stories of fabulous gadgets that will enable people to do new and exciting things. They completely miss that this is not really what &quot;change&quot; -- serious, massive, wrenching social change -- really is. When technology truly enables dreams of change, it doesn't mean it enables aristocrats to dream about riding around under the sea. What it means is that it enables the aristocrat's butler to dream of not being a butler any more --- a dream of freedom not through violence or revolution, but through economic independence. A dream of technological change -- really significant technological change -- is not a dream of spiffy gadgets, it is a dream of freedom, of social &amp; economic liberation enabled by technology.</p> <p>Let's go back to our Indonesian sweatshop worker. Even though in many ways the sweatshop job liberates her --- from backbreaking work on a farm, a garbage dump, or in brothels -- she is also quite enslaved. Why? She can sew, let us say, high-end basketball sneakers, which Nike sells for $150 apiece -- many, many times her monthly or even yearly wage. Why is she getting only a small cut of the profit from her labors? Because she is dependent on the productive capacity that Nike is providing to her, so as bad as the deal is, it is the best she can get.</p> <p>This is where new technology comes in. People talk about the information revolution as if it is about computers, or software, but I would really say it is about society figuring out (slowly) how to automate organization. We have learned to effectively automate manufacturing, but not all of the work of a modern economy is manufacturing. What is the service Nike provides that makes this woman dependent on such a painful deal? Some part of this service is the manufacturing capacity they provide -- the sewing machine -- but sewing machines are hardly expensive or unobtainable, even for poor people.
The much bigger deal is the organizational services Nike offers: all the branding, logistics, supply-chain management and retail services that go into getting a sneaker sewn in Indonesia into the hands of an eager consumer in America. One might argue that Nike is performing these services inefficiently, so even if our seamstress is effective and efficient, Nike must take an unreasonably large cut of the profits from the sale of the sneaker to support the rest of this inefficient, expensive, completely un-automated effort.</p> <p>That's where technological change comes in. Slowly, it is making it possible for all these organizational services to be made more automated, streamlined and efficient. This is really the business Google is in. It is said that Google is an &quot;advertising&quot; business, but to call what Google does &quot;advertising&quot; is to paper over the true story of the profound economic shift of which they are merely providing the opening volley.</p> <p>Consider the maker of custom conference tables who recently blogged in the New York Times about Adwords (<a href="http://boss.blogs.nytimes.com/2010/12/03/adwords-and-me-exploring-the-mystery/">http://boss.blogs.nytimes.com/2010/12/03/adwords-and-me-exploring-the-mystery/</a>). He said he paid Google $75,124.77 last year. What does that money represent -- what need is Google filling which is worth more than seventy thousand a year to this guy? You might say that they are capturing an advertising budget of a company, until you consider that without Google this company wouldn't exist at all. Before Google, did you regularly stumble across small businesses making custom conference tables? This is a new phenomenon! The right way to see it is that this seventy thousand isn't really a normal advertising budget -- instead, think of it as a chunk of the revenue of the generic conference table manufacturer that this person no longer has to work for. Because Google is providing for him the branding, customer management services, etc, etc that this old company used to be doing much less efficiently and creatively, this blogger has a chance to go into business for himself. He is paying Google seventy thousand a year for this privilege, but this is probably much less than the cut that was skimmed off the profits of his labors by his old management (not to mention issues of control and stifled creativity he is escaping). Google isn't selling &quot;advertising&quot;: Google is selling freedom. Google is selling to the workers of the world the chance to rid themselves of their chains -- nonviolently and without any revolutionary rhetoric -- but even without the rhetoric this service is still about economic liberation and social change.</p> <p>I feel strange when I hear Eric Schmidt talk about Google's plans for the future of their advertising business, because he seems to be telling Wall Street of a grand future where Google will capture a significant portion of Nike's advertising budget (with display ads and such). This seems like both an overambitious fantasy; and also strangely not nearly ambitious enough. For I think the real development of Google's business -- not today, not tomorrow, not next year, not even next decade, but eventually and inexorably (assuming Google survives the vicissitudes of fate and cultural decay) -- isn't that Google captures Nike's advertising budget. 
It is that Google captures a significant portion of Nike's entire revenue, paid to them by the workers who no longer have to work for Nike anymore, because Google or a successor in Google's tradition provides them with a much more efficient and flexible alternative vendor for the services Nike's management currently provides.</p> <p><strong>Rebecca</strong>: (Once again I looked at my comment and realized it was even more horrifyingly long. My thoughts seem short in my head, but even when I try to write them down as fast and effectively as I can, they aren't short anymore! Again, I saw the comment had two parts: first, explaining the basic idea of the &quot;freedom&quot; we are talking about, and second, tying it back into the context of our original discussion. So to try to be vaguely reasonable I am cutting it in two.)</p> <p>I suppose Eric Schmidt will never stand in front of Wall Street and say that. When it is really true that &quot;We will bury you!&quot; nobody ever stands up and bangs a shoe on the table while saying it. The architects of the old &quot;new economy&quot; didn't say such things either: the Yankee merchants never said to their aristocratic political rivals that they intended to eventually completely dismantle their social order. In 1780 there was no discussion that foretold the destructive violence of Sherman's march to the sea. I'm not sure they knew it themselves, and if they had been told that that was a possible result of their labors they might not have wanted to hear it. The new class wanted change, they wanted opportunity, they wanted freedom, but they did not want blood! That they would be cornered into seeking it anyway would have been as horrifying to them as to anyone else. Real change is not something anyone wants to crow about --- it is too terrifying.</p> <p>But it is nonetheless important to face, because in the short term this transformation is hardly inevitable or necessarily smooth. If our equivalent of Sherman's march to the sea might be in our future, we might want to think about how to manage or avoid it before it is too late.</p> <p>One major difficulty, as I explained in the last comment, is that while the &quot;automation of information,&quot; if developed properly, has the potential to break fundamental laws of the scarcity of productive capacity, and thereby free &quot;the workers of the world&quot;, nonetheless that potential can be captured, and turned into &quot;artificial&quot; scarcity, which doesn't set workers free, it further enslaves them. There is also a big incentive to do this, because it is the easiest way to make massive amounts of money quickly for a person in the right place at the right time.</p> <p>I see Microsoft as a company that has made a definite choice of corporate strategy to make money on &quot;artificial scarcity.&quot; I see Google as a company that has made a similar definite choice to make money &quot;selling freedom&quot;, specifically avoiding the tricks that create artificial scarcity, even when it doesn't help or even hurts their immediate business prospects.</p> <p>And Facebook? Where is Sheryl Sandberg (apparently the architect of business development at Facebook) on this crucial question? A hundred years from now, when all your &quot;genius&quot; and &quot;innovation,&quot; all the gadgets you build that so delight aristocrats, and are so celebrated by &quot;futurists&quot;, will be all but forgotten, the choices you make on this question will be remembered. This matters.</p> <p>Ms.
Sandberg seems to be similarly clear on her philosophy: she simply wants as much diversity of revenue streams for Facebook as she can possibly get. It is hard to imagine a more un-ideological antithesis of Richard Stallman. Freedom or scarcity, she doesn't care: if it's a way to make money, she wants it. As many different ones as possible! She wants it all! It's hard for me to get too worked up about this, especially since for other reasons I am rooting for Ms. Sandberg's success. Even so, I would prefer it if it were Google in control of this technological advance, because Google's preference on this question is so much more clear and unequivocal.</p> <p>I don't care who is the &quot;genius innovator&quot; and who is the &quot;big loser&quot;, whether this or that company has taken up the mantle of progress or not, who is going to get rich, which company will attract all the superstars, or all the other questions that seem to you such critical matters, but I do care that your work makes progress towards realizing the potential of your technology to empower the workers of the world, rather than slowing it down or blocking it. Since Google has made clear the most unequivocal preference in the right direction on this question, that means I want Google to win. This is too important to be trusted to the control of someone ambivalent about their values, no matter how much basic sympathy I have for the pragmatic attitude.</p> <p><strong>Baris Baser</strong>: +100! Liberate the social graph! I wish I could share the narrative taking place here on my buzz post, but I'll just plug it in.</p> <p><strong>Rob</strong>: Google SO dropped the ball with Orkut - they let Facebook run off with the Crown Jewels.</p> <p><strong>Helder Suzuki</strong>: I believe that Facebook's dominance will eventually be challenged just like credit card companies are being today. But I think it's gonna come much quicker for Facebook.</p> <p>There are lots of differences, but I like this comparison because credit card companies used strong network effects to dominate and shield the market from competition. If you look at them (Visa, Amex, Mastercard), all they have today is basically brand. Today we already know that &quot;credit card&quot; payments (and the margins) will be much different in the near future.</p> <p>Likewise, I don't think that the social graph will protect Facebook's &quot;market&quot; in the long run. Just like it's incredibly easier today to set up a POS network compared to a few years ago, the social graph is gonna be something trivial in the years to come.</p> <p><strong>Rebecca</strong>: Yay! People are reading my obscenely long and intellectual comments. Thanks guys!</p> <p><strong>Piaw</strong>: I disagree with Helder, even though I agree with Rebecca that it's better for Google to own the social graph. The magic trick that Facebook pulled off was getting the typical user to provide and upload all his/her personal information. It's incredibly hard to do that: Amazon couldn't do it, and neither could Google. I don't think it's one of those things that's technically difficult, but the social engineering required to do that demands critical mass. That's why I think that Facebook is (still) under-valued.</p> <p><strong>Rob</strong>: @Piaw - it was an accident of history I think. When Facebook started, they required a student ID to join.
This made a culture of &quot;real names&quot; that stuck, and that no one else has been able to replicate.</p> <p><strong>Piaw</strong>: @Rob: The accident of history that's difficult to replicate is what makes Facebook such a good authentication mechanism. I would be willing to not moderate my blog, for instance, if I could make all commenters disclose their true identity. The lowest quality arguments I've seen on Quora, for instance, were those in which one party was anonymous.</p> <p><strong>Elliotte Rusty Harold</strong>: This is annoying: I want to reshare Rebecca's comments, not the original post, but I can't seem to do that. :-)</p> <p><strong>Rebecca</strong>: In another conversation, someone linked a particular point in a Buzz commentary to Hacker News (<a href="http://news.ycombinator.com/item?id=1416348">http://news.ycombinator.com/item?id=1416348</a>). I'm not sure how they did it. It was a little strange, though, because then people saw it out of context. These comments were tailored for a context.</p> <p>Where do you want to share it? I'm not sure I'm ready to deal with too big an audience; there is a purpose to practicing writing and getting reactions in an obscure corner of the internet. After all, I am saying things that might be offensive or objectionable in Silicon Valley, and are, in any case, awfully forward -- it is useful to me to talk to a select group of my friends to get feedback from them on how well it does or doesn't fly. It's not like I mind being public, but I also don't mind obscurity for now.</p> <p><strong>Rebecca</strong>: Speaking of which, Piaw, I was biting my fingernails a little wondering how you would react to my way of talking about &quot;software freedom.&quot; I've sort of thought of becoming a software freedom advocate in the tradition of Stallman or ESR, but more intellectual, with more historical perspective, and (hopefully) with less of an edge of polemical insanity. However, adding in an intellectual and historical perspective also added in the difficulty of colliding with real intellectuals and historians, which makes the whole thing fraught, so for that reason among others I've been dragging my feet.</p> <p>This discussion made me dredge up this whole project, partly because I really wanted to know your reactions to it. However, you only reacted to the Facebook comments, not the more general software freedom polemic. What did you think about that?</p> <p><strong>Piaw</strong>: I mostly didn't react to the free software polemic because I agree with what you're saying. I agree that something like Adwords and Google makes possible businesses that didn't exist before. Facebook, for instance, recently showed me an ad for a Suzanne Vega concert that I definitely would not have known about but would have wanted to go to if not for a schedule conflict. I want to be able to &quot;like&quot; that ad so that I can get Facebook to show me more ads like those!</p> <p>Do I agree that the world would be a better place if Facebook's social graph were an open system? Yes and No. In the sense of Facebook having less control, I think it would be a good thing. But do I think I want anybody to have access to it? No.
People are already trained to click &quot;OK&quot; to whatever data access any applet wants in Facebook, and I don't need to be inundated with spam in Facebook --- one of the big reasons Facebook has so much less spam is because my friends are more embarrassed about spamming me than the average marketing person, and when they do spam me it's usually with something that I'm interested in, which makes it not spam.</p> <p>But yes, I do wish my Buzz comments (and yours too) all propagated to Facebook/Friendfeed/etc. and the world was one big open community with trusted/authenticated users and it would be all spam free (or at least, I get to block anonymous commenters who are unauthenticated). Am I holding my breath for that? No.</p> <p>I am grateful that Facebook has made a part of the internet (albeit a walled garden part) fully authenticated and therefore much more useful. I think most people don't understand how important that is, and how powerful that is, and that this bit is what makes Facebook worth whatever valuation Wall Street puts on it.</p> <p><strong>Baris</strong>: Piaw, a more fundamental question lurks within this discussion. Ultimately, will people gravitate toward others with similar interests and wait for resources to grow there (Facebook,) or go where the resources are mature, healthy, and growing fast, and wait for everyone else to arrive (Google?)</p> <p>Will people ultimately go to Google where amazing technology currently exists and will probably magnify, given the current trend (self driving cars, facial recognition, voice recognition, realtime language translation, impeccable geographic data, mobile OS with a bright future, unparalleled parallel computing, etc..) or join their friends first at the current largest social network, Facebook, and wait for the technology to arrive there?</p> <p>A hypothetical way of looking at this: Will people move to a very big city and wait for it to be an amazing city, or move to an already amazing city and wait for everyone else to follow suit? Or are people ok with a bunch of amazing small cities?</p> <p><strong>Piaw</strong>: Baris, I don't think you've got the analogy fully correct. The proper analogy is this: Would you prefer to live in a small neighborhood where you sometimes have to go a long way to get what you want/are interested in but is relatively crime free, or would you like to live in a big city where it's easy to get what you want but you get tons of spam and occasionally someone comes in and breaks into your house?</p> <p>The world obviously has both types of people, which is why suburbs and big cities both exist.</p> <p><strong>Baris</strong>: &quot;tons of spam and occasionally someone comes in and breaks into your house?&quot; I think this is a bit too draconian/general though... going with this analogy, I think becomes a bit more subjective, i.e. really depends on who you are in that city, where you live, what you own, how carefree you live your life, and so forth.</p> <p><strong>Piaw</strong>: Right. And Facebook has successfully designed a web-site around this ego-centricity. You can be the star of your tiny town by selectively picking your friends, or you can be the hub of a giant city and accept everyone as a friend. 
If the latter, then you gave up your privacy when your &quot;friend&quot; posts compromising pictures of you that get you in trouble with your employer.</p> <p><strong>Nick</strong>: Google is the only company with the right position, somewhat appropriate incentives, and possibly the right internal culture to be the &quot;Tim Berners Lee&quot; of the new social internet.</p> <p>I'd agree that Google hasn't done well at social, but surely they are better than that!</p> <p><strong>Rebecca</strong>: Oh, you aren't impressed with Tim Berners Lee's work? Was it the original HTML standard you didn't like, or the more recent W3C stuff? I would admit there is stuff to complain about in both of them.</p> <p><strong>Nick</strong>: It seems to me that TBL got lucky. His original work on the WWW was good, but I think it is difficult to argue he was responsible for its success - certainly no more than someone like Marc Andreessen, who has a pattern of success that repeated after his initial success with Mosaic.</p> <p><strong>Rebecca</strong>: @Piaw (a little ways back) So you found my free software polemic so unobjectionable as to be barely worth comment? Wasn't it a little intellectually radical, with all that &quot;not a capitalist&quot; and &quot;change in the nature of scarcity&quot; stuff? When I told Lessig parts of the basic story (not in the Google context, because it was many years ago), and asked him for advice about how to talk to economists, he warned me that the words I was using contained so many warning bells of crackpot intellectual radicalism that economists would immediately write me off for using them without any further consideration.</p> <p>It never ceases to amaze me how engineers will swallow shockingly strange ideas without a peep. I suppose in the company of Stallman and ESR, I am a raging intellectual conservative and pragmatist, and since engineers have accepted their style as at least a valid way to have a discussion (even if they don't agree with their politics), I seem tame by comparison. Of course talking to historians or economists is a different matter, because they don't already accept that this is a valid way to have a discussion.</p> <p>Actually, it is immensely useful to me to have this discussion thread to use to show people who might think I'm a crackpot, because it is evidence for the claim that in my own world nobody bats an eyelash at this way of talking.</p> <p>Incidentally, I started thinking about this subject because of Krugman. In the late nineties I was a rabid Krugman fan in a style that is now popular -- &quot;Krugman is always right&quot; -- but was a bit strange back then when he was just another MIT economics professor hardly anyone had ever heard of. However, when he talked about technology (<a href="http://pkarchive.org/column/61100.html">http://pkarchive.org/column/61100.html</a>), I thought he was wrong, which upset me terribly because I also was almost religiously convinced he was always right.
In another essay (<a href="http://pkarchive.org/personal/howiwork.html">http://pkarchive.org/personal/howiwork.html</a>) he said it was very important to him to &quot;Listen to the Gentiles&quot; i.e &quot;Pay attention to what intelligent people are saying, even if they do not have your customs or speak your analytical language.&quot; But he also said &quot;I have no sympathy for those people who criticize the unrealistic simplifications of model-builders, and imagine that they achieve greater sophistication by avoiding stating their assumptions clearly.&quot; So it seemed clear to me that he would be willing to hear me explain why he was wrong, as long as I would be willing to state my assumption clearly.</p> <p>Before I knew exactly what I was intending to say, my plan had been to figure out my assumptions well enough to meet his standards, and then ask him to help me do the rest of the work to cast it all into a real economic model. Back then he was just an MIT professor I'd taken a class from, not a famous NYTimes columnist, Nobel-prize winning celebratory, so this plan seemed natural. Profs at MIT don't object if their students point out mistakes, as long as the students are responsible about it. It took me a while to struggle through the process of figuring out what my assumptions were (assumptions? I have assumptions?). When I did I was somewhat horrified to realize that following through with my plan meant accosting him to demand he write a new Wealth of Nations for me! (He'd also left for Princeton by then and started to become famous, so my plan was logistically more difficult than I'd planned.) I had not originally realized what it was that I would be asking for, or that the whole thing would be so daunting.</p> <p>I asked Lessig for advice what to do (Lessig being the only person I knew who lived in both worlds) and Lessig read me the riot act about the rules of intellectual respectability. So it seemed it would be up to me to write the new Wealth of Nations, or at least enough of it to prove the respectability of the ideas contained therein. I was trying to be a computer science student, not an economist, so that degree of effort hardly fit into my plans. I tried to ask for help at the Lab for Computer Science (now CSAIL) by giving a talk in a Dangerous Ideas seminar series, but of the professors I talked to, only David Clark was sympathetic about the need for such a thing. However, he also said very clearly that resources to support grad students to work with economists were limited and really confined to only the kind of very specific net-neutrality stuff he was pushing in concert with his protocol work, not the kind of general worldview I was thinking about. So I was amazed to find that this kind of thing falls into the cracks between different parts of academic culture.</p> <p>I'm still not sure what to do, but I am more and more inclined to ignore Lessig's (implicit) advice to be apologetic and defensive about my lack of intellectual respectability. That would entail a degree of effort I can't afford, since I am still focused on proving myself as a computer scientist, not an intellectual in the humanities. (Having this discussion thread to point to is quite useful on that score.) I could just drop it (I did for a while), but I'm getting more and more upset that technology is moving much faster than the intellectual and social progress that is required to handle it. 
People seem to think that powerful technology is a good thing in itself, but that is not true: it is only technology in the presence of strong institutions to control its power that provides net benefits to society -- without such controls it can be fantastically destructive. From that point of view a &quot;new economy&quot; is not good news -- what &quot;new&quot; means is that all the old institutions are now out of date and their controls are no longer working. And academic culture is culturally dislocated in ways that ensure that no one anywhere is really dealing with this problem. Pretty picture, isn't it?</p> <p><strong>Nick</strong>: @Rebecca: I don't understand your argument. Why is Google selling advertising any more about freedom than Facebook selling advertising?</p> <p>It's true that Facebook doesn't make their social graph and/or demographic data available to third parties, but Google doesn't make a person's search history available to third parties either. Why is one so much worse than the other?</p> <p><strong>Piaw</strong>: Rebecca, I think that having more data be more open is ideal. However, I view it as a purely academic discussion for the same reason I view writing &quot;Independent Cycle Touring&quot; in TeX to be an academic discussion. Sure it could happen, but the likelihood of it happening is slim to none, so I don't find the discussion to be of interest.</p> <p>Now, I do agree that technology and its adoption do grow faster than our wisdom and controls for them. However, I don't think that information technology is the big offender. Humanity's big long-term problems have more to do with fossil fuels as an energy source, and that's pretty darn old technology. You can fix all the privacy problems in the world, but if we get a runaway greenhouse planet by 2100 it is all moot. Because of that you don't find me getting worked up about privacy or the openness of Facebook's social graph. If Facebook does become personally objectionable to me, then I will close my account. Otherwise, I will keep leveraging the work their engineers do.</p> <p><strong>Elliotte</strong>: Rebecca, going back and rereading your comments I'm not sure your analysis is right, but I'm not sure it's wrong either. Of course, I am not an economist. From my non-economist perspective it seems worth further thought, but I also suspect that economists have already thought much of this. The first thing I'd do is chat up a few economists and see if they say something like, &quot;Oh, that's Devereaux's Theory of Productive Capacity&quot; or some such thing.</p> <p>I guess I didn't see anything particularly radical and certainly nothing objectionable in what you wrote. You're certainly not the first to notice that software economics has a different relationship to scarcity than physical goods. Nor would I see that as incompatible with capitalism. It's only really incompatible with a particular religious view of capitalism that economists connected to the real world don't believe in anyway. The theological ideologues of the Austrian School and the nattering nabobs of television news will call you a commie (or more likely these days a socialist) but you can ignore them. Their claimed convictions are really just a bad parody of economics that bears only the slightest resemblance to the real world.</p> <p>You hear a lot from these fantasy world theorists because they have been well funded over the last 40 years or so by corporations and the extremely wealthy with the explicit goal of justifying wealth.
Academically this is most notable at the University of Chicago, and it's even more obvious in the pseudo-economics spouted on television news. At the extreme, these paid hucksters espouse the laissez-faire theological conviction that markets are perfectly efficient and rational and that therefore whatever the markets do must be correct; but the latest economic crises have caused most folks to realize that this emperor has no clothes. Economists doing science and not theology pay no attention to this priesthood. I wish the same could be said for the popular media.</p> <p><strong>Helder</strong>: I don't think I agree with the scarcity point that Rebecca made.</p> <p>Generally, if a company is making money from something it's because they are producing some kind of wealth; otherwise they won't sustain themselves economically. It doesn't have to be productive wealth like in factories, it could be cultural (e.g. a TV show), or something else.</p> <p>Even if you think of artificial scarcity, that's only possible for a company to create when they already have a lot of momentum (e.g. Windows or Facebook dominance). Artificial scarcity sucks when you look just at it, but it's more like a &quot;local&quot; optimization based on an already dominant market position.</p> <p>Perhaps Facebook, Microsoft and other companies wouldn't have thrived in the first place if they weren't &quot;allowed&quot; to make the most of their closed systems. The world is a better place with a closed Facebook and a proprietary Windows API than with no Facebook or Windows at all.</p> <p>TV producers try to do their best to create the right scarcity when releasing their shows and movies to maximize profit. If they were to adopt some kind of free and open philosophy and release their content for download on day 1, they would simply go broke and destroy wealth in the long run.</p> <p><strong>Rebecca</strong>: Thanks guys, for the great comments! I appreciate the opportunity to answer these objections, because this is a subtle issue and I can certainly see that the reasoning behind my position is far from obvious. I won't be able to do it today because I need to be out all day, and it's probably just as well that I have a little time to think of how to make the reply as clear and short as possible.</p> <p><strong>Rebecca</strong>: OK, I have about four different kinds of objections to answer, and I do want to keep this as short as I can, so I think I will arrange it carefully so I can use the ideas in one answer to help me explain the next one. That means I'll answer in the order: Elliotte, Piaw, then Nick &amp; Helder.</p> <p>It actually took me much of a week to write and edit an answer I liked and believed was as condensed as I could make it. And despite my efforts it is still quite long. However, your reaction to my first version has impressed on me that there are some key points I need to take the space to clarify:</p> <ol> <li><p>I shouldn't have tried to talk about a system that &quot;isn't capitalism&quot; in too short an essay, because that is just too easily misunderstood. I take a good bit of space in the arguments below matching Elliotte's disavowal of the religious view of capitalism with an explicit disavowal of the religious view of the end of capitalism.</p></li> <li><p>Piaw also asked a good question: &quot;why is this important?&quot; It isn't obvious; it's only something you can see once it sinks into you how dramatically decades of exponential technological growth can change the world.
Since this subject is pretty crazy-making and hard to see with any perspective, I try to use an image from the past to help us predict how people from the future will see us differently than we see ourselves. I want to impress on you why future generations are likely to make very different judgements about what is and isn't important.</p></li> <li><p>Finally, I said rather casually that I wanted to talk about software freedom in the standard tradition, only with more intellectual and historical perspective. As I write this, though, I'm realizing the historical perspective actually changes the substance of the position, in a way I need to make clear.</p></li> </ol> <p>And last of all I wanted to step back again and put this all in the context of what I am trying to accomplish in general, with some commentary on your reactions to the assertion that I am being intellectually radical.</p> <p>These replies are split into sections so you can choose the one you like if the whole thing is too long. But the long answer to Piaw contains the idea which is key to the rest of it.</p> <p><strong>Rebecca</strong>: So, first, @Elliotte -- &quot;I'm not the first to notice that software has a different relationship with scarcity than physical goods.&quot; But my take on the difference is not the usual: I am not repeating the infinitely-copyable thing everyone goes on about, but instead focusing on the scarcity (or increasing lack thereof) of productive capacity. That way of talking challenges more directly the fundamental assumptions of economic theory, and is therefore more intellectually radical: in a formal way, it challenges the justification for capitalism. But you didn't buy my &quot;incompatible with capitalism&quot; argument either, which I'm glad of, because it gives me the chance to mention that just as much as you want to disown the religious view of what capitalism is, I'd like to specifically disown the religious view of the end of capitalism.</p> <p>Marx talked about an &quot;end of capitalism&quot; as some magic system where it becomes possible for workers to seize the means of production (the factories) and make the economy work without ownership of capital. He was also predicting that capitalism must eventually end, because after all, feudalism had ended. But if you put those two assertions together, and solved the logical syllogism, you would get the assertion that feudalism ended because the serfs seized the means of production (the farms) and made an economy work without the ownership of land. That isn't true! I grew up in Iowa. There are landowners there who own more acreage than most fabled medieval kings. Nobody challenges their ownership, and yet nobody would call that system feudalism. Why not? Because their fields are harvested by machines, not serfs. Feudalism ended not because the landowning class changed their landowning ways. It was because the land-working class, the serfs, left for better jobs in factories; and the landowners don't care anymore, because they eventually replaced the serfs with machines. The end of feudalism was not the end of the ownership of land, it was the end of a social position and set of prerogatives that went along with that ownership.
If your vassals are machines, you can't lord over them.</p> <p>Similarly, in a non-religious view of the end of capitalism, it will come about not because the capitalist class, the class that owns factories, will ever disappear or change their ways, but because the proletariat will go away -- they will leave for better jobs doing something else, and the factory owners will replace them with machines. And in fact you can see that that is already happening. Are you proletariat? Am I? If I create an STL model and have it printed by Shapeways, I am manufacturing something, but I am not proletariat. Shapeways is certainly raising capital to buy their printers, which strictly speaking makes them &quot;capitalists,&quot; but in a social sense they are not capitalists, because their relationship with me has a different power structure from the one Marx objected to so violently. I am not a &quot;prole&quot; being lorded over by them. It isn't the big dramatic revolution Marx envisioned; it is so subtle you can almost miss it entirely. What if capitalism ended and nobody noticed?</p> <p><strong>Rebecca</strong>: Next @Piaw -- Piaw said he didn't think information technology was the biggest offender in the realm of technology that grows faster than our controls of it; for instance he thought global warming was a more pressing immediate problem.</p> <p>I definitely agree that the immediate problems created by information technology and the associated social change are, right now, small by comparison to global warming. It would be nice if we could tackle the most immediate and pressing problems first, and leave the others until they get big enough to worry about. But the problems of a new economy have the unique feature of being pressing not because they are necessarily immediate or large (right now), but because if they are left undealt-with they can destroy the political system's ability to effectively handle these or any other problems.</p> <p>I'm a believer in understanding the present through the lens of the past: since we have so much more perspective about things that happened many, many years ago, we can interpret the present and predict our future by understanding how things that are happening to us now are analogous to things that happened long ago. Towards that end, I'd like to point out an analogy with a fictional image of people who, very early on in the previous &quot;new economy,&quot; tried to push new ideas of freedom and met with the objection that they were making too big a deal over problems that were too unimportant. (That this image is fictional is part of my point -- bear with me.) My image comes from a dramatic scene in the musical 1776 (whose synopsis can be found at <a href="http://en.wikipedia.org/wiki/1776_%28musical%29">http://en.wikipedia.org/wiki/1776_%28musical%29</a>, scene seven), in which an &quot;obnoxious and disliked&quot; John Adams almost throws away the vote of Edward Rutledge and the rest of the southern delegation over the insistence that a condemnation of slavery be included in the Declaration of Independence.
He drops this insistence only when he is persuaded to change his mind by Franklin's arguments that the fight with the British is more important than any argument on the subject -- &quot;we must hang together or we will hang separately.&quot;</p> <p>In fact, nothing like that ever happened: as the historical notes on the Wikipedia page say, everyone at the time was so totally in agreement that the issue was too unimportant to be bothered to fight about it, let alone have the big showdown depicted in the musical, with Rutledge dramatically but improbably singing a spookily beautiful song in defense of the triangle trade: &quot;Molasses to Rum to Slaves.&quot; The scene was inserted to satisfy the sensibilities of modern audiences that whether or not such a showdown happened, it should have happened.</p> <p>Why are our sensibilities so different from reality? Why are we imposing on the past the idea that the fight ought to have been important to them, even though it wasn't, that John Adams ought to have made himself obnoxious and disliked in his intransigent insistence on America's founding values of freedom, even though he didn't and he wasn't, that Franklin ought to have argued with great reluctance that the fight with the British was more important, even though he never made that argument (because it went without saying), and that Edward Rutledge ought to have been a spooky, equally intransigent apologist for slavery, even though he wasn't either (later he freed his own slaves)? We are imposing this false narrative because we are looking backwards through a lens where we know something about the future the real actors had no idea about. This is important to understand because we may be in a similar position with respect to future generations -- they will think we should have had a fight we in fact have no inclination to have, because they will know something we don't know about our own future. The central argument I want to make to Piaw hinges on an understanding of this thing that later generations are likely to know about our future that we currently have difficulty imagining.</p> <p>So forgive me if I belabor this point: it is key to my answer both to Piaw's question and also to Nick &amp; Helder's objection. It's going to take a little bit of space to set up the scenery, because it is non-trivial for me to pull my audience back into a historical mentality very different from our own. But I want to go through this exercise in order to pull out of it a general understanding of how and why political ways of thinking shift in the face of dramatic technological change -- which we can use to predict our own future and the changing shape of our politics.</p> <p>What is it that the real people behind this story didn't know that we know now? Start with John Adams: to understand why the real John Adams wouldn't have been very obnoxious about pushing his idea of freedom on slaveowners in 1776, realize that his idea of freedom, if restated in economic rather than moral terms, would have been the assertion that &quot;it should be an absolute right of all citizens of the United States to leave the farm where they were born and seek a better job in a factory.&quot; But making a big deal about such a right in 1776 would have been absurd. There weren't very many factories, and they were sufficiently inefficient that the jobs they provided were unappealing at best.
For example, at the time Jefferson wrote in total seriousness about the moral superiority of agrarian over industrial life: such a sentiment seemed reasonable in 1776, because, not to put too fine a point on it, factory life was horrible. Because of this, the politicians in 1776, like Adams or Hamilton, who were deeply enamored of industrialization, pushed their obsession with an apologetic air, as if they were only talking about their own personal predilections, which they took great pains to make clear they were not going to impose on anyone else. The real John Adams was not nearly as obnoxious as our imaginary version of him: we imagine him differently only because we wish he had been different.</p> <p>We wish him different than he really was because there was one important fact that the people of 1776 may have understood intellectually, but whose full social significance they did not even begin to wrap their minds around: the factories were getting better exponentially, while the farms would always stay the same. Moore's Law-like growth rates in technology are not a new phenomenon. Improvements in the production of cotton textiles in the early nineteenth century stunned observers like the improvements in chips or memory impress us today -- and after cotton-spinning had its run, other advances stepped into the limelight each in turn, as the article at www.theatlantic.com/magazine/archive/1999/10/beyond-the-information-revolution/4658/ tries to impress on us. We forget that dramatic exponential improvements in technology are not a new phenomenon. We also forget that if exponential growth runs for decades, it changes things... and it changes things more than anybody at the beginning of such a run dares to imagine.</p> <p>This brings us to the other characters in our story who made choices we now wish they had made differently (and they also later regretted). Edward Rutledge and Thomas Jefferson didn't exactly defend slavery; they were quite open about being uncomfortable with it, but they didn't consider this discomfort important enough to do much about. That position would also have made sense in 1776: landowners had owned slaves since antiquity, but slavery in ancient times was not fantastically onerous compared to the other options available to the lower classes at the time -- there are famous stories of enterprising Greek and Roman slaves who earned their freedom and rose to high positions in society. Rutledge and Jefferson probably thought they were offering their slaves a similar deal, and that all in all, it wasn't half bad.</p> <p>They were wrong. American slavery turned out to be something unique, entirely different than the slavery of antiquity. My American history teacher presented this as a &quot;paradox,&quot; that the country that was founded on an ideal of freedom also was home to the most brutal system of slavery the world has ever seen. But I think this &quot;paradox&quot; is quite understandable: it is two parts of the same phenomenon. Ask the question: why could ancient slaveowners afford to be relatively benign? Because they were also relatively secure in their position -- their slaves knew as well as they did that the lower classes didn't have many other better options. Sally Hemmings, Jefferson's lover, considered running away when she was in France with him, but Jefferson successfully convinced her that she would get a better deal staying with him. 
He didn't have to take her home in chains: she left the possibility of freedom in France and came back of her own free will (if slightly wistfully).</p> <p>But as time passed and the factory jobs in the North proceeded in their Moore's Law trajectory, eventually the alternatives available to the lower classes began to look better than in any time before in human history. The slaves Harriet Tubman smuggled to Canada arrived to find options exponentially better than those Hemmings could have hoped for if she had left Jefferson. As a result, for the first time in human history, slaves had to be kept in chains.</p> <p>In the more abstract terms I was using before, slavery was relatively benign when the scarcity of opportunity that bound slaves to their masters was real, but as other opportunities became available, this &quot;real scarcity&quot; became &quot;artificial,&quot; something that had to be enforced with chains -- and laws. That is where the slaveowners transformed into something uniquely brutal: to preserve their way of life they needed not only to put their slaves in chains, they also needed to take over the political and legal apparatus of society to keep those chains legal. There came into existence the one-issue politician -- the politician whose motive to enter political life was not to understand or solve the problems facing the nation, to listen to other points of view or forge compromises, or any of the other natural things that a normal politician does, but merely to fight for one issue only: to write into law the &quot;artificial scarcity&quot; that was necessary to preserve the way of life of his constituents, and play whatever brutal political tricks were necessary to keep those laws on the books. Political violence was not off the table - a recent editorial &quot;When Congress Was Armed And Dangerous&quot; (www.nytimes.com/2011/01/12/opinion/12freeman.htm) reminds us that that the incitements to violence of today's politics are tame compared to the violence of the politics of the 1830's, 40's and 50's. The early 1860's were the culmination of the decades-long disaster we wish the Founding Fathers had foreseen and averted. We wish they had had the argument about slavery while there was still time for it to be a mere argument -- before the elite it supported poisoned the political system to provide for its defense.</p> <p>They, in their old age, wished it too: forty-five years after Jefferson declined to make slavery an important issue in the debate over the Declaration of Independence, he was awakened by the &quot;firebell in the night&quot; in the form of the Missouri compromise. News of this fight caused him to wake up to the real situation, and he wrote to a friend &quot;we have the wolf by the ears, and we can neither hold him, nor safely let him go. Justice is in one scale, and self-preservation in the other.... 
I regret that I am now to die in the belief that the useless sacrifice of themselves by the generation of '76, to acquire self government and happiness to their country, is to be thrown away by the unwise and unworthy passions of their sons, and that my only consolation is to be that I live not to weep over it.&quot;</p> <p>So, forty-five years after he declined to engage with an &quot;unimportant,&quot; &quot;academic&quot; question, he said of the consequences of that decision that his &quot;only consolation is to be that I live not to weep over it.&quot; He had not counted on the &quot;unwise and unworthy passions&quot; of his sons -- for his own part, he would have been happy to let slavery lapse when economic conditions no longer gave it moral justification. However, the next generation had different ideas -- they wanted to do anything it took to preserve their prerogatives. By that point the choices he had were defined by the company he kept: since he was a Virginian, he would have had to go to war for Virginia, and fight against everything he believed in. He would have wanted to go back to the time when he could have made a choice that was his own, but that time was past and gone, and no matter how &quot;unwise and unworthy&quot; were the passions which were now controlling him, he had no choice but to be swept along by them.</p> <p>This is my argument about why we should pay attention to &quot;unimportant&quot; and &quot;academic&quot; questions. In 1776 it was equally &quot;academic&quot; to consider looking ahead through seventy-five years of exponential growth to project the economic conditions of 1860, and use that projection to motivate a serious consideration of abstract principles that were faintly absurd in the conditions of the time, and would only become burning issues decades and decades later. Yet we wish they had done just that, and in their old age they also fervently wished that they had too. This seems strange: why plan for 1860 in 1776? Why plan for 2085 in 2010? Why not just cross that bridge when we come to it? Let the next generation worry about their own problems; why should we think for them? We have our own burning issues to worry about! The projected problems of 2085 are abstract, academic, and unimportant to us. Why not leave them alone and worry about our present burning concerns?</p> <p>The difficulty is that if we do leave them alone, if we don't project the battle over our values absurdly into the future and start dealing with the shape of our conflict as it will look when transformed by many decades of time and technological change, we may well lose the political freedom of action to solve these problems non-violently -- or to handle any others either. We will have &quot;a wolf by the ears.&quot; We wish the leaders of 1776 had envisioned and taken seriously the problems of 1860, because in 1776 they were still reasonable people who could talk to each other and effectively work out a compromise. By 1860 that option was no longer available. The problem is that when these kinds of problems eventually stop being &quot;academic,&quot; when they stop being the dreams of intellectuals and become burning issues for millions of real people, the fire burns too hot. Too many powerful people choose to &quot;take a wolf by the ears&quot;. This wolf may well consume the entire political and legal system and make it impossible to handle that problem or any other, until the only option left to restore the body politic is civil war.
Once that happens everyone will fervently wish they could go back to the time when the battles were &quot;merely academic&quot;.</p> <p>I worked out this story around 2003, because starting in 1998 I had wanted to have a name to give to a nameless anxiety (in between, I thrashed around for quite a while figuring out which historical image I believed in the most). When I was sure, I considered going to Krugman to use this story to fuel a temper tantrum about how he absolutely had to stop ignoring the geeks who tried to talk to him about &quot;freedom.&quot; But I was inhibited: I was afraid the whole argument would come across as intellectually suspect and emotionally manipulative. Besides, the immediate danger this story predicted -- that politics would devolve into 1830's style one-issue paralysis -- seemed a bit preposterous in 2003. Krugman wasn't happy about the 2002 election, but it wasn't that bad. But now I feel some remorse in the other direction: it has gotten worse faster than I ever dreamed it would. I didn't predict what has been happening exactly. I was very focussed on tech, so I didn't expect the politicians in the service of the powerful people with &quot;a wolf by the ears&quot; to be funded by the oldest old economy powers imaginable -- banking and oil. That result isn't incompatible with this argument: that very traditional capitalism should gain an unprecedented brutality just when the new economy is promising new freedoms, is, this line of reasoning predicts, exactly what you should expect. I'm afraid now that Krugman will be mad at me for not bothering him in 2003, because he would have wanted the extra political freedom of action more than he would have resented the very improper intellectual argument.</p> <p><strong>Rebecca</strong>: Now that I've laid the groundwork, it is much easier for me to answer Nick and Helder. Both of you are essentially telling me that I'm being unreasonable and obnoxious. I will break dramatically with Stallman by completely conceding this objection across the board. I am being unreasonably obnoxious. However, there is a general method to this madness: as I explained in the image above, I am essentially pushing values that will make sense in a few decades, and pulling them back to the current time, admittedly somewhat inappropriately. The main reason I think it is important to do this is not because I think the values I am promoting should necessarily apply in an absolute way right now (as Stallman would say) but instead because it is a lot easier to start this fight now than to deal with it later. The reason to fight now is exactly because the opponents are still reasonable, whereas they might not be later. Unlike Stallman, I want to emphasize my respect (and gratitude) for reasonable objections to my position. My opponents are unlikely to shoot me, which is not a priviledge to be taken for granted, and one I want to take advantage of while I have it.</p> <p>To address the specifics of your objections: Helder complained that companies needed the tactics I called &quot;exploitation of artificial scarcity&quot; to recoup their original investment -- if that wasn't allowed, the service wouldn't exist at all, which would be worse. Nick objected that 80 or 90% of Facebook's planned revenue was from essentially similar sources as Google's, so why should I complain just because of the other 10 or 20%? 
That was what I was complaining about -- that a portion of their revenue comes from closing their platform and taxing developers -- but that is only a small part of Ms. Sandberg's diversified revenue plans, and I admit that the rest is fairly logically indistinguishable from Google's strategy. In both cases it can easily be argued I am taking an extremely unreasonable hard line.</p> <p>Let's delve into a dissection of how unreasonable I'm being. In both cases the unreasonableness comes from a problem with my general argument: I said that Mark Zuckerberg is not a capitalist, that is to say, he is not raising capital to buy physical objects that make his workers more productive -- but that is not entirely true. Facebook's data centers are expensive, and they are necessary to allow his employees to do their work.</p> <p>The best story on this subject might also be the exception to prove the rule. The most &quot;capitalist&quot; story about a tech mogul's start is the account of how Larry &amp; Sergey began by maxing out their credit cards to buy a terabyte of disk (<a href="http://books.google.com/books?id=UVz06fnwJvUC&amp;pg=PA6#v=onepage&amp;q&amp;f=false">http://books.google.com/books?id=UVz06fnwJvUC&amp;pg=PA6#v=onepage&amp;q&amp;f=false</a>) This story could have been written by Horatio Alger -- it so exactly follows the script of a standard capitalist's start. But for all that, L&amp;S did not make all the standard capitalist noise. I was a fan of Google very early, and may even have pulled data from their original terabyte, and I never heard about how they needed to put restrictions on me to recoup their investment. A year or so later when I talked to Googlers at their recruiting events, I thought they were almost bizarrely chipper about their total lack of revenue strategy. Yet they got rich anyway. And now that same terabyte costs less than $100.</p> <p>That last is the key point: it is not that the investments aren't significant and need to be recouped. It is that their size is shrinking exponentially. In the previous section, I emphasized the enormous transformative effect of decades of exponential improvement in technology, and the importance of extrapolating one's values forward through those decades, even if that means making assertions that currently seem absurd. The investments that need to be recouped are real and significant, but they are also shrinking, at an exponential rate. So the economic basis for the assertion of a right to restrict people's freedom of opportunity in order to recoup investment is temporary at best. And, as I described in the last part, asserting prerogatives on the basis of scarcity which is now real but will soon be artificial is ... dangerous. Even if you honestly think that you will change with changing times, you may find to your sincere horror that when the time comes to make a new choice, you no longer have the option. Your choices will be dictated by the company you keep. I didn't say it earlier, but one of the things that worries me the most about Facebook is that they seem to have gotten in bed with Goldman Sachs. The idea of soon-to-be multibillionare tech moguls taking lessons in political tactics from Lloyd Blankfein doesn't make me happy.</p> <p>I am glad that you objected, and gave me licence to take the space to explain this more carefully, because actually my point is more subtle and nuanced than my original account -- which I was oversimplifying to save space -- suggested. 
(Lessig told me I had to write an account that fit in five pages in order to hope to be heard, to which I reacted with some despair. I can't! There is more than five pages worth of complexity to this problem! If I try to reduce beyond a certain point I burn the sauce. You are watching me struggle in public with this problem.)</p> <p>There is another part of the mythology of &quot;the end of capitalism&quot; that I should take the time to disavow. The mythology talks as if there is one clear historical moment when an angel of annunciation appears and declares that a new social order has arrived. In reality it isn't like that. It may seem like that in history books when a few pages cover the forty-five years between the time Jefferson scratched out his denunciation of slavery in the Declaration of Independence, and when he wrote the &quot;firebell in the night&quot; letter. But in real, lived, life, forty-five years are most of an adult lifetime. When Jefferson wrote &quot;justice is in one scale, self-preservation in the other,&quot; could he point to a particular moment in the previous forty-five years when the hand of justice had moved the weight to the other side of the scale? There was no one moment: just undramatic but steady exponential growth in the productivity of the industrial life whose moral inferiority seemed so obvious four decades earlier. He hadn't been paying attention, no clarion call was blown at the moment of transition (if such a moment even existed), so when he heard the alarm that woke him up to how much the world had changed, it was too late.</p> <p>In a similar way, I think the only clear judgement we can make now is that we are in the middle of a transition. There was some point in time when disk drives were so expensive and the data stored on them so trivial that the right of their owners to recoup their investment clearly outweighed all other concerns. There will be some other point, decades from now, when the disk drives are so cheap and the data on them so crucial to the livelihood of their users and the health of the economy, that the right to &quot;software freedom&quot; will clearly outweigh the rights of the owner of the hardware. There will be some point clearly &quot;too early&quot; to fight for new freedoms, and some point also clearly &quot;too late.&quot; In between these two points in time? Merely a long, slow, steady pace of change. In that time the hand of justice may refuse to put the weight on either side of the scale, no matter how much we plead with her for clarity. We may live our whole lives between two eras, where no judgement can be made black or white, where everything is grey.</p> <p>But people want solid judgement: they want to know what is right and wrong. This greyness is dangerous, for it opens a vacuum of power eagerly filled by the worst sorts, causes a nameless anxiety, and induces political panic. So what can you do? I think it is impossible to make black-and-white judgements about which era's values should apply. But one can say with certainty that it is desirable to preserve the freedom of political action that will make it possible to defend the values of a new era at the point when it becomes unequivocally clear that that fight is appropriate. I'm really not very upset if companies do whatever it takes to recoup their initial investment -- as long as it's temporary. 
But what guarantee do I have that if Facebook realizes its 50 billion market capitalization, they won't use that money to buy politicians to justify continuing their practices indefinitely? Their association with Goldman doesn't reassure me on that score. I trust Google more to allow, and even participate in, honest political discussion. That is the issue which I'm really worried about. The speed and effectiveness with which companies are already buying the political discourse has taken me by surprise, even though I had reason to predict it. When powerful people &quot;have a wolf by the ears&quot; they can become truly terrifying.</p> <p><strong>Rebecca</strong>: You could have a point that this kind of argument isn't as unorthodox as it once was. After all, plenty of real economists have been talking about the &quot;new economy&quot; -- Peter Drucker in &quot;Beyond the Information Revolution&quot; (linked above), Hal Varian in &quot;Information Rules&quot;, Larry Summers in a speech &quot;The New Wealth of Nations&quot;, even Krugman in his short essay &quot;The Dynamo and the Microchip&quot;, and a bunch of younger economists surveyed in David Brooks's recent editorial &quot;The Protocol Society.&quot; But I don't share Brooks's satisfaction in the observation &quot;it is striking how [these &quot;new economy&quot; theorists] are moving away from mathematical modeling and toward fields like sociology and anthropology.&quot; There is a sense that my attitude is now more orthodox than the orthodoxy -- though the argument I sketched here is not a mathematical model, it was very much designed and intended to be turned into one, and I am vehemently in agreement with Krugman's attitude that nothing less than math is good enough. Economics is supposed to be a science where mathematical discipline forces intellectual honesty and provides a bulwark of defense against corruption.</p> <p>I'm in shock that this crop of &quot;new economy talk&quot; is so loose, sloppy and journalistic ... and because it is so intellectually sloppy, it is hard even to tell whether it is corrupt or not. For instance, though I liked the historical observations and the conclusions drawn from them in Drucker's 1999 essay as much as anything I've read, his paean to the revolutionary effects of e-commerce reads so much like dot.com advertising it is almost embarrassing. Though, to be fair, there are some hints at the essay's conclusion of a consciousness that an industrial revolution isn't just about being sprinkled with technological fairy dust magic, but also involves some aspect of painful social upheaval -- even so, his story is so strangely upbeat, especially since, given his clearly deep understanding of historical thinking, he should have known better ... one wonders whom he was trying not to offend. Similarly, can we trust Summer's purported genius or do his millions in pay from various banks, um, influence his thinking? and it goes on... Krugman is about the only one left I'm sure I can trust. People who do this for a living should be doing better than this! Even though I understand Lessig's point about my intellectual radicalism, it's hard for me to want to follow it, because some part of me just wants to challenge these guys to show enough intellectual rigor to prove that I can trust them.</p> <p>To be fair, I admit part of Brooks's point that there is something &quot;anthropological&quot; about a new economy: it tends to drive everyone mad. 
Think about the new industrial rich of the nineteenth century -- dressed to the nines like pseudo-aristocrats, top hat, cane, affected accent, and maybe a bought marriage to get themselves a title too -- they were cuckoo like clocks! Deep, wrenching, technologically-driven change does that to people. But just because it is madness doesn't mean it doesn't have a method to it amenable to mathematical modeling. Krugman wrote (<a href="http://pkarchive.org/personal/incidents.html">http://pkarchive.org/personal/incidents.html</a>) that when he was young, Asimov's &quot;Foundation Trilogy&quot; inspired him to dream of growing up to be a &quot;psychohistorian who use[s his] understanding of the mathematics of society to save civilization as the Galactic Empire collapses&quot; but he said &quot;Unfortunately, there's no such thing (yet).&quot; Do you think he could be tempted by a real opportunity to be a &quot;psychohistorian&quot;?</p> <p>I never meant for this to be a fight I pursue on my own. The whole reason I translated it into an economic and historical language is that I wanted to convince the people who do this for a living to take it up for me. I can't afford to fight alone: I don't have the time to spend caught up in political arguments, nor can I afford to make enemies of people I might want to work for. I'm making these arguments here mostly because having a record of a debate with other tech people will help convince intellectuals of my seriousness. I'm having some difficulty getting you to understand, but I think I would have a terrible uphill battle trying to convince intellectuals that I am not &quot;crying wolf&quot; -- they have just heard this kind of argument misused too many times before. I have to admit that I am crying wolf, but the reason I'm doing it is because this time there really is a wolf!</p> <p><strong>Piaw</strong>: Here's the thing, Rebecca: it wasn't possible to have that argument about freedom/slavery in 1776. The changes brought about later made it possible to have that argument much later. The civil war was horrifying, but I really am not sure if it was possible to change the system earlier.</p> <p><strong>Ruchira</strong>: Hi Rebecca,</p> <p>I haven't yet read this long conversation. But if you're not already familiar with the concepts of rivalrous vs nonrivalrous and excludable vs nonexcludable</p> <p><a href="http://en.wikipedia.org/wiki/Rivalry_(economics)">http://en.wikipedia.org/wiki/Rivalry_(economics)</a></p> <p>these terms might help connect you with what others have thought about the issues you're talking about. See in particular the &quot;Possible solutions&quot; under Public goods:</p> <p><a href="http://en.wikipedia.org/wiki/Public_good">http://en.wikipedia.org/wiki/Public_good</a></p> <p><strong>Daniel Stoddart</strong>: I've said it before and I'll say it again: I wouldn't be so quick to count Google out of social. Oh, I know it's cool to diss Buzz like Scoble has been doing for a while now, saying that he has more followers on Quora. But that's kind of an apples and oranges comparison.</p> <p><strong>Ruchira</strong>: Rebecca: Okay, now I have read the long conversation. I do think you have an important point but I haven't digested it enough to form an opinion (which would require judging how it interconnects with other important issues). Just a couple of tangential thoughts:</p> <p>1) If you fear the loss of freedom, watch out for the military-industrial complex. 
You've elsewhere described some of the benefits from it, but this is precisely why you shouldn't be lulled into a false sense of comfort by these benefits, just as you're thinking others should not be lulled into a false sense of comfort about the issues you're describing. Think about the long-term consequences of untouchable and unaccountable defense spending, and about the interlocking attributes of the status quo that keep it untouchable and unaccountable. They are fundamentally interconnected with information hiding and lack of transparency.</p> <p>2) There exists a kind of psychohistory: cliodynamics. <a href="http://cliodynamics.info/">http://cliodynamics.info/</a> As far as I know it's not yet sufficiently developed to apply to the future, though.</p> <p><strong>Ruchira</strong>: Rebecca: On that note, I wonder what you think of Noam Scheiber's article &quot;Why Wikileaks Will Kill Big Business and Big Government&quot; <a href="http://www.tnr.com/article/politics/80481/game-changer">http://www.tnr.com/article/politics/80481/game-changer</a> He's certainly thinking about how technology will cause massive changes in how society is organized.</p> <p><strong>Helder</strong>: (note: I didn't read the whole thing with full attention) In the case of some closed systems, the cost of making them open (and the lack of a business justification for doing so), together with a general necessity to protect the business, usually outweighs the need for a return on investment by far. So it's not all about ROI.</p> <p>Also, society's technological development and the shrinking capital needs for new businesses (e.g. the cost of a terabyte) don't usually favor a closed-system business in the long run; they probably only weaken it. You can have a walled garden, but as the outside ground level goes up, the wall gets shorter and shorter. Just look at how the operating system is increasingly less relevant as most action gravitates towards the browser. Another example (perhaps to be seen?) is the credit card industry, as I mentioned in my first comment.</p> <p><strong>Rebecca</strong>: Thanks for reading this long, long post and giving me feedback!</p> <p><strong>Ruchira</strong>: Helder: Facebook makes the wall shorter for its developers (I'm sure Zynga thinks it has grown wealthy thanks to Facebook). This directly caused an outcry over privacy (the walled garden is not walled any more).</p> <p><strong>Rebecca</strong>: Hope you find it food for thought! You might also be interested in David Singh Grewal's Network Power <a href="http://amzn.to/h72nNJ">http://amzn.to/h72nNJ</a> It discusses a lot of relevant issues, and doesn't assume a lot of background (since it's targeted at multiple disciplines), so, as an outsider like you, I found it very helpful. After that, you might (or might not) become interested in the coordination problem -- if you do, Richard Tuck's Free Riding <a href="http://amzn.to/f9goyT">http://amzn.to/f9goyT</a> may be of interest.</p> <p><strong>Rebecca</strong>: Thanks, Ruchira, for the links.</p> How does Boston compare to SV and what do MIT and Stanford have to do with it? mit-stanford/ Fri, 01 Jan 2010 00:00:00 +0000 mit-stanford/ <p><em>This is an archive of an old Google Buzz conversation on MIT vs. Stanford and Silicon Valley vs. Boston</em></p> <p>There's no reason why the Boston area shouldn't be as much a hotbed of startups as Silicon Valley is. By contrast, there are lots of reasons why NYC is no good for startups. 
Nevertheless, Paul Graham gave up on the Boston area, so there must be something that hinders startup formation in the area.</p> <p><strong>Kevin</strong>: This has nothing to do with money, or talent, or what have you. All that matters is &quot;entrepreneur density&quot;.</p> <p>Boston may have the money, the talent, the intelligence, but does it have an entrepreneurial spirit and enough of a density?</p> <p><strong>Marya</strong>: From <a href="http://www.xconomy.com/boston/2009/01/22/paul-graham-and-y-combinator-to-leave-cambridge-stay-in-silicon-valley-year-round/">http://www.xconomy.com/boston/2009/01/22/paul-graham-and-y-combinator-to-leave-cambridge-stay-in-silicon-valley-year-round/</a> &quot;Graham says the reasons are mostly personal, having to do with the impending birth of his child and the desire not to try and be a bi-coastal parent&quot;. But then immediately after, we see he says: &quot;Boston just doesn’t have the startup culture that the Valley does. It has more startup culture than anywhere else, but the gap between number 1 and number 2 is huge; nothing makes that clearer than alternating between them.&quot; Here's an interview: <a href="http://www.xconomy.com/boston/2009/03/10/paul-graham-on-why-boston-should-worry-about-its-future-as-a-tech-hub-says-region-focuses-on-ideas-not-startups-while-investors-lack-confidence/">http://www.xconomy.com/boston/2009/03/10/paul-graham-on-why-boston-should-worry-about-its-future-as-a-tech-hub-says-region-focuses-on-ideas-not-startups-while-investors-lack-confidence/</a> Funny, because Graham seemed partial to the Boston area, earlier: <a href="http://www.paulgraham.com/cities.html">http://www.paulgraham.com/cities.html</a> <a href="http://www.paulgraham.com/siliconvalley.html">http://www.paulgraham.com/siliconvalley.html</a></p> <p><strong>Rebecca</strong>: I think he's partial because he likes the intellectual side of Boston, enough to make him sad that it doesn't match SV for startup culture. I know the feeling. I guess I have seen things picking up here recently, enough to make me a little wistful that I have given my intellectual side priority over any entrepreneurial urges I might have, for the time being.</p> <p><strong>Scoble</strong>: I disagree that Boston is #2. Seattle and Tel Aviv are better and even Boulder is better, in my view.</p> <p><strong>Piaw</strong>: Seattle does have a large number of Amazon and Microsoft millionaires funding startups. They just don't get much press. I wasn't aware that Boulder is a hotbed of startup activity.</p> <p><strong>Rebecca</strong>: On the comment &quot;there is no reason Boston shouldn't be a hotbed of startups...&quot; Culture matters. MIT's culture is more intellectual than entrepreneurial, and Harvard even more so. I'll tell you a story: I was hanging out in the MIT computer club in the early nineties, when the web was just starting, and someone suggested that one could claim domain names to make money reselling them. Everyone in the room agreed that was the dumbest idea they had ever heard. It was crazy. Everything was available back then, you know. And everyone in that room kind of knew they were leaving money on the ground. And yet we were part of this club that culturally needed to feel ourselves above wanting to make money that way. Or later, in the late nineties I was hanging around Philip Greenspun, who was writing a book on database-backed web development. He was really getting picked on by professors for doing stuff that wasn't academic enough, that wasn't generating new ideas. 
He only barely graduated because he was seen as too entrepreneurial, too commercial, not original enough. Would that have happened at Stanford? I read an interview with Rajeev Motwani where he said he dug up extra disk drives whenever the Google founders asked for them, while they were still grad students. I don't think that would happen at MIT: a professor wouldn't give a grad student lots of stuff just to build something on their own that they were going to commercialize eventually. They probably would encounter complaints that they weren't doing enough &quot;real science&quot;. There was much resentment of Greenspun for the bandwidth he &quot;stole&quot; from MIT while starting his venture, for instance, and people weren't shy about telling him. I'm not sure I like this about MIT.</p> <p><strong>Piaw</strong>: One of my friends once turned down a full-time offer at Netscape (after his internship) to return to graduate school. He said at that time, &quot;I didn't go to graduate school to get rich.&quot; Years later he said, &quot;I succeeded... at not getting rich.&quot;</p> <p><strong>Dan</strong>: As the friend in question (I interned at Netscape in '96 and '97), I'm reasonably sure I wouldn't have gotten very rich by dropping out of grad school. Instead, by sticking with academia, I've managed to do reasonably well for myself with consulting on the side, and it's not like academics are paid peanuts, either.</p> <p>Now, if I'd blown off academia altogether and joined Netscape in '93, which I have to say was a strong temptation, things would have worked out very differently.</p> <p><strong>Piaw</strong>: Well, there's always going to be another hot startup. :-) That's what Reed Hastings told me in 1995.</p> <p><strong>Rebecca</strong>: A venture capitalist with Silicon Valley habits (a very singular and strange beast around here) recently set up camp at MIT, and I tried to give him a little &quot;Toto, you're not in Kansas anymore&quot; speech. That is to say, I was trying to tell him that the habits one got from making money from Stanford students wouldn't work at MIT. It isn't that one couldn't make money investing in MIT students -- if one was patient enough, maybe one could make more, maybe a lot more. But it would only work if one understood how utterly different MIT culture is, and did something different out of an understanding of what one was buying. I didn't do a very good job talking to him, though; maybe I should try again by stepping back and talking more generally about the essential difference of MIT culture. You know, if I did that, maybe the Boston mayor's office might want to hear this too. Hmmm... you've given me an idea.</p> <p><strong>Marya</strong>: Apropos, Philip G just posted about his experience attending a conference on angel investing in Boston: <a href="http://blogs.law.harvard.edu/philg/2010/06/01/boston-angel-investors/">http://blogs.law.harvard.edu/philg/2010/06/01/boston-angel-investors/</a> He's in cranky old man mode, as usual. I imagine him shaking his cane at the conference presenters from the rocking chair on his front porch. Fun quotes: 'Asked if it wouldn’t make more sense to apply capital in rapidly developing countries such as Brazil and China, the speakers responded that being an angel was more about having fun than getting a good return on investment. (Not sure whose idea of “fun” included sitting in board meetings with frustrated entrepreneurs, but personally I would rather be flying a helicopter or going to the beach.)... 
'Nobody had thought about the question of whether Boston in fact needs more angel investors or venture capital. Nobody could point to an example of a good startup that had been unable to obtain funding. However, there were examples of startups, notably Facebook, that had moved to California because of superior access to capital and other resources out there... 'Nobody at the conference could answer a macro question: With the US private GDP shrinking, why do we need capital at all?'</p> <p><strong>Piaw</strong>: The GDP question is easily answered. Not all sectors are shrinking. For instance, Silicon Valley is growing dramatically right now. I wouldn't be able to help people negotiate 30% increases in compensation otherwise (well, more like 50% increases, depending on how you compute). The number of pre-IPO companies that are extremely profitable is also surprisingly high.</p> <p>And personally, I think that investing in places like China and Brazil is asking for trouble unless you are well attuned to the local culture, so whoever answered the question with &quot;it's fun&quot; is being an idiot.</p> <p>The fact that Facebook was asked by Accel to move to Palo Alto is definitely something Boston-area VCs should berate themselves about. But that &quot;forced move&quot; was very good for Facebook. By being in Palo Alto, they acquired Jeff Rothschild, Marc Kwiatkowski, Steve Grimm, Paul Buchheit, Sanjeev Singh, and many others who would not have moved to Boston for Facebook no matter what. It's not clear to me that staying in Boston would have been an optimal move for Facebook anyway. At least, not before things got dramatically better in Boston for startups.</p> <p><strong>Marya</strong>: &quot;The GDP question is easily answered. Not all sectors are shrinking. For instance, Silicon Valley is growing dramatically right now&quot;</p> <p>I'm guessing medical technology and biotech are still growing. What else?</p> <p>Someone pointed this out in the comments, and Philip addressed it; he argues that angel investors are unlikely to get a good return on their investment (partial quote): &quot;...we definitely need some sources of capital... But every part of the U.S. financial system, from venture capital right up through investment banks, is sized for an expanding private economy. That means it is oversized for the economy that we have. Which means that the returns to additional capital should be very small....&quot;</p> <p>He doesn't provide any supporting evidence, though.</p> <p><strong>Piaw</strong>: Social networks and social gaming are growing dramatically and fast.</p> <p><strong>Rebecca</strong>: Thanks, Marya, for pointing out Philip's blog post. I think the telling quote from it is this: &quot;What evidence is there that the Boston area has ever been a sustainable place for startups to flourish? When the skills necessary to build a computer were extremely rare, minicomputer makers were successful. As soon as the skills ... became more widespread, nearly all of the new companies started up in California, Texas, Seattle, etc. When building a functional Internet application required working at the state of the art, the Boston area was home to a lot of pioneering Internet companies, e.g., Lycos. As soon as it became possible for an average programmer to ... work effectively, Boston faded to insignificance.&quot; Philip is saying Boston can only compete when it can leverage skills that only it has. 
That's because its ability to handle business and commercialization are so comparatively terrible that when the technological skill becomes commoditized, other cities will do much better.</p> <p>But it does often get cutting-edge technical insight and skills first -- and then completely drops the ball on developing them. I find this frustrating. Now that I think about it, it seems like Boston's leaders are frustrated by this too. But I think they're making a mistake trying to remake Boston in Silicon Valley's image. If we tried to be you, at best we would be a pathetic shadow of you. We could only be successful by being ourselves, but getting better at it.</p> <p>There is a fundamental problem: the people at the cutting edge aren't interested in practical things, or they wouldn't be bothering with the cutting edge. Though it might seem strange to say now, the guy who set up the hundredth web server was quite an impractical intellectual. Who needs a web server when there are only 99 others (and no browsers yet, remember)? We were laughing at him, and he was protesting the worth of this endeavor merely out of a deep intellectual faith that this was the future, no matter how silly it seemed. Over and over I have seen the lonely obsessions of impractical intellectuals become practical in two or three years, become lucrative in five or eight, and become massive industries in seven to twelve years.</p> <p>So if the nascent idea that will become a huge industry in a dozen years shows up first in Boston, why can't we take advantage of it? The problem is that the people who hone their skill at nascent ideas that won't be truly lucrative for half a decade at least, are by definition impractical, too impractical to know how to take advantage of being first. But maybe Boston could become a winner if it could figure out how to pair these people up with practical types who could take advantage of the early warning about the shape of the future, and leverage the competitive advantage of access to skills no-one else has. It would take a very particular kind of practicality, different from the standard SV thing. Maybe I'm wrong, though; maybe the market just doesn't reward being first, especially if it means being on the bleeding edge of practicality. What do you think?</p> <p><strong>Piaw</strong>: Being 5 or 10 years ahead of your time is terrible. What you want to be is just 18 months or even 12 months ahead of your time, so you have just enough time to build product before the market explodes. My book covers this part as well. :-)</p> <p><strong>Marya</strong>: Rebecca, I don't know the Boston area well enough to form an opinion. I've been here two years, but I'm certainly not in the thick of things (if there is a &quot;thick&quot; to speak of, I haven't seen it). My guess would be that Boston doesn't have the population to be a huge center of anything, but that's a stab in the dark.</p> <p>Even so, this old survey (2004) says that Boston is #2 in biotech, close behind San Diego: <a href="http://www.forbes.com/2004/06/07/cz_kd_0607biotechclusters.html">http://www.forbes.com/2004/06/07/cz_kd_0607biotechclusters.html</a> So why is Boston so successful in biotech if the people here broadly lack an interest in business, or are &quot;impractical&quot;? 
(Here's a snippet from the article: &quot;...When the most successful San Diego biotech company, IDEC Pharmaceuticals, merged with Biogen last year to become Biogen Idec (nasdaq: BIIB - news - people ), it officially moved its headquarters to Biogen's hometown of Cambridge, Mass.&quot; Take that, San Diego!)</p> <p>When you talk about a certain type of person being &quot;impractical&quot;, I don't think that's really the issue. Such people can be very practical when it comes to pursuing their own particular kind of ambition. But their interests may not lie in the commercialization of an idea. Some extremely intelligent, highly skilled people just don't care about money and commerce, and may even despise them.</p> <p>Even with all that, I find it hard to believe that the intelligentsia of New England are so much more cerebral than their cousins in Silicon Valley. There's certainly a puritan ethic in New England, but I don't think that drives the business culture.</p> <p><strong>Rebecca</strong>: Marya, thanks for pointing out to me I wasn't being clear (I'm kind of practicing explaining something on you that I might try to say more formally later, hence the spam of your comment field. I hope you don't mind.) Your question &quot;why is Boston so successful in biotech if the people here broadly lack an interest in business?&quot; made me realize I'm not talking about people broadly -- there are plenty of business people in Boston, as everywhere. I'm talking about a particular kind of person, or even more specifically, a particular kind of relationship. Remember I contrasted the reports of Rajeev Motwani's treatment of the Google guys with the MIT CS lab's treatment of Philip? In general, I am saying that a university town like Palo Alto or Cambridge will be a magnet for ultra-ambitious young people who look for help realizing their ambitions, and a group of adults who are looking to attract such young people and enable those ambitions, and there is a characteristic relationship between them with (perhaps unspoken) terms and expectations. The idea I'm really dancing around is that these terms &amp; expectations are very different at MIT than (I've heard) they are at Stanford. Though there may not be very many people total directly involved in this relationship, it will still determine a great deal of what the city can and can't accomplish, because it is a combination of the energy of very ambitious young people and the mentorship of experienced adults that makes big things possible.</p> <p>My impression is that the most ambitious people at Stanford dream of starting the next big internet company, and if they show enough energy and talent, they will impress professors who will then open their Rolodex and tell their network of VCs &quot;this kid will make you tons of money if you support his work.&quot; The VCs who know that this professor has been right many times before will trust this judgement. 
So kids with this kind of dream go to Stanford and work to impress their professors in a particular kind of way, because it puts them on a fast track to a particular kind of success.</p> <p>The ambitious students most cultivated by professors in Boston have a different kind of dream: they might dream of cracking strong AI, or discovering the essential properties of programming languages that will enable fault-tolerant or parallel programming, or really understanding the calculus of lambda calculus, or revolutionizing personal genomics, or building the foundations of Bladerunner-style synthetic biology. If professors are sufficiently impressed with their student's energy and talent, they will open their Rolodex of program managers at DARPA (and NSF and NIH), and tell them &quot;what this kid is doing isn't practical or lucrative now, nor will it be for many years to come, but nonetheless it is critical for the future economic and military competitiveness of the US that this work is supported.&quot; The program managers who know that this professor has been right many times before will trust this judgment. In this way, the kid is put on a fast track to success -- but it is a very different kind of success than the Stanford kid was looking for, and a different kind of kid who will fight to get onto this track. The meaning of success is very different, much more intellectual and much less practical, at least in the short term.</p> <p>That's what I mean when I say &quot;Boston&quot; is less interested in business, more impractical, less entrepreneurial. It isn't that there aren't plenty of people here who have these qualities. But the &quot;ecosystem&quot; that gives ultra-ambitious young people the chance to do something singular which could be done nowhere else -- an ecosystem which it does have, but in a very different kind of way -- doesn't foster skill at commercialization or an interest in the immediate practical application of technology.</p> <p>Maybe there is nothing wrong with that: Boston's ecosystem just fosters a different kind of achievement. However, I can see it is frustrating to the mayor of Boston, because the young people whose ambitions are enabled by Boston's ecosystem may be doing work crucial to the economic and military competitiveness of the US in the long term, but they might not help the economy of Boston very much! What often happens in the &quot;long term&quot; is that the work supported by grants in Boston develops to the point it becomes practical and lucrative, and then it gets commercialized in California, Seattle, New York, etc... The program managers at DARPA who funded the work are perfectly happy with this outcome, but I can imagine that the mayor of Boston is not! The kid also might not be 100% happy with this deal, because the success which he is offered isn't much like SV success -- it's a fantastic amount of work, rather hermit-like and self-abnegating, which mostly ends up making it possible for other people far away to get very, very rich using the results of his labors. At best he sees only a minuscule slice of the wealth he enabled.</p> <p>What one might want instead is that the professors in Boston have two sections in their Rolodex. The first section has the names of all the relevant program managers at DARPA, and the professor flips to this section first. 
The second section has the names of suitable cofounders and friendly investors, and after the student has slaved away for five to seven years making a technology practical, the professor flips to the second section and sets the student up a second time to be the chief scientist or something like that at an appropriate startup.</p> <p>And it's not like this doesn't happen. It does happen. But it doesn't happen as much as it could, and I think the reason why it doesn't may be that it just takes a lot of work to maintain a really good Rolodex. These professors are busy and they just don't have enough energy to be the linchpin of a really top-quality ecosystem in two different ways at the same time.</p> <p>If the mayor of Boston is upset that Boston is economically getting the short end of the stick in this whole deal (which I think it is), a practical thing he could do is give these professors some help in beefing up the second section of their Rolodex, or perhaps try to build another network of mentors which was Rolodex-enabled in the appropriate way. If he took the latter route, he should understand that this second network shouldn't try to be a clone of the similar thing at Stanford (because at best it would only be a pale shadow) but instead be particularly tailored to incorporating the DARPA-project graduates that are unique to Boston's ecosystem. That way he could make Boston a center of entrepreneurship in a way that was uniquely its own and not merely a wannabe version of something else -- which it would inevitably do badly. That's what I meant when I said Boston should be itself better, rather than trying to be a poor pale copy of Silicon Valley.</p> <p><strong>Piaw</strong>: I like that line of thought, Rebecca. Here's the counter-example: Facebook. Facebook clearly was interested in monetizing something that was very developed, and in fact had been tried and failed many times because the timing wasn't right. Yet Facebook had to go to Palo Alto to get funding. So the business culture has to change sufficiently that the people with money are willing to risk it on very high-risk ventures like the Facebook that was around 4 years ago.</p> <p>Having invested my own money in startups, I find that it's definitely something very challenging. It takes a lot to convince yourself that this risk is worth taking, even if it's a relatively small portion of your portfolio. To get enough people to build critical mass, you have to have enough success in prior ventures to gain the kind of confidence that lets you fund Facebook where it was 4 years ago. I don't think I would have been able to fund Google or Facebook at the seed stage, and I've lived in the valley and worked at startups my entire career, so if anyone would be comfortable with risk, it should be me.</p> <p><strong>Dan</strong>: Rebecca: a side note on &quot;opening a rolodex for DARPA&quot;. It doesn't really work quite like that. It's more like &quot;hey, kid, you should go to grad school&quot; and you write letters of recommendation to get the kid into a top school. You, of course, steer the kid to a research group where you feel he or she will do awesome work, by whatever biased idea of awesomeness you have.</p> <p>My own professorial take: if one of my undergrads says &quot;I want to go to grad school&quot;, then I do as above. If he or she says &quot;I want to go work for a cool startup&quot;, then I bust out the VC contacts in my rolodex.</p> <p><strong>Rebecca</strong>: Dan: I know. 
I was oversimplifying for dramatic effect, just because qualifying it would have made my story longer, and it was already pushing the limits of the reasonable length for a comment. Of course the SV version of the story isn't that simple either.</p> <p>I have seen it happen that sufficiently brilliant undergraduates (and even high school students -- some amazing prodigies show up at MIT) can get direct support. But realize also I'm really talking about grad students -- after all, my comparison is with the relationship between the Google guys and Rajeev Motwani, which happened when they were graduate students. The exercise was to compare the opportunities they encountered with the opportunities similarly brilliant, energetic and ultra-ambitious students at MIT would have access to, and talk about how it would be similar and different. Maybe I shouldn't have called such people &quot;kids,&quot; but it simplified and shortened my story, which was pushing its length limit anyway. Thanks for the feedback; I'm testing out this story on you, and it's useful to know which ways of saying things work and which don't.</p> <p><strong>Rebecca</strong>: Piaw: I understand that investing in startups as an individual is very scary. I know some Boston angels (personally more than professionally) and I hear stories about how cautious their angel groups are. I should explain some context: the Boston city government recently announced a big initiative to support startups in Boston, and renovate some land opened up by the Big Dig next to some decaying seaport buildings to create a new Innovation District. I was thinking about what they could do to make that kind of initiative a success rather than a painful embarrassment (which it could easily become). So I was thinking about the investment priorities of city governments, more than individual investors like you.</p> <p>Cities invest in all sorts of crazy things, like Olympic stadiums, for instance, that lose money horrifyingly ... but when you remember that the city collects 6% hotel tax on every extra visitor, and benefits from extra publicity, and collects extra property tax when new people move to the city, it suddenly doesn't look so bad anymore. Boston is losing out because there is a gap in the funding of technology between when DARPA stops funding something, because it is developed to the point where it is commercializable, and when the cautious Boston angels will start funding something -- and other states step into the gap and get rich off of the product of Massachusetts' tax dollars. That can't make the local government too happy.</p> <p>Maybe the Boston city or state government has an incentive to do something to plug that hole. They might be more tolerant of losing money directly because even a modestly lucrative venture, or one very, very slow to generate big returns, which nonetheless successfully drew talent to the city, would make them money in hotel &amp; property tax, publicity etc. etc. -- or just not losing the huge investment they have already made in their universities! I briefly worked for someone who was funded by Boston Community Capital, an organization which, I think, divided its energies between developing low-income housing and funding selected startups that were deemed socially redeeming for Boston. When half your portfolio is low-income housing, you might have a different outlook on risk and return! I was hugely impressed by what great investors they were -- generous, helpful &amp; patient. 
Patience is necessary for us because the young prodigies in Boston go into fields whose time horizon is so long -- my friends are working on synthetic biology, but it will be a long, long time before you can buy a Bladerunner-style snake!</p> <p>Again, thanks for the feedback. You are helping me understand what I am not making clear.</p> <p><strong>Marya</strong>: Rebecca, you said The idea I'm really dancing around is that these terms &amp; expectations are very different at MIT than (I've heard) they are at Stanford</p> <p>I read your initial comments as being about general business conditions for startups in Boston. But now I think you're mainly talking about internet startups or at least startups that are based around work in computer science. You're saying MIT's computer science department in particular does a poor job of pointing students in an entrepreneurial direction, because they are too oriented towards academic topics.</p> <p>Both MIT and Stanford have top computer science and business school rankings. Maybe the problem is that Stanford's business school is more inclined to &quot;mine&quot; the computer science department than MIT's?</p> <p><strong>Doug</strong>: Rebecca, your description of MIT vs. Stanford sounds right to me (though I don't know Stanford well). What's interesting is that I remember UC Berkeley as being very similar to how you describe MIT: the brightest/most ambitious students at Cal ended up working on BSD or Postgres or The Gimp or Gnutella, rather than going commercial. Well, I haven't kept up with Berkeley since the mid-90s, but have there been any significant startups there since Berkeley Softworks?</p> <p><strong>Piaw</strong>: Doug: Inktomi. It was very significant for its time.</p> <p><strong>Dan</strong>: John Ousterhout built a company around Tcl. Eric Allman built a company around sendmail. Mike Stonebreaker did Ingres, but that was old news by the time the Internet boom started. Margo Seltzer built a company around Berkeley DB. None of them were Berkeley undergrads, though Seltzer was a grad student. Insik Rhee did a bunch of Internet-ish startup companies, but none of them had the visibility of something like Google or Yahoo.</p> <p><strong>Rebecca</strong>: Dan: I was thinking more about what you said about not involving undergraduates, but instead telling them to go to grad school. Sometimes MIT is in the nice sedate academic mode which steers undergrads to the appropriate research group when they are ready to work on their PhD. But sometimes it isn't. Let me tell you more about the story of the scene in the computer club concerning installation of the first web server. It was about the 100th web server anywhere, and its maintainer accosted me with an absurd chart &quot;proving&quot; the exponential growth of the web -- i.e. a graph going exponentially from 0 to 100ish, which he extrapolated forward in time to over a million -- you know the standard completely bogus argument -- except this one was exceptionally audacious in its absurdity. Yet he argued for it with such intensity and conviction, as if he was saying that this graph should convince me to drop everything and work on nothing but building the Internet, because it was the only thing that mattered!</p> <p>I fended him off with the biggest stick I could find: I was determined to get my money's worth for my education, do my psets, get good grades (I cared back then), and there is no way I would let that be hurt by this insane Internet obsession. But it continued like that. 
The Internet crowd only grew with time, and they got more insistent that they were working on the only thing that mattered and I should drop everything and join them. That I was an undergraduate did not matter a bit to anyone. Undergrads were involved, grad students were involved, everyone was involved. It wasn't just a research project; eventually so many different research projects blended together that it became a mass obsession of an entire community, a total &quot;Be Involved or Be Square&quot; kind of thing. I'd love to say that I did get involved. But I didn't; I simply sat in the office on the couch and did psets, proving theorems and solving Schrödinger's equation, and fended them off with the biggest stick I could find. I was determined to get a Real Education, to get my money's worth at MIT, you know.</p> <p>My point is that when the MIT ecosystem really does its thing, it is capable of tackling projects that are much bigger than ordinary research projects, because it can get a critical mass of research projects working together, involving enough grad students and also sucking in undergrads and everyone else, so that the community ends up with an emotional energy and cohesion that goes way, way beyond the normal energy of a grad student trying to finish a PhD.</p> <p>There's something else too, though I cannot report on this with that much certainty, because I was too young to see it all at the time. You might ask: if MIT had this kind of emotional energy focused on something in the 90's, then what is it doing in a similar way now? And the answer I'd have to say, painfully, is that it is frustrated and miserable about being an empty shell of what it once was.</p> <p>Why? Because in 2000 Bush got elected and he killed the version of DARPA with which so many professors had had such a long relationship. I didn't understand this in the 90's -- like a kid I took the things that were happening around me for granted without seeing the funding that made them possible -- but now I see that the kind of emotional energy expended by the Internet crowd at MIT in the 90's costs a lot of money, and needs an intelligent force behind it, and that scale of money and planning can only come from the military, not from NSF.</p> <p>More recently I've watched professors who clearly feel it is their birthright to be able to mobilize lots of students to do really large-scale projects, but then they try to find money for it out of NSF, and they spend all their time killing themselves writing grant proposals, never getting enough money to make themselves happy, and complaining about the cowardice of academia, and wishing they could still work with their old friends at DARPA. They aren't happy, because they are merely doing big successful research projects, and a mere research project isn't enough... when MIT is really MIT it can do more. It is an empty shell of itself when it is merely a collection of merely successful but not cohesive NSF-funded research projects. As I was saying, the Boston &quot;ecosystem&quot; has in itself the ability to do something singular, but it is singular in an entirely different way than SV's thing.</p> <p>This may seem obscure, a tale of funding woes at a distant university, but perhaps it is something you should be aware of, because maybe it affects your life. 
The reason you should care is that when MIT was fully funded and really itself, it was building the foundations of the things that are now making you rich.</p> <p>One might think of the relationship between technology and wealth like a story about potential energy: when you talk about finding a &quot;product/market fit&quot;, it's like pushing a big stone up a hill, until you get the &quot;fit&quot; at the top of the hill, and then the stone rolls down and the energy you put into it spins out and generates lots of money. In SV you focus on pushing stones up short hills -- like Piaw said, no more than 12-18 months of pushing before the &quot;fit&quot; happens.</p> <p>But MIT in its golden age could tackle much, much bigger hills -- the whole community could focus itself on ten years of nothing but pushing a really big stone up a really big hill. The potential energy that the obsessed Internet Crowd in the 90's was pushing into the system has been playing out in your life ever since. They got a really big stone over a really big hill and sent it down onto you, and then you pushed it over little bumps on the way down, and made lots of money doing it, and you thought the potential energy you were profiting from came entirely from yourselves. Some of it was, certainly, but not all. Some of it was from us. If we aren't working on pushing up another such stone, if we can't send something else over a huge hill to crash into you, then the future might not be like the past for you. Be worried.</p> <p>So you might ask, how did this story end? If I'm claiming that there was intense emotional energy being poured into developing the Internet at MIT in the 90's, why didn't those same people fan out and create the Internet industry in Boston? If we were once such winners, how did we turn into such losers? What happened to this energetic, cohesive group?</p> <p>I can tell you about this, because after years of fending off the emotional gravitational pull of this obsession, towards the end I began to relent. First I said &quot;No way!&quot; and then I said &quot;No!&quot; and then I said &quot;Maybe Later,&quot; and then I said &quot;OK, Definitely Later&quot;... and then, when I finally got around to Later, (perhaps the standard story of my life) Later turned out to be Too Late. By 2000 I was ready to join the crowd and remake myself as an Internet Person in the MIT style. So I ended up becoming seriously involved just at the time it fell apart. Because 2000ish, almost the beginning of the Internet Era for you, was the end for us.</p> <p>This weekend I was thinking of how to tell this story, and I was composing it in my head in a comic style, thinking to tell a story of myself as a &quot;Parable of the Boston Loser&quot; to talk about all my absurd mistakes as a microcosm of the difficulties of a whole city. I can pick on myself, can't I? No one will get upset at that. The short story is that in 2000ish the Internet crowd had achieved their product/market fit, DARPA popped the champagne -- you won, guys! Congratulations! Now go forth and commercialize! -- and pushed us out of the nest into the big world to tackle the standard tasks of commercializing a technology -- the tasks that you guys can do in your sleep. I was there, right in the middle of things, during that transition. 
I thought to tell you a comic story about the absurdity of my efforts in that direction, and make you laugh at me.</p> <p>But when I was trying to figure out how to explain what was making it so terribly hard for me, to my great surprise I was suddenly crying really hard. All Saturday night I was thinking about it and crying. I had repressed the memory, decided I didn't care that much -- but really it was too terrible to face. All the things you can do without thinking, for us hurt terribly. The declaration of victory, the &quot;achievement of product/market fit&quot;, the thing you long for more than anything, I -- and I think many of the people I knew -- experienced as a massive trauma. This is maybe why I've reacted so vehemently and spammed your comment field, because I have big repressed personal trauma about all this. I realized I had a much more earnest story to tell than I had previously planned.</p> <p>For instance, I was reflecting on my previous comment about what cities spend money on, and thinking that I sounded like the biggest jerk ever. Was I seriously suggesting that the city take money that they would have spent on housing for poor black babies and instead spend it on overeducated white kids with plenty of other prodigiously lucrative economic opportunities? Where do I get off suggesting something like that? If I really mean it I have a big, big burden of proof.</p> <p>So I'll try to combine my more earnest story with at least a sketch of how I'd tackle this burden of proof (and try to keep it short, to keep the spam factor to a minimum. The javascript is getting slow, so I'll cut this here and continue.)</p> <p><strong>Ruchira</strong>: Interlude (hope Rebecca continues soon!): Rebecca says &quot;that scale of money and planning can only come from the military, not from NSF.&quot; Indeed, it may be useful to check out this NY Times infographic of the federal budget: <a href="http://www.nytimes.com/interactive/2010/02/01/us/budget.html">http://www.nytimes.com/interactive/2010/02/01/us/budget.html</a></p> <p>I'll cite below some of the 2011 figures from this graphic that were proposed at that time; although these may have changed, the relative magnitudes of one sector versus another are not very different. I've mostly listed sectors in decreasing order of budget size for research, except I listed &quot;General science &amp; technology&quot; sector (which includes NSF) before &quot;Health&quot; sector (which includes NIH) since Rebecca had contrasted the military with NSF.</p> <p>The &quot;Research, development, test, and evaluation&quot; segment of the &quot;National Defense&quot; sector is $76.77B. I guess DARPA, ONR, etc. fit there.</p> <p>The &quot;General science &amp; technology&quot; sector is down near the lower right. The &quot;National Science Foundation programs&quot; segment gets $7.36B. There's also another $0.1B for &quot;National Science Foundation and other&quot;. The &quot;Science, exploration, and NASA supporting activities&quot; segment gets $12.78B. (I don't know to what extent satellite technology that is relevant to the national defense is also involved here, or in the $4.89B &quot;Space operations&quot; segment, or in the $0.18B &quot;NASA Inspector General, education, and other&quot; segment.) The &quot;Department of Energy science programs&quot; segment gets $5.12B. 
The &quot;Department of Homeland Security science and technology programs&quot; segment gets $1.02B.</p> <p>In the &quot;Health&quot; sector, the &quot;National Institutes of Health&quot; segment gets $32.09B. The &quot;Disease control, research, and training&quot; segment gets $6.13B (presumably this includes the CDC). There's also &quot;Other health research and training&quot; at $0.14B and &quot;Diabetes research and other&quot; at $0.095B.</p> <p>In the &quot;Natural resources and environment&quot; sector, the &quot;National Oceanic and Atmospheric Administration&quot; gets $5.66B. &quot;Regulatory, enforcement, and research programs&quot; gets $3.86B (is this the entire EPA?).</p> <p>In the &quot;Community and regional development&quot; sector, the &quot;National Infrastructure Innovation and Finance fund&quot; (new this year) gets $4B.</p> <p>In the &quot;Agriculture&quot; sector, which presumably includes USDA-funded research, &quot;Research and education programs&quot; gets $1.97B, &quot;Research and statistical analysis&quot; gets $0.25B, and &quot;Integrated research, education, and extension programs&quot; gets $0.025B.</p> <p>In the &quot;Transportation&quot; sector, &quot;Aeronautical research and technology&quot; gets $1.15B, which by the way would be a large (130%) relative increase. (Didn't MIT find a way of increasing jet fuel efficiency by 75% recently?)</p> <p>In the &quot;Commerce and housing credit&quot; sector, &quot;Science and technology&quot; gets $0.94B. I find this rather mysterious.</p> <p>In the &quot;Education, training, employment&quot; sector, &quot;Research and general education aids: Other&quot; gets $1.14B. The &quot;Institute for Education Sciences&quot; gets $0.74B.</p> <p>In the &quot;Energy&quot; sector, &quot;Nuclear energy R&amp;D&quot; gets $0.82B and &quot;Research and development&quot; gets $0.024B (presumably this is the portion outside the DoE).</p> <p>In the &quot;Veterans' benefits and services&quot; sector, &quot;Medical and prosthetic research&quot; gets $0.59B.</p> <p>In the &quot;Income Security&quot; sector there's a tiny segment, &quot;Children's research and technical assistance&quot;, at $0.052B. Not sure what that means.</p> <p><strong>Rebecca</strong>: I'll start with a non-sequitur which I hope to use to get at the heart of the difference between MIT and Stanford: recently I was at a Marine publicity event and I asked the recruiter what differentiates the Army from the Marines: since they both train soldiers to fight, why don't they do it together? He answered vehemently that they must be separate because of one simple attribute in which they are utterly opposed: how they think about the effect they want to have on the life their recruits have after they retire from the service. He characterized the Army as an organization which had two goals: first, to train good soldiers, and second, to give them skills that would get them a good start in the life they would have after they left. If you want to be a Senator, you might get your start in the Army, get connections, get job skills, have &quot;honorable service&quot; on your resume, and generally use it to start your climb up the ladder. The Army aspires to create a legacy of winners who began their career in the Army.</p> <p>By contrast the Marines, he said, have only one goal: they want to create the very best soldiers, the elite, the soldiers they can trust in the most difficult and dangerous situations to keep the Army guys behind them alive.
This elite training, he said, comes with a price. The price you pay is that the training you get does not prepare you for anything at all in the civilian world. You can be the best of the best in the Marines, and then come home and discover that you have no salable civilian job skills, that you are nearly unemployable, that you have to start all over again at the bottom of the ladder. And starting over is a lot harder than starting the first time. It can be a huge trauma. It is legendary that Marines do not come back to civilian life and turn into winners: instead they often self-destruct -- the &quot;transition to civilian life&quot; can be violently hard for them.</p> <p>He said this calmly and without apology. Did I say he was a recruiter? He said vehemently: &quot;I will not try to recruit you! I want you to understand everything about how painful a price you will pay to be a Marine. I will tell you straight out it probably isn't for you! The only reason you could possibly want it is because you want more than anything to be a soldier, and not just to be a soldier, but to be in the elite, the best of the best.&quot; He was saying: we don't help our alumni get started, we set them up to self-destruct, and we will not apologize for it -- it is merely the price you pay for training the elite!</p> <p>This story gets to the heart of what I am trying to say is the essential difference between Stanford and MIT. Stanford is like the Army: for its best students, it has two goals -- to make them engineers, and to make them winners after they leave. And MIT is like the Marines: it has only one goal -- to make its very best students into the engineering elite, the people about whom they can truthfully tell program managers at DARPA: you can utterly trust these engineers with the future of America's economic and military competitiveness. There is a strange property to the training you get to enter into that elite, much like the strange property the non-recruiter attributed to the training of the Marines: even though it is extremely rigorous training, once you leave you can find yourself utterly without any salable skills whatever.</p> <p>The skills you need to acquire to build the infrastructure ten years ahead of the market's demand for it may have zero intersection with the skills in demand in the commercial world. Not only are you not prepared to be a winner, you may not even be prepared to be basically employable. You leave and start again at the bottom. Worse than the bottom: you may have been trained with habits commercial entities find objectionable (like a visceral unwillingness to push pointers quickly, or a regrettable tendency to fight with the boss before the interview process is even over.) This can be fantastically traumatic. Much as ex-Marines suffer a difficult &quot;transition to civilian life,&quot; the chosen children of MIT suffer a traumatic &quot;transition to commercial life.&quot; And the leaders at MIT do not apologize for this: as the Marine said, it is just the price you pay for training the elite.</p> <p>These are the general grounds I might use to appeal to the city officials in Boston.
There's more to explain, but the shape of the idea would be roughly this: much as cities often pay for programs to help ex-Marines transition to civilian life, on the principle that they represent valuable human capital that ought not to be allowed to self-destruct, it might pay off for the city to understand the peculiar predicament of graduates of MIT's intense DARPA projects, and provide them with help with the &quot;transition to commercial life.&quot; There's something in it for them! Even though people who know nothing but how to think about the infrastructure of the next decade aren't generically commercially valuable, if you put them in proximity to normal business people, their perspective would rub off in a useful way. That's the way that Boston could have catalyzed an Internet industry of its own -- not by expecting MIT students to commercialize their work, which (with the possible exception of Philip) they were constitutionally incapable of, but by giving people who wanted to commercialize something but didn't know what a chance to learn from the accumulated (nearly ten years!) experience and expertise of the Internet Crowd.</p> <p>On that note, I wanted to say -- funny you should mention Facebook. You think of Mark Zuckerberg as the social networking visionary in Boston, and Boston could have won if they had paid to keep him. I think that strange -- Zuckerberg is fundamentally one of you, not one of us. It was right he should leave. But I'll ask you a question you've probably never thought about. Suppose the Internet had not broken into the public consciousness at the time it did; suppose the world had willfully ignored it for a few more years, so the transition from a DARPA-funded research project to a commercial proposition would have happened a few years later. There was an Internet Crowd at MIT constantly asking DARPA to let them build the &quot;next thing,&quot; where &quot;next&quot; is defined as &quot;what the market will discover it wants ten years from now.&quot; So if this crowd had gotten a few more years of government support, what would they have built?</p> <p>I'm pretty sure it would have been a social networking infrastructure, not like Facebook, really, but more like the Diaspora proposal. I'm not sure, but I remember in '98/'99 that's what all the emotional energy was pointing toward. It wasn't technically possible to build yet, but the instant it was that's what people wanted. I think it strange that everyone is talking about social networking and how it should be designed now; it feels to me like deja vu all over again, an echo from a decade ago. If the city or state had picked up these people after DARPA dropped them, and given them just a little more time, a bit more government support -- say by a Mass ARPA -- they could have made Boston the home, not of the big social networking company, but of the open social networking infrastructure and all the expertise and little industries such a thing would have thrown off. And it would have started years and years ago! That's how Boston could have become a leader by being itself better, rather than trying to be you badly.</p> <p><strong>Dan</strong>: I think you're perhaps overstating the impact of DARPA. DARPA, by and large, funds two kinds of university activities. First, it funds professors, which pays for post-docs, grad students, and sometimes full-time research staff.
Second, DARPA also funds groups that have relatively little to do with academia, such as the BSD effort at Berkeley (although I don't know for a fact that they had DARPA money, they didn't do &quot;publish or perish&quot; academic research; they produced Berkeley Unix).</p> <p>Undergrads at a place like MIT got an impressive immersion in computer science, with a rigor and verve that wasn't available in most other places (although Berkeley basically cloned 6.001, and others did as well). They call it &quot;drinking from a firehose&quot; for a reason. MIT, Berkeley, and other big schools of the late 80's and early 90's had more CS students than they knew what to do with, so they cranked up the difficulty of the major and produced very strong students, while others left for easier pursuits.</p> <p>The key inflection point is how the popular culture at the university, and the faculty, treat their &quot;rock star&quot; students. What are the expectations? At MIT, it's that you go to grad school, get a PhD, become a researcher. At Stanford, it's that you run off and get rich.</p> <p>The decline in DARPA funding (or, more precisely, the micromanagement and short-term thinking) in recent years can perhaps be attributed to the leadership of Tony Tether. He's now gone, and the &quot;new DARPA&quot; is very much planning to come back in a big way. We'll see how it goes.</p> <p>One last point: I don't buy the Army vs. Marines analogy. MIT and Stanford train students similarly, in terms of their preparation to go out and make money, and large numbers of MIT people are quite successfully out there making money. MIT had no lack of companies spinning out of research there, notably including Akamai. The differences we're talking about here are not night vs. day, they're not Army vs. Marines. They're more subtle but still significant.</p> <p><strong>Rebecca</strong>: Yes, I've been hearing about the &quot;unTethered Darpa.&quot; I should have mentioned that, but left it out to stay (vaguely) short. And yes, I am overstating in order to make a simple statement of what I might be asking for, couched in terms a city or state government official might be able to relate to. Maybe that's irresponsible; that's why I'm testing it on you first, to give you a chance to yell at me and tell me if you think that's so.</p> <p>They are casting about for a narrative of why Boston ceded its role as a leader of the Internet industry to SV, one that would point them to something to do about it. So I was talking specifically about the sense in which Boston was once a leader in internet technology and the weaknesses that might have caused it to lose its lead. Paul Graham says that Boston's weakness in developing industries is that it is &quot;too good&quot; at other things, so I wanted to tell a dramatized story specifically about what the other things were and why that would lead to fatal weakness -- how being &quot;too strong&quot; in a particular way can also make you weak.</p> <p>I certainly am overstating, but perhaps I am because I am trying to exert force against another predilection I find pernicious: the tendency to be eternally vague about the internal emotional logic that makes things happen in the world.
If people build a competent, cohesive, energetic community, and then it suddenly fizzles, fails to achieve its potential, and disbands, it might be important to know what weakness caused this surprising outcome so you know how to ask for the help that would keep it from happening the next time.</p> <p>And to tell the truth, I'm not sure I entirely trust your objection. I've wondered why so often I hear such weak, vague narratives about the internal emotional logic that causes things to happen in the world. Vague narratives make you helpless to solve problems! I don't cling to the right to overstate things, but I do cling to the right to sleuth out the emotional logic of cause and effect that drives the world around me. I feel sometimes that I am fighting some force that wants to thwart me in that goal -- and I suspect that that force sometimes originates, not always in rationality, but in a male tendency to not want to admit to weakness just for the sake of &quot;seeming strong.&quot; A facade of strength can exact a high price in the currency of the real competence of the world, since often the most important action that actually makes the world better is the action of asking for help. I was really impressed with that Marine for being willing to admit to the price he paid, to the trauma he faced. That guy didn't need to fake strength! So maybe I am holding out the image of him as an example. We have government officials who are actively going out of their way to offer to help us; we have a community that accomplishes many of its greatest achievements because of government support; we shouldn't squander an opportunity to ask for what might help us. And this narrative might be wrong; that's why I'm testing it first. I'm open to criticism. But I don't want to pass by an opportunity, an opening to ask for help from someone who is offering it, merely because I'm too timid to say anything for fear of overstatement.</p> <p><strong>Dan</strong>: Certainly, Boston's biggest strength is the huge number of universities in and around the area. Nowhere else in the country comes close. And, unsurprisingly, there are a large number of high-tech companies in and around Boston. Another MIT spin-out I forgot to mention above is iRobot, the Roomba people, which also does a variety of military robots.</p> <p>To the extent that Boston &quot;lost&quot; the Internet revolution to Silicon Valley, consider the founding of Netscape. A few guys from Illinois and one from Kansas. They could well have gone anywhere. (Simplifying the story, but) they hooked up with an angel investor (Jim Clark) and he dragged them out to the valley where they promptly hired a bunch of ex-SGI talent and hit the ground running. Could they have gone to Boston? Sure. But they didn't.</p> <p>What seems to be happening is that different cities are developing their own specialties and that's where people go. Dallas, for example, has carved out a niche in telecom, and all the big players (Nortel, Alcatel, Cisco, etc.) do telecom work there. In Houston, needless to say, it's all about oilfield engineering. It's not that there's any particular Houston tax advantage or city/state funding that brings these companies here. Rather, the whole industry (or, at least the white collar part of it) is in Houston, and many of the big refineries are close nearby (but far enough away that you don't smell them).</p> <p>Greater Boston, historically, was where the minicomputer companies were, notably DEC and Data General.
Their whole world got nuked by workstations and PCs. DEC is now a vanishing part of HP and DG is now a vanishing part of EMC. The question is what sort of thing the greater Boston area will become a magnet for, in the future, and how you can use whatever leverage you've got to help make it happen. Certainly, there's no lack of smart talent graduating from Boston-area universities. The question is whether you can incentivize them to stay put.</p> <p>I'd suggest that you could make headway by getting cheap office space in and around Cambridge (an &quot;incubator&quot;) plus building a local pot of VC money. I don't think you can decide, in advance, what you want the city's specialty to be. You pretty much just have to hope that it evolves organically. And, once you see a trend emerging, you might want to take financial steps to reinforce it.</p> <p><strong>Thomas</strong>: BBN (which does DARPA funded research) has long been considered a halfway house between MIT and the real world.</p> <p><strong>Piaw</strong>: It looks like there's another conversation about this thread over at Hacker News: <a href="http://news.ycombinator.com/item?id=1416348">http://news.ycombinator.com/item?id=1416348</a> I love conversation fragmentation.</p> <p><strong>Doug</strong>: Conversation fragmentation can be annoying, but do you really want all those Hacker News readers posting on this thread?</p> <p><strong>Piaw</strong>: Why not? Then I don't have to track things in two places.</p> <p><strong>Ruchira</strong>: hga over at Hacker News says: &quot;Self-selection by applicants is so strong (MIT survived for a dozen years without a professional as the Director), whatever gloss the Office is now putting on the Institute, it's able to change things only so much. E.g. MIT remains a place where you don't graduate without taking (or placing out of) a year of the calculus and classical physics (taught at MIT speed), for all majors.&quot;</p> <p>Well, the requirements for all majors at Caltech are: two years of calculus, two years of physics (including quantum physics), a year of chemistry, and a year of biology (the biology requirement was added after I went there); freshman chemistry lab and another introductory lab; and a total of four years of humanities and social sciences classes. The main incubator I know of near Caltech is the Idealab. Certainly JPL (the Jet Propulsion Laboratory) as well as Hollywood CGI and animation have drawn from the ranks of Caltech grads. The size of the Caltech freshman class is also much smaller than those at Stanford or MIT.</p> <p>I don't know enough to gauge the relative success of Caltech grads at transitioning to local industry, versus Stanford or MIT; does anyone else?</p> <p><strong>Rebecca</strong>: The comments are teaching me what I didn't make clear, and this is one of the worst ones. When I talked about the &quot;transition to the commercial world&quot; I didn't mainly mean grads transitioning to industry. I was thinking more about the transition that a project goes through when it achieves product/market fit.</p> <p>This might not be something that you think of as such a big deal, because when companies embark on projects, they usually start with a fairly specific plan of the market they mean to tackle and what they mean to do if and when the market does adopt their product. There is no difficult transition because they were planning for it all along. After all, that's the whole point of a company! But a ten year research project has no such plan.
The web server enthusiast did not know when the market would adopt his &quot;product&quot; -- remember, browsers were still primitive then -- nor did he really know what it would look like when they did. Some projects are even longer term than that: a programming language professor said that the expected time from the conception of a new programming language idea to its widespread adoption is thirty years. That's a good chunk of a lifetime.</p> <p>When you've spent a good bit of your life involved with something as a research project that no-one besides your small crowd cares about, when people do notice, when commercial opportunities show up, when money starts pouring out of the sky, it's a huge shock! You haven't planned for it at all. Have you heard Philip's story of how he got his first contract for what became ArsDigita? I couldn't find the story exactly, but it was something like this: he had posted some of the code for his forum software online, and HP called him up and asked him to install and configure it for them. He said &quot;No! I'm busy! Go away!&quot; They said &quot;we'll pay you $100,000.&quot; He's in shock: “You'll give me $100,000 for 2 weeks of work?”</p> <p>He wasn't exactly planning for money to start raining down out of the sky. When he started doing internet applications, he said, people had told him he was crazy, there was no future in it. I remember when I first started seeing URL's in ads on the side of buses, and I was just bowled over -- all the time my friends had been doing web stuff, I had never really believed any of it would ever be adopted. URL's are just so geeky, after all! I mean, seriously, if some wild-eyed nerd told you that in five years people would print &quot;http://&quot; on the side of a bus, what would you think? I paid attention to what they were doing because they thought it was cool, I thought it was cool, and the fact that I had no real faith anyone else ever would made no difference. So when the world actually did, we entered a new world that none of us were prepared for, that nobody had planned for, that we had not given any thought to developing skills to be able to deal with. I guess this is a little hard to convey, because it wouldn't happen in a company. You wouldn't ever do something just because you thought it was cool, without any faith that anyone would ever agree with you, and then get completely caught by surprise, completely bowled over, when the rest of the world goes crazy about what you thought was your esoteric geeky obsession.</p> <p><strong>Piaw</strong>: I think we were all bowled over by how quickly people started exchanging e-mail addresses, and then web-sites, etc. I was stunned. But it took a really long time for real profits to show up! It took 20 or so search engine companies to start up and fail before someone succeeded!</p> <p><strong>Rebecca</strong>: Of course; you are bringing up what was in fact the big problem. The question was: in what mode is it reasonable to ask the local government for help? And if you are in the situation where $100,000 checks are raining on you out of the sky without you seeming to make the slightest effort to even solicit them, then it seems like only the biggest jerk on the planet would claim to the government that they were Needy and Deserving. Black babies without roofs over their heads are needy and deserving; rich white obnoxious nerds with money raining down on them are not.
But remember, though Philip doesn't seem to be expending much effort in his story, he also said in the late 90's that he had been building web apps for ten years. Who else on the planet in 1999 could show someone a ten year long resume of web app development?</p> <p>As Piaw said, it isn't like picking up the potential wealth really was just a matter of holding out your hand as money rained from the sky. Quite the contrary. It wasn't easy; in fact it was singularly difficult. Sure, Philip talked like it was easy, until you think about how hard it would have been to amass the resume he had in 1999.</p> <p>When the local government talks about how it wants to attract innovators to Boston, to turn the city into a Hub of Innovation, my knee-jerk reaction is -- and what are we, chopped liver? But then I realize that when they say they want to attract innovators, what they really mean is not that they want innovators, but that they want people who can innovate for a reasonable, manageable amount of time, preferably short, and then turn around, quick as quicksilver, and scoop up all the return on investment in that innovation before anyone else can get at it -- and give a big cut in taxes to the city and state! Those are the kind of innovators who are attractive! Those are the kind who properly make Boston the kind of Hub of Innovation the Mayor of Boston wants it to be. Innovators like those in Tech Square or Stata, not so much. We definitely qualify for the Chopped Liver department.</p> <p>And this hurts. It hurts to think that the Mayor of Boston might be treating us with more respect now if we had been better in ~2000 at turning around, quick as quicksilver, and remaking ourselves into people who could scoop up all, or some, or even a tiny fraction of the return on investment of the innovation at which we were then, in a technical sense, well ahead of anyone else. But remaking yourself is not easy! Especially when you realize that the state from which we were remaking ourselves was sort of like the Marines -- a somewhat ascetic state, one that gave you the nerd equivalent of military rations, a tent, maybe a shower every two weeks, and no training in any immediately salable skills whatsoever -- but one that also gave you a community, an identity, a purpose, a sense of who you were that you never expected to change. But all of a sudden we &quot;won,&quot; and all of a sudden there was a tremendous pressure to change. It was like being thrown in the deep end of the pool without swim lessons, and yes we sank, we sank like a stone with barely a dog paddle before making a beeline for the bottom. So we get no respect now. But was this a reasonable thing to expect? What does the mayor of Boston really want? Yes, the sense in which Boston is a Hub of Innovation (for it already is one, it is silly for it to try to become what it already is!) is problematic and not exactly what a Mayor would wish for. I understand his frustration. But I think he would do better to work with his city for what it is, in all its problematic incompetence and glory, than to try to remake it in the image of something else it is not.</p> <p><strong>Rebecca</strong>: On the subject of Problematic Innovators, I was thinking back to the scene in the computer lab where everyone agreed that hoarding domain names was the dumbest idea they had ever heard of. I'm arguing that scooping up return on the investment in innovation was hard, but registering a domain name is the easiest thing in the world.
I think they were free back then, even. If I remember right, they started out free, and then Procter &amp; Gamble registered en masse every name that had even the vaguest etymological relation with the idea of &quot;soap,&quot; at which point the administrators of the system said &quot;Oops!&quot; and instituted registration fees to discourage that kind of behavior -- which, of course, would have done little to deter P&amp;G. They really do want to utterly own the concept of soap. (I find it amusing that P&amp;G was the first at bat in the domain name scramble -- they are not exactly the world's image of a cutting-edge tech-savvy company -- but when it comes to the problem of marketing soap, they quietly dominate.)</p> <p>How can I explain that we were not able to expend even the utterly minimal effort in capturing the return on investment in innovation of registering a free domain name, so as to keep the resulting tax revenues in Massachusetts?</p> <p>Thinking back on it, I don't think it was either incapacity, or lack of foresight, or a will to fail in our duty as Boston and Massachusetts taxpayers. It was something else: it was almost a &quot;semper fidelis&quot;-like group spirit that made it seem dishonorable to hoard a domain name that someone else might want, just to profit from it later. Now one might ask, why should you refrain from hoarding it sooner just so that someone else could grab it and hoard it later? That kind of honor doesn't accomplish anything for anyone!</p> <p>But you have to realize, this was right at the beginning, when the domain name system was brand new and it wasn't at all clear it would be adopted. These were the people who were trying to convince the world to accept this system they had designed and whose adoption they fervently desired. In that situation, honor did make a difference. It wouldn't look good to ask the world to accept a naming system with all the good names already taken. You actually noticed back then when something (like &quot;soap&quot;) got taken -- the question wasn't what was available, the question was what was taken, and by whom. You'd think it wouldn't hurt too much to take one cool name: recently I heard that someone got a $38 million offer for &quot;cool.com.&quot; That's a lot of money! -- would it have hurt that much to offer the world a system with all the names available except, you know, one cool one? But there was a group spirit that was quite worried that once you started down that slope, who knew where it would lead?</p> <p>There were other aspects of infrastructure, deeper down, harder to talk about, where this group ethos was even more critical. You can game an infrastructure to make it easier to personally profit from it -- but it hurts the infrastructure itself to do that. So there was a vehement group discipline that maintained a will to fight any such urge to diminish the value of the infrastructure for individual profit.</p> <p>This partly explains why we were not able, when the time came, to turn around, quick as quicksilver, and scoop up the big profits. To do that would have meant changing, not only what we were good at, but what we thought was right.</p> <p>When I think back, I wonder: why weren't people more scared? When we chose not to register &quot;cool.com&quot; or similar names, why didn't we think, life is hard, the future is uncertain, and money does really make a difference in what you can do?
I think this group ethic was only possible because there was a certain confidence -- the group felt itself party to a deal: in return for being who we are, the government would take care of us, forever. Not until the time when the product achieved sufficient product/market fit that it became appropriate to expect return on investment. Forever.</p> <p>This story might give a different perspective on why it hurts when the Mayor of Boston announces that he wants to make the city a Hub of Innovation. The innovators he already has are chopped liver? Well, it's understandable that he isn't too pleased with the innovators in this story, because they aren't exactly a tax base. But that is the diametric opposite of the deal we thought we had with the government.</p> Work-life balance at Bioware bioware/ Sat, 31 May 2008 00:00:00 +0000 bioware/ <p><i>This is an archive of some posts in a forum thread titled &quot;Beware of Bioware&quot; in a now defunct forum, with comments from that forum as well as blog comments from a now defunct blog that made the first attempt to archive this content. The original posts were deleted shortly after being posted, replaced with &quot;big scary company vs. li'l ol' me.&quot;</i></p> <p><i>Although the comments below seem to be about Bioware's main studio in Edmonton, I knew someone at Bioware Austin during the time period under discussion, which is how I learned about the term &quot;sympathy crunch&quot;, where you're required to be at the office because other teams are &quot;crunching&quot;. I'd never heard of this concept before, so I looked it up and found the following thread around 2008 or so.</i></p> <p><i>Searching for &quot;sympathy crunch&quot; today, in 2024, doesn't return many hits. One of the few hits is a 2011 blog post by a former director at BioWare titled &quot;Loving the Crunch&quot;, which has this to say about sympathy crunches:</i></p> <blockquote> <p><i>If you find yourself working sympathy crunch, in that even though you have no bugs of your own to take care of, don’t be pissed off about it. Play test the game you made! And enjoy it for what it is, something you contributed to. And if that’s not enough to make you happy then be satisfied that every bug you send to one of your co-workers will make them more miserable. (Though do try and be constructive.)</i></p> </blockquote> <p><i>Another one of the few hits is a 2013 recruiting marketing blog post on the Bioware blog, where a developer notes that &quot;We are clearly moving away from the concept of 'sympathy crunch'&quot;. In the 2008 thread below, there's a heated debate over whether the promise by leadership that sympathy crunch had been abolished was kept or not. Even if you ignore the comments from the person I knew at Bioware, these later comments from Bioware employees, especially the recruiting marketing post in 2013, seem to indicate that sympathy crunch was not abolished until well after 2008.</i></p> <hr> <p><i>Milez5858</i></p> <p>EA gets a lot of grief for their employment history.</p> <p>For anyone considering work at Bioware, beware of them as well.</p> <p>They use seriously cult-like tactics to keep their employees toeing the company line, but don't be fooled. The second you don't toe that line you'll be walked out the doors.</p> <p>They love out of country workers because they don't understand the Canadian labour laws. They continually fire people without warning. This is illegal in Canada.
You must warn people of performance problems and give them a certain number of warnings before you are allowed to fire someone.</p> <p>They are smarter about it than EA by offering food and free ice cream and other on site amenities, but it all adds up to a lot of extra hours with no extra pay. <hr> <i>Milez5858</i></p> <p>BTW you could say my post is somewhat personal, but I've not worked for Bioware. I have several friends that do. Three have been walked out the door in the last year for refusing to work more than the 40 hours they are paid for and wanting to spend time with their wives and kids.</p> <p>The friends I have remaining there are from out of country and feel as though they are somewhat trapped. They are unhappy but will only admit it behind very closed doors because they have it in their head they will get black listed or something.</p> <p>I'm an outside observer. I get paid more than these poor kids, and I only work about 35 hours a week. I've always put MY life ahead of my employers, but I work very hard and am dedicated to my job. When I see what in my opinion amounts to cult-like mind control over these young men who are enamoured by the legend of Ray and Greg, the founders of Bioware, I'm almost sickened.</p> <p>I used to think it would be cool to work in the gaming industry, and now I'm just happy as hell I'm not in it.</p> <p>I will certainly be pointing them to this site as a resource and hope that, like any other entertainment industry, they get organized in some fashion. It's absolutely dehumanizing what this industry does to people. <hr> <i>Arty</i></p> <p>Hell is happy?</p> <p>And, didn't EA buy BioWare?</p> <p>Just asking, not complaining. Welcome aboard!</p> <hr> <p><i>Anguirel</i></p> <p>Yes, Bioware and Pandemic were bought by EA. However, it sounds like this started well before that acquisition. It actually sounds more like what amounts to a Cult of Personality (such as you see at Blizzard, Maxis, id, or Ion Storm) where a single person or a few people have such a huge reputation that they can get a more-or-less unlimited supply of reasonably good new hires -- and thus, someone (possibly someone in between said person and the average worker) takes advantage and pushes the employees much harder than they should.</p> <p>In some cases the cults that build up around certain individuals insulate them from bad working conditions (I've heard Maxis enjoyed that happy fate while Will Wright was there, keeping the people inside in much better conditions than the remainder of EA), but in many cases it results in an attitude of &quot;if you don't like it, we can get any number of people to replace you.&quot; Which is what EA certainly had for a long time, and several other studios still have, though the people who work at them tend to be quieter about it (and I don't think any other studios ever got to EA's legendary status of continuous crunch). <hr> <i>Milez5858</i></p> <p>Yes, these things were happening well ahead of the EA purchase of Bioware.</p> <p>You are exactly accurate about the Cult of Personality thing. People think that Ray and Greg are looking out for them. Bioware Pandemic just sold for nearly a billion dollars. People are marked for assassination for far smaller stakes. Can anyone really believe that the owners are in there saying... &quot;ya know.. I can't accept your billion dollars unless I know the employees are really well taken care of&quot;. It defies logic.</p> <p>I suspect it will just get worse.
Now people that have been forced to leave the company are also being told that they MUST sell their stocks back to the company.</p> <p>I don't know enough about it to say if it's legal or not, but it sure sounds fishy to me. Again I've recommended to people that they check with a lawyer first, but nobody wants any more hassle than they already have. Unfortunately it's this attitude that keeps the gaming industry from getting organized.</p> <p>I would love to be able to say this sort of intimidation is anomalous, but that would just be untrue -- it plagues the industry. Maybe we should drive this ubiquitous arrogance out of the development process; I'm crazy enough to believe this can happen without the U-word. <hr> <i>Anonymous</i></p> <p>The truth is life at Bioware is both as bad and not as bad as implied in this article. But mostly not as bad.</p> <p>The original article states that Bioware uses cult tactics. I dont know what cult tactics are so I cant give a simple answer but I know that Bioware uses traditional tactics to make for good morale. They give you the free breakfast. They bring you dinner if you are working overtime. They give out Oilers tickets or tickets to theatre or gokarts and the like. They let you wander away from your desk for an hour to go to the lunch room and take a nap on the sofa or play video games. Or they let you leave to run errands or go out for an hour for coffee. They also have company meetings where they highlight the development of the various projects ongoing. On this last one some think its a great way to keep up on other projects while some think its just a way to keep employees excited about everything. Is any of this a cult tactic?</p> <p>Bioware is notoriously loyal. And the managers are notoriously wimpy and avoid confrontation. There isnt a way to emphasise this enough. In fact people joke that they havent done good work in months and yet they receive a strong review and a raise or more stock options. Getting fired as a fulltime employee from Bioware is such a shockingly rare event that there have been company meetings or emails sent out to explain the firing. Employees are given so many chances to get it right when they do get a bad review. This is all different for contract employees. A contract employee who sucks wont be fired but they will have their contract not renewed. A good contract employee is always offered an opening for fulltime (since contract employees do get overtime and making them fulltime saves money) or if there is none has their contract renewed.</p> <p>I dont know much about HR so I dont know about the hiring tactics but I can agree that a lot of employees come from around the world or outside of the Edmonton area. It was always assumed that they couldnt find good talent around Edmonton. I cant comment much more than that because I dont know. I know there are lots of recruiting drives all over.</p> <p>Hours are definitely a problem and I can agree wholeheartedly:</p> <p>Some think it gets better each project cycle but its just transposed. In the early days the staff might work 30 to 40 hour weeks for a year or two and then come to the end and realize there was too much to be done. They spend the last few months working at least double the hours. In later projects things were more controlled by project managers and producers and it turned into what some call death crunch or death marches. Employees start working 50 hour weeks a year or two before ship.
It isnt much extra hours and no one complains much but there are problems if an employee wants to make plans and never knows if she has to work or not. Managers are good about making sure employees can take time off and get lots of extra time off as compensation for the work but a lot is still asked of employees.</p> <p>When the time comes closer to release the employees have their hours scaled up more and more. It might start as 9-9 on Tuesday and Thursday, and then become 9-9 Monday until Thursday. Then it's Saturday 10-4. Then it's 9-9 Friday nights too. Then it's 9-5 Saturday. And in desperate times they even say maybe 12-4 on Sunday.</p> <p>Morale gets low when employees think the game is awful and they cant get it done right. Thats when management tells the staff that they have decided to not ship the game until it is done and that they are extending the release date. This is good for the title but still hurts morale when employees think of having to work so much longer when they were working so hard to make a date. For example to use the most recent title Mass Effect employees were told the game would ship in Christmas of 2006. Then it was pushed to February of 2007 and then later spring and then June. But we know the game didnt get shipped until November. That does wear on employees.</p> <p>Each time it gets pushed back the management cuts hours for a few weeks or gives out breaks of a few extra days off or a four-day weekend to recharge employees. Plus people say &quot;in the old days we worked 90 hours or 100 or more. Now its only 50 or 60 or maybe in rare times sometimes 70 so this is great.&quot; which is meant to make you remember that you are making a game and should be happy to hang out at Bioware for only 50 or 60 hours a week to make “the best games ever”.</p> <p>After a game is shipped people take long breaks, weeks at a time. Then they slowly ease back into it. They get maybe 20 or 30 hours of work assigned per week. This can go on for months before employees return to regular 40 hour weeks. And that happens for many months or a year before the project pushes for release and the cycle starts again.</p> <p>This is not an indictment of Bioware. All companies in video gaming do this. It's unfair to point at Bioware as an exception, except as a good one to me.</p> <p>People leave all the time for these reasons. More have left in the last year. Maybe two years. Maybe three. Im not sure. But Bioware hires so many people and the ones who leave are generally not as good as the ones who are hired so it works out. Many people in Bioware wish more people would leave. Maybe with less dead weight, since no one ever gets fired, there could be a stronger staff who works more efficiently.</p> <p>When EA Spouse was out everyone at Bioware got curious about it. But then the reports came out about what EA Spouse's husband was working on and the situations he was put into, and everyone at Bioware realised that they had nothing anywhere close to as bad as that. Bioware employees enjoy comfort and support by managers and long projects but not 100 hour weeks. And some employees do manage to hold on to 9-5 for very long stretches if they can get their work done to the highest standards.</p> <p>But still people hoped that the industry in general would improve. Many people at Bioware would be thrilled to have nothing else change other than to never have to work more than 40 or 45 hours.
But also many people at Bioware are workaholics who would never work less than 50 or 60 and they always create a dangerous precedent and control the pace unfairly.</p> <p>Everyone at Bioware is aware that EA owns them now but no one gives it much thought. Nothing has changed. Life is the same in every way except for more money. Its easy to forget that Bioware was ever bought by EA because the culture is unchanged and no one from EA is coming in and yelling about how things have to be changed. Thats kind of amazing considering Bioware doesnt make really big selling titles and probably deserves someone like EA coming in and saying this is how to do it. Bioware titles eventually after a year or two sell a million or two but they never have that 5 million in the first month kind of release that the big titles get and that Bioware sorely wants.</p> <p>If you want to work video games you are going to work lots of hours no matter where you go but Bioware is a great place to do it because they do treat you well. Many people leave Bioware and write back to say they regret it. Some come back. Some also do find a better life elsewhere but say that Bioware life is good too and that they miss many facets of it. I think more people leave Bioware to get away from Deadmonton then any other reason!!</p> <p>Bioware censoring that article if they did isnt surprising to me because they are very controlling of their image. They only want positive talk about them and want to protect fragile egos and morale of the employee staff. Censoring bad publicity doesnt make the bad publicity true. Bioware just doesnt want that out there. <hr> <i>Anonymous</i></p> <p>I've worked my share of crunches (almost 400 hours in three months during summer) on various projects. My takeaway from that is:</p> <p>1) Don't be an ass about it. If you need to crunch, admit what the reason is and create a sensible plan for the crunch.</p> <p>People will take their free time any way they can. Crunches that last over two weeks are too much without breaks in-between and lead to more errors than actual work being done.</p> <p>2) Realize that a full day's job has on average about 5-6 hours of actual work that benefits the company. With crunch, you can have people in the office for 12+ hours a day, but the additional hours don't really pay off that well in comparison.</p> <p>3) Pretty much every developer I know is very savvy with the industry and how it operates. Most of them put their family and life above work and wouldn't hesitate to resign in a minute if the crunch or the company seems unfair or badly managed. Again, don't be an ass about the crunch.</p> <p>4) I don't want to spend years and make a shit game. It's a waste of my life and time. I'll crunch for you if you plan it with a brain in your head. If you don't, I'll do something more relevant with my life.</p> <p>What anonymous posts about Bioware sounds very much like normal circumstances and I'd be willing to work in a company like that. I want to wander away from my desk to play a demo or whatever and I like people trusting me that yes, I will do my tasks by the deadline. In fact, I think that's how every game company should operate. I'm glad I haven't experienced it any other way yet during the years. <hr> <i>Anonymous</i></p> <p>While those kinds of hours may be typical or expected, they are in no way excusable.
Companies that ask people to work 60 hour weeks for long periods of time in the middle of a project are either a) run by incompetent project managers or b) guilty of taking advantage of their teams.</p> <p>Just because it's the video game industry, many people think that it is just par for the course. I call shenanigans, and so did Erin Hoffman (EA Spouse). Thank you, Erin, for getting this kind of crap out of the closet.</p> <p>Not every company is like that. I work at a fantastic game company right now that is creating a AAA game for the Wii. We've hit some tough times at points, but our crunch times are 50 hour weeks, and we almost never do them back-to-back.</p> <p>The idea that you have to suck up and deal sometimes in games? Yes, absolutely. We're way too young as an industry with our production practices and there is a ton of money at stake. The idea that you should be forced to do long stretches of 60 or more hour weeks out of love for a project? As a responsible company owner, you need to turn around and either put some more resources onto the problem, or start cutting scope. Because at that point, you're abusing your team for the mistakes you have made.</p> <hr> <p><i>Anonymous</i></p> <p>WOW, you folks are lucky to think that 60 hour weeks are some kind of exception in modern corporate America. I'm a gamer, not someone who works in the games industry, but I do work in film production in Hollywood, and these hours are truly par for the course out here. I'm talking about EVERY DAY, in by 9 am and you don't leave before 7:30 pm, and you're expected to take home scripts or novels to read and write up for the following day. This is not highly paid, either: average salary for a &quot;creative executive&quot; is in the 50K range and you're working 65 hour weeks, not including any reading or additional work that's required outside of the office.</p> <p>I'm not saying this is excusable, just that this seems to be the trend in America today, and it's afflicting a lot of industries, from banking to lawyering to consulting to video games and film. And the sad truth is, most of the &quot;work&quot; people are doing during their 10 or 12 hour work day could be finished in 5 hours if employees weren't such believers in presenteeism, or the idea that the amount of hours you sit at a desk = your productivity as a worker.</p> <p>Anyway, I just wanted to chime in to say these abuses are not symptomatic of just the games industry but of American white collar work in general, so... beware! The fact is, in competitive industries like these, there are so many willing workers who will step in and suffer these kinds of abuses that we have little power to organize or protect ourselves. Good luck out there.</p> <hr> <p><i>Anonymous</i></p> <p>I worked at Bio for a few years and I can verify that what the OP said is true. Greg and Ray have spent years cultivating a perception in the gaming industry that they are somehow better than other companies. They aren't. I am not saying they are monsters, because in person they are both pretty cool guys. However, when it comes to business they are pretty ruthless. I think it's mostly Ray. They continually promise that things will get better and they don't. Almost every project at Bio has had extended crunch because the project directors always plan way more than they are able to deliver. So, the employees suffer and the directors get a pat on the back and more and bigger shares. The latest example is Mass Effect. There was a 9 month crunch on that game.
Some people came close to nervous breakdowns. They implemented sympathetic crunch which they also promised they had abolished. That's where the whole team has to be on site just in case something goes wrong even if they don't have anything to do. What it's really about though is the politics of making sure that the programmers don't get pissed that the artist got off early or whoever. I think it's just going to get worse under EA. Eventually people will realize that BioWare is just like every other crunchy game dev out there.</p> <hr> <p><i>Anonymous</i></p> <p>(FYI, I posted before, but with the number of anonymous posters I'll clarify I'm the one who spoke about a 400 hour summer crunch)</p> <p>I've never understood or had to withstand a sympathetic crunch - even though our projects had crunches, it was more about sharing workload or just realizing that yes, the coders do have more work ahead for them.</p> <p>I recently talked to a U.S. friend of mine who was astounded by my 4 week summer vacation and 1 week winter vacation. She couldn't really imagine it since she had never had one. The most she had was 1 week off during a year. Add to that the 10+ hour workdays and I can't understand how you cope with that.</p> <p>I work in northern Europe, do 8 hour days 90% of the time and at my current place I can show the total crunch hours with 10 fingers.</p> <p>When you're young, 400 hour crunches and sleeping in a sleeping bag might not be a big deal, but nearing 30, you'll get more interested in your rights and such. The 400 hours for me was a good eye opener on how not to do things. It was valuable, but never again. It took me more than a year to recover the friendships and social aspects of my life after that.</p> <hr> <p><i>Anonymous</i></p> <blockquote> <p>I think more people leave Bioware to get away from Deadmonton then any other reason!!</p> </blockquote> <p>Hey! Nuts to you!</p> <p>(Edmontonian here)</p> <p>I'm acquainted with some people in Bioware. Their loyalty to the company is astonishing - if, as previous posters have said, there is a cultivated sense that Bioware's better than other game companies then it's absolutely taken root. They joke that it feels almost like living in a self-sustained arcology.</p> <p>I remember one of them dismissing EA's Bioware takeover as no big deal, business as usual, why would anything change blah blah blah. I'm sure it was supposed to sound reassuring but it came off as naively dismissive in light of EA's infamous track record.</p> <p>This is the first time I've actually seen numbers on their crunch time, which is disappointing if true. It smacks of poor management and I guess I had some of that fairy-tale view of the company. I hope for Bioware's sake it doesn't get any worse.</p> <hr> <p><i>Anonymous</i></p> <p>One anonymous to another (from this page, I'm the anonymous who defended Bioware earlier), I will respond to this comment -- They implemented sympathetic crunch which they also promised they had abolished.</p> <p>My response is that they didn't implement that (??). People were on call of course but they didn't tell everyone to be there because some people had to work. That's incorrect. If a manager thought their team had deliverables to make for sprints then they told their team to be there. If people were behind they had to be there. If people could help the team they had to be there.
But that was almost always up to individual managers and if an employee had to be there just because, then that is the fault of an individual manager and that individual manager should have been discussed with Casey or the project manager.</p> <p>Not disagreeing with the rest of your post which pretty much said the same as mine. But this one point was inaccurate.</p> <hr> <p><i>Anonymous</i></p> <p>Anon, in response to your argument about the sympathetic crunch: Of course it's the manager's fault, but do you really think anyone is going to buck Casey and not suffer for it? They never come right out and say that people have to be there for political reasons, they couch it in all kind of ways. All I know is that I spent more than a few nights at work until 2:00 or 3:00 AM because something might go wrong even though I lived less than 30 minutes away and told them I could be called at any time. I also know that I was taken to task for always asking to leave when I was done because of the perception it created amongst people that were forced to stay.</p> History of Symbolics lisp machines symbolics-lisp-machines/ Fri, 16 Nov 2007 00:00:00 +0000 symbolics-lisp-machines/ <p><em>This is an archive of Dan Weinreb's comments on Symbolics and Lisp machines.</em></p> <h3 id="rebuttal-to-stallman-s-story-about-the-formation-of-symbolics-and-lmi">Rebuttal to Stallman’s Story About The Formation of Symbolics and LMI</h3> <p>Richard Stallman has been telling a story about the origins of the Lisp machine companies, and the effects on the M.I.T. Artificial Intelligence Lab, for many years. He has published it in a book, and in a widely-referenced paper, which you can find at <a href="http://www.gnu.org/gnu/rms-lisp.html">http://www.gnu.org/gnu/rms-lisp.html</a>.</p> <p>His account is highly biased, and in many places just plain wrong. Here’s my own perspective on what really happened.</p> <p>Richard Greenblatt’s proposal for a Lisp machine company had two premises. First, there should be no outside investment. This would have been totally unrealistic: a company manufacturing computer hardware needs capital. Second, Greenblatt himself would be the CEO. The other members of the Lisp machine project were extremely dubious of Greenblatt’s ability to run a company. So Greenblatt and the others went their separate ways and set up two companies.</p> <p>Stallman’s characterization of this as “backstabbing”, and his claim that Symbolics decided to “not have scruples”, is pure hogwash. There was no backstabbing whatsoever. Symbolics was extremely scrupulous. Stallman’s characterization of Symbolics as “looking for ways to destroy” LMI is pure fantasy.</p> <p>Stallman claims that Symbolics “hired away all the hackers” and that “the AI lab was now helpless” and “nobody had envisioned that the AI lab’s hacker group would be wiped out, but it was” and that Symbolics “wiped out MIT”. First of all, had there been only one Lisp machine company as Stallman would have preferred, exactly the same people would have left the AI lab. Secondly, Symbolics only hired four full-time and one part-time person from the AI lab (see below).</p> <p>Stallman goes on to say: “So Symbolics came up with a plan. They said to the lab, ‘We will continue making our changes to the system available for you to use, but you can’t put it into the MIT Lisp machine system.
Instead, we’ll give you access to Symbolics’ Lisp machine system, and you can run it, but that’s all you can do.’” In other words, software that was developed at Symbolics was not given away for free to LMI. Is that so surprising? Anyway, that wasn’t Symbolics’s “plan”; it was part of the MIT licensing agreement, the very same one that LMI signed. LMI’s changes were all proprietary to LMI, too.</p> <p>Next, he says: “After a while, I came to the conclusion that it would be best if I didn’t even look at their code. When they made a beta announcement that gave the release notes, I would see what the features were and then implement them. By the time they had a real release, I did too.” First of all, he really was looking at the Symbolics code; we caught him doing it several times. But secondly, even if he hadn’t, it’s a whole lot easier to copy what someone else has already designed than to design it yourself. What he copied were incremental improvements: a new editor command here, a new Lisp utility there. This was a very small fraction of the software development being done at Symbolics.</p> <p>His characterization of this as “punishing” Symbolics is silly. What he did never made any difference to Symbolics. In real life, Symbolics was rarely competing with LMI for sales. LMI’s existence had very little to do with Symbolics’s bottom line.</p> <p>And while I’m setting the record straight, the original (TECO-based) Emacs was created and designed by Guy L. Steele Jr. and David Moon. After they had it working, and it had become established as the standard text editor at the AI lab, Stallman took over its maintenance.</p> <p>Here is the list of Symbolics founders. Note that Bruce Edwards and I had worked at the MIT AI Lab previously, but had already left to go to other jobs before Symbolics started. Henry Baker was not one of the “hackers” of which Stallman speaks.</p> <ul> <li>Robert Adams (original CEO, California)</li> <li>Russell Noftsker (CEO thereafter)</li> <li>Minoru Tonai (CFO, California)</li> <li>John Kulp (from MIT Plasma Physics Lab)</li> <li>Tom Knight (from MIT AI Lab)</li> <li>Jack Holloway (from MIT AI Lab)</li> <li>David Moon (half-time at MIT AI Lab)</li> <li>Dan Weinreb (from Lawrence Livermore Labs)</li> <li>Howard Cannon (from MIT AI Lab)</li> <li>Mike McMahon (from MIT AI Lab)</li> <li>Jim Kulp (from IIASA, Vienna)</li> <li>Bruce Edwards (from IIASA, Vienna)</li> <li>Bernie Greenberg (from Honeywell CISL)</li> <li>Clark Baker (from MIT LCS)</li> <li>Chris Terman (from MIT LCS)</li> <li>John Blankenbaker (hardware engineer, California)</li> <li>Bob Williams (hardware engineer, California)</li> <li>Bob South (hardware engineer, California)</li> <li>Henry Baker (from MIT)</li> <li>Dave Dyer (from USC ISI)</li> </ul> <h3 id="why-did-symbolics-fail">Why Did Symbolics Fail?</h3> <p>In a comment on a previous blog entry, I was asked why Symbolics failed. The following is oversimplified but should be good enough. My old friends are very welcome to post comments with corrections or additions, and of course everyone is invited to post comments.</p> <p>First, remember that at the time Symbolics started around 1980, serious computer users used timesharing systems. The very idea of a whole computer for one person was audacious, almost heretical. Every computer company (think Prime, Data General, DEC) did their own hardware and their own software suite. There were no PCs, no Macs, no workstations. 
At the MIT Artificial Intelligence Lab, fifteen researchers shared a computer with a .001 GHz CPU and .002 GB of main memory.</p> <p>Symbolics sold to two kinds of customers, which I’ll call primary and secondary. The primary customers used Lisp machines as software development environments. The original target market was the MIT AI Lab itself, followed by similar institutions: universities, corporate research labs, and so on. The secondary customers used Lisp machines to run applications that had been written by some other party.</p> <p>We had great success amongst primary customers. I think we could have found a lot more of them if our marketing had been better. For example, did you know that Symbolics had a world-class software development environment for Fortran, C, Ada, and other popular languages, with amazing semantics-understanding in the editor, a powerful debugger, the ability for the languages to call each other, and so on? We put a lot of work into those, but they were never publicized or advertised.</p> <p>But we knew that the only way to really succeed was to develop the secondary market. ICAD made an advanced constraint-based computer-aided design system that ran only on Symbolics machines. Sadly, they were the only company that ever did. Why?</p> <p>The world changed out from under us very quickly. The new “workstation” category of computer appeared: the Suns and Apollos and so on. New technology for implementing Lisp was invented that allowed good Lisp implementations to run on conventional hardware; not quite as good as ours, but good enough for most purposes. So the real value-added of our special Lisp architecture was suddenly diminished. A large body of useful Unix software came to exist and was portable amongst the Unix workstations: no longer did each vendor have to develop a whole software suite. And the workstation vendors got to piggyback on the ever-faster, ever-cheaper CPU’s being made by Intel and Motorola and IBM, with whom it was hard for Symbolics to keep up. We at Symbolics were slow to acknowledge this. We believed our own “dogma” even as it became less true. It was embedded in our corporate culture. If you disputed it, your co-workers felt that you “just didn’t get it” and weren’t a member of the clan, so to speak. This stifled objective analysis. (This is a very easy problem to fall into — don’t let it happen to you!)</p> <p>The secondary market often had reasons that they needed to use workstation (and, later, PC) hardware. Often they needed to interact with other software that didn’t run under Symbolics. Or they wanted to share the cost of the hardware with other applications that didn’t run on Symbolics. Symbolics machines came to be seen as “special-purpose hardware” as compared to “general-purpose” Unix workstations (and later Windows PCs). They cost a lot, but could not be used for the wider and wider range of available Unix software. Very few vendors wanted to make a product that could only run on “special-purpose hardware”. (Thanks, ICAD; we love you!)</p> <p>Also, a lot of Symbolics sales were based on the promise of rule-based expert systems, of which the early examples were written in Lisp. Rule-based expert systems are a fine thing, and are widely used today (but often not in Lisp). But they were tremendously over-hyped by certain academics and by their industry, resulting in a huge backlash around 1988. 
“Artificial Intelligence” fell out of favor; the “AI Winter” had arrived.</p> <p>(Symbolics did launch its own effort to produce a Lisp for the PC, called CLOE, and also partnered with other Lisp companies, particularly Gold Hill, so that customers could develop on a Symbolics and deploy on a conventional machine. We were not totally stupid. The bottom line is that interest in Lisp just declined too much.)</p> <p>Meanwhile, back at Symbolics, there were huge internal management conflicts, leading to the resignation of much of top management, who were replaced by the board of directors with new CEOs who did not do a good job, and did not have the vision to see what was happening. Symbolics signed long-term leases on big new offices and a new factory, anticipating growth that did not come, and were unable to sublease the properties due to office-space gluts, which drained a great deal of money. There were rounds of layoffs. More and more of us realized what was going on, and that Symbolics was not reacting. Having created an object-oriented database system for Lisp called Statice, I left in 1988 with several co-workers to form Object Design, Inc., to make an object-oriented database system for the brand-new mainstream object-oriented language, C++. (The company was very successful and currently exists as the ObjectStore division of Progress Software (www.objectstore.com). I’m looking forward to the 20th-year reunion party next summer.)</p> <p>Symbolics did try to deal with the situation, first by making Lisp machines that were plug-in boards that could be connected to conventional computers. One problem is that they kept betting on the wrong horses. The MacIvory was a Symbolics Ivory chip (yes, we made our own CPU chips) that plugged into the NuBus (oops, long-since gone) on a Macintosh (oops, not the leading platform). Later, they finally gave up on competing with the big chip makers, and made a plug-in board using a fast chip from a major manufacturer: the DEC Alpha architecture (oops, killed by HP/Compaq, should have used the Intel). By this time it was all too little, too late.</p> <p>The person who commented on the previous blog entry referred to an MIT Master’s thesis by one Eve Philips (see <a href="http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/ai-business.pdf">http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/ai-business.pdf</a>) called “If It Works, It’s Not AI: A Commercial Look at Artificial Intelligence Startups”. This is the first I’ve heard of it, but evidently she got help from Tom Knight, who is one of the other Symbolics co-founders and knows as much or more about Symbolics history than I. Let’s see what she says.</p> <p>Hey, this looks great. Well worth reading! She definitely knows what she’s talking about, and it’s fun to read. It brings back a lot of old memories for me. If you ever want to start a company, you can learn a lot from reading “war stories” like the ones herein.</p> <p>Here are some comments, as I read along. Much of the paper is about the AI software vendors, but their fate had a strong effect on Symbolics.</p> <p>Oh, of course, the fact that DARPA cut funding in the late ’80s is very important. Many of the Symbolics primary-market customers had been ultimately funded by DARPA research grants.</p> <p>Yes, there were some exciting successes with rule-based expert systems. 
Inference’s “Authorizer’s Assistant” for American Express, to help the people who talk to you on the phone to make sure you’re not using an AmEx card fraudulently, ran on Symbolics machines. I learn here that it was credited with a 45-67% internal rate of return on investment, which is very impressive.</p> <p>The paper has an anachronism: “Few large software firms providing languages (namely Microsoft) provide any kind of Lisp support.” Microsoft’s dominance was years away when these events happened. For example, remember that the first viable Windows O/S, release 3.1, came out in 1990. But her overall point is valid.</p> <p>She says “There was a large amount of hubris, not completely unwarranted, by the AI community that Lisp would change the way computer systems everywhere ran.” That is absolutely true. It’s not as wrong as it sounds: many ideas from Lisp have become mainstream, particularly managed (garbage-collected) storage, and Lisp gets some of the credit for the acceptance of object-oriented programming. I have no question that Lisp was a huge influence on Java, and thence on C#. Note that the Microsoft Common Language Runtime technology is currently under the direction of the awesome Patrick Dussud, who was the major Lisp wizard from the third MIT-Lisp-machine company, Texas Instruments.</p> <p>But back then we really believed in Lisp. We felt only scorn for anyone trying to write an expert system in C; that was part of our corporate culture. We really did think Lisp would “change the world” analogously to the way “sixties-era” people thought the world could be changed by “peace, love, and joy”. Sorry, it’s not that easy.</p> <p>Which reminds me, I cannot recommend highly enough the book “Patterns of Software: Tales from the Software Community” by Richard Gabriel (<a href="http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf">http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf</a>) regarding the process by which technology moves from the lab to the market. Gabriel is one of the five main Common Lisp designers (along with Guy Steele, Scott Fahlman, David Moon, and myself), but the key points here go way beyond Lisp. This is the culmination of the series of papers by Gabriel starting with his original “Worse is Better”. Here the ideas are far more developed. His insights are unique and extremely persuasive.</p> <p>OK, back to Eve Philips: in chapter 5 she describes “The AI Hardware Industry”, starting with the MIT Lisp machine. Does she get it right? Well, she says “14 AI lab hackers joined them”; see my previous post about this figure, but in context this is a very minor issue. The rest of the story is right on. (She even mentions the real-estate problems I pointed out above!) She amply demonstrates the weaknesses of Symbolics management and marketing, too. This is an excellent piece of work.</p> <p>Symbolics was tremendously fun. We had a lot of success for a while, and went public. My colleagues were some of the most skilled and likable technical people you could ever hope to work with. I learned a lot from them. I wouldn’t have missed it for the world.</p> <p>After I left, I thought I’d never see Lisp again. But now I find myself at ITA Software, where we’re writing a huge, complex transaction-processing system (a new airline reservation system, initially for Air Canada), whose core is in Common Lisp. We almost certainly have the largest team of Common Lisp programmers in the world. 
Our development environment is OK, but I really wish I had a Lisp machine again.</p> <h3 id="more-about-why-symbolics-failed">More about Why Symbolics Failed</h3> <p>I just came across “Symbolics, Inc: A failure of heterogeneous engineering” by Alvin Graylin, Kari Anne Hoir Kjolaas, Jonathan Loflin, and Jimmie D. Walker III (it doesn’t say with whom they are affiliated, and there is no date), at <a href="http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/Symbolics.pdf">http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/Symbolics.pdf</a></p> <p>This is an excellent paper, and if you are interested in what happened to Symbolics, it’s a must-read.</p> <p>The paper’s thesis is based on a concept called “heterogeneous engineering”, but it’s hard to see what they mean by that other than “running a company well”. They have fancy ways of saying that you can’t just do technology, you have to do marketing and sales and finance and so on, which is rather obvious. They are quite right about the wide diversity of feelings about the long-term vision of Symbolics, and I should have mentioned that in my essay as being one of the biggest problems with Symbolics. The random directions of R&amp;D, often not co-ordinated with the rest of the company, are well-described here (they had good sources, including lots of characteristically, harshly honest email from Dave Moon). The separation between the software part of the company in Cambridge, MA and the hardware part of the company in Woodland Hills (later Chatsworth) CA was also a real problem. They say “Once funds were available, Symbolics was spending money like a lottery winner with new-found riches” and that’s absolutely correct. Feature creep was indeed extremely rampant. The paper also has financial figures for Symbolics, which are quite interesting and revealing, showing a steady rise through 1986, followed by falling revenues and negative earnings from 1987 to 1989.</p> <p>Here are some points I dispute. They say “During the years of growth Symbolics had been searching for a CEO”, leading up to the hiring of Brian Sear. I am pretty sure that only happened when the trouble started. I disagree with the statement by Brian Sear that we didn’t take care of our current customers; we really did work hard at that, and I think that’s one of the reasons so many former Symbolics customers are so nostalgic. I don’t think Russell is right that “many of the Symbolics machines were purchased by researchers funded through the Star Wars program”, a point which they repeat many times. However, many were funded through DARPA, and if you just substitute that for all the claims about “Star Wars”, then what they say is right. The claim that “the proliferation of LISP machines may have exceeded the proliferation of LISP programmers” is hyperbole. It’s not true that nobody thought about a broader market than the researchers; rather, we intended to sell to value-added resellers (VAR’s) and original equipment manufacturers (OEM’s). The phrase “VARs and OEMs” was practically a mantra. Unfortunately, we only managed to do it once (ICAD). While they are right that Sun machines “could be used for many other applications”, the interesting point is the reason for that: why did Sun’s have many applications available? The rise of Unix as a portable platform, which was a new concept at the time, had a lot to do with it, as well as Sun’s prices. They don’t consider why Apollo failed.</p> <p>There’s plenty more. 
To the authors, wherever you are: thank you very much!</p> Subspace / Continuum History subspace-history/ Wed, 01 Feb 2006 00:00:00 +0000 subspace-history/ <p>Archived from an unknown source. Possibly Gravitron?</p> <p>In regards to the history:</p> <h3 id="chapter-1">Chapter #1</h3> <p>(Around) December 1995 is when it all started. Rod Humble wished to create something like Air Warrior, but online. He approached Virgin Interactive Entertainment with the idea and they replied with something along the lines of &quot;here's the cash, good luck&quot;. Rod called Jeff Petersen and asked if he was interested in helping him create an online-only game, and Jeff agreed. In order to overcome lag, Jeff decided it would be best to test an engine that simulates Newtonian physics principles - an object in motion tends to keep the same vector - and on top of that he also built prediction formulas (see the sketch just below). They also enlisted Juan Sanchez, with whom Rod had worked before, to design some graphics for it. Either way, after some initial progress they decided to put it in front of the gaming public to test it, code-named Sniper. They got a few people to play it and give feedback. After a short alpha testing period they decided that they had learned enough and pulled the plug. The shockwave that came back from the community when the announcement was received impressed them enough to keep it in development. They moved on to beta in early-mid '96 and dubbed it SubSpace. From there it entered a real development cycle and was opened up and advertised by word of mouth to many people. Michael Simpson (Blackie) was assigned by VIE as an external producer from their Westwood Studios division to serve as a promotional agent, community manager, and overall public relations person. Jeff and Rod started preparing to leave VIE.</p>
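<p>To make the idea of &quot;prediction formulas&quot; concrete, here is a minimal, hypothetical dead-reckoning sketch in Python. It only illustrates the general technique of extrapolating a remote ship from its last known position and velocity so that motion stays smooth between laggy network updates; it is not SubSpace's actual code, and every name in it is invented.</p> <pre><code># Illustrative dead-reckoning sketch (not SubSpace's real engine).
# Each client extrapolates a remote ship from the last state it received,
# then snaps to the true state whenever a fresh update arrives.
class RemoteShip:
    def __init__(self, x, y, vx, vy, timestamp):
        self.x, self.y = x, y          # last reported position
        self.vx, self.vy = vx, vy      # last reported velocity
        self.timestamp = timestamp     # time of the last update

    def predict(self, now):
        # Newtonian assumption: an object in motion keeps the same vector.
        dt = now - self.timestamp
        return (self.x + self.vx * dt, self.y + self.vy * dt)

    def update(self, x, y, vx, vy, timestamp):
        # When a real packet arrives, replace the stale state.
        self.x, self.y, self.vx, self.vy = x, y, vx, vy
        self.timestamp = timestamp
</code></pre>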
<h3 id="chapter-2">Chapter #2</h3> <p>In late '97, SubSpace officially entered its retail cycle, with pre-orders being collected and demo privileges being revoked (demo clients were confined to 15 minutes of play and only the first four ships were accessible). The reason behind this was that VIE was going downhill, losing money (the only things keeping them afloat this long were Westwood and the C&amp;C franchise), and tried to cash out on any bone they could get. In early '98 a small effort, primarily on Juan's part and mainly talked up by Michael, began on SubSpace 2; it soon enough dissolved to dust and was never discussed again. Skipping to '98, VIE classified SubSpace as a B-rated product, which meant it got no advertising budget. In addition, they manufactured a mere 10,000 or so copies of it and tossed it to only a select few retailers for $30 a box. Along those lines, VIE also lost an opportunity to sell SubSpace to Microsoft, as part of &quot;The Zone&quot;, for a very nice sum - a deal which would've ensured the game's long-term success and continuance. The deal fell through the cracks due to the meddling of Viacom, who owned VIE at the time, until it was completely screwed up. Rod and Jeff, enraged by all of this, realized it was over and notified VIE several months ahead of their contracts expiring (they were employed on a 1-year contract with an option to extend it another year at a time) of their intention to leave and go independent. They tried to negotiate with VIE to enter a developer-publisher relationship; naturally it didn't work, and they went their separate ways. In October '98 word broke from an inside source, and rumors, which would later be proven true, began to fly about VIE going bankrupt and SubSpace being abandoned, left without support or a development team. Although frantically denied by Michael, the horror was proven true, and not long after, VIE officially announced the shutdown of SubSpace and the complete withdrawal of support, accompanied by a Chapter 11 filing and a sell-off of its remaining assets (Viacom had already sold Westwood to Electronic Arts, along with Michael). The owners of Brilliant Digital Entertainment (Kazaa/Altnet.com) created an asset holding company called Ozaq2, and are now the sole holders of the SubSpace copyrights. By then, the original developers were long gone.</p> <h3 id="chapter-3">Chapter #3</h3> <p>In early '98/late '97 the ex-SubSpace developers - Rod, Jeff, and Juan - move to Origin, which contracts them to create Crusader Online. Unfortunately, however, after they produce an alpha version, Origin executes a clause which states that they may pull the plug if they do not like the demo, and terminates the project. Nick Fisher (AKA Trixter, as he is known in SubSpace) approaches them, and together they form Harmless Games. Their first task is taking what they had done and building it further into a viable, profitable online game; the Crusader Online demo is redubbed Infantry (Online). On a side note, I have no idea if what they made was what is known as the would-be &quot;Crusader: No Mercy&quot; (the so-called online version of Crusader, with only 1, possibly fake, screenshot of the project ever released). Nick creates the GameFan Network, which will rehost warzone.com and Infantry's gaming servers, among other websites and deals. Jeff has the game quickly plow through pre-alpha, rapidly working it up to be suitable for alpha testing. Larry Cordner is contracted to create content editors for the game, though he won't be staying on pay for long (and disappears/is sacked upon the move to SOE). By October '98 the Harmless Games site is put up, along with the &quot;most&quot; official Infantry section, which is the only part of that site that gets any attention at all. Juan creates for Rod the insignia of HG - the Tri-Nitro-Teddy. Jeremy Weeks is contracted to create secondary concept art for Infantry. In November '98 HG officially announces Infantry; alpha testing is to commence shortly. Juan, with his part of the artwork finished, leaves the team. In March '99, after many attempts at signing a publishing deal kept falling through, HG officially announces BrainScan, a company founded by Nick, as the game's publisher; beta testing is to begin later that year, with a full pay-to-play release scheduled not far behind. Rod and Jeff clash over Rod's desire to bring Infantry to Verant/Studio 989 (later renamed SOE) via his connections; Rod eventually leaves Infantry and HG to take a high position at Sony Online Entertainment (senior executive of games development, I believe it was). In late 2000, due to the dot-com crash, express.com fails to pay GameFan Network its dues (advertising banner payments, of course) and GFN crashes and burns under millions of dollars of debt it cannot cover (as does, quietly, BrainScan). Infantry's servers are slated for shutdown; the hunt for a new host/publisher begins. 
And so they contact Rod, and eventually all of the intellectual properties owned by Nick are sold to SOE (the ICQ-esque EGN2 program as well), with Infantry &amp; Jeff among them, for an &quot;undisclosed sum&quot; (according to Nick, the deal earned him a figure around 6 million USD). SOE's &quot;The Station&quot; announces the acquisition of Infantry. Infantry is still being run on GFN's last remaining online server, which for some reason someone (whoever the box hosting was bought from) forgot to take down - that is, until late October, at which point it is brought down and the long coma begins.</p> <h3 id="chapter-4">Chapter #4</h3> <p>Come November 2000, Infantry goes back online at SOE. Not long after, Cosmic Rift begins development, the SubSpace clone which in April 2001 is announced publicly. Jeff becomes more and more absent until finally disappearing from Infantry and its development altogether (we later learn that he was pulled away by Rod's steering and moved onto EQ projects and SWG). SOE partially &quot;fires&quot; Jeremy, only to rehire him later. Then in April 2002 the hellfire spits brimstone: Infantry is going pay-to-play, and the chain of broken promises and of EQ customer support personnel being assigned as the game's executive producers begins. A lot of discontent and grief arises from the player base. Some people begin to run private, limited-ability servers from the beta era. Infantry players who had access to beta software, Gravitron among them, outraged by the injustice done to Jeff and the game, by the betrayal by Rod, and unable to stand SOE's continued abuse, mistreatment, and lies, make a statement by gathering all available material (namely the beta client, beta server, and editing tools) and releasing it to the public and whoever desires it (despite the predictable effect of anger and alienation from Jeff). Rod plummets into the depths of EQ; Jeff disappears off the radar. Everyone else continues with their separate lives, employment, and projects.</p> <h3 id="chapter-5">Chapter #5</h3> <p>A supplement on SubSpace's well-being.</p> <p>About post-VIE SS: a Norwegian named Robert Oslo, alias BaudChaser, approached a Finnish ISP called Inet. Cutting a lot of events (and shit) short, he, along with the one known as Xalimar (an Exodus/C&amp;W employee) whose real name eludes me, became the two carriers of the SS torch, as they arranged for the hosting of the game's zones. BaudChaser formed the SubSpace Council and, for as long as he stayed around up to his departure, took care to keep SS going and battled a lot of cheating, staff troubles (abuse), and grief. Eventually Inet stopped hosting SS, and now Xalimar alone carries the burden, for the most part, of hosting the core SS zones. Priit Kasesalu, who had apparently been playing the game, started working for the current chief in power of the SSC and ex-Vangel AML league sysop, Alex Zinner (AKA Ghost Ship), hacking the server software and eventually creating his own client by reverse engineering the original SS, possibly having some sort of access to the source.</p> <p>About the &quot;SubSpace 2&quot; rumors of mid 2003: the owners of BDE wished to create the perfect peer-to-peer network (Altnet.com), and they needed a flagship product to prove to investors that their way was just and right. For that, they contacted a company called Horizon, which specialized in P2P technology. 
Horizon was creating a P2P technology called Horizon's DEx; later on, Horizon renamed itself SilverPlatter and its technology Alloy. Somewhere around 2002-2003 they were supposed to use BDE's Altnet at an E3 show to present the manifestation of this technology - Space Commander, presumably SubSpace remade anew and used as the first massive multiplayer online game over peer-to-peer. However, SilverPlatter eventually went bankrupt, for some reason, and nothing was known before or since about BDE's attempts at using the SubSpace properties which they owned, aside from this single E3 presentation.</p> <h3 id="chapter-6">Chapter #6</h3> <p>Additional update:</p> <p>(wow, I must put this through a grammatical correctional application) Somewhere along 2004-2005, Rod quit SOE as Executive Producer/VP of production (SOE seeming nowadays like a leaking boat about to drown) and joined Maxis to head up Sims projects. In October 2005, in a series of lay-offs, Jeremy Weeks (yankee) is fired from SOE, apparently permanently this time; any shred of hope (not much to begin with) Infantry had is now diminished to next to null. Jeff is still assumed to be working at SOE. Juan surfaces at Pandemic Studios, working for LucasArts on Battlefront I &amp; II (and has a website, www.allinjuan.com). Somewhere later that year or in the beginning of 2006, a high-ranking moderator-player known as Mar snaps in the face of the continued abuse/neglect by the owners and, in an anti-SOE move, releases the latest editor tools; his efforts are quickly quashed, however, and it is unknown if anyone got their hands on the software; he is of course stripped of his status and subsequently banned from the game. In February 2006, Rod is tracked down and grants his point of view, having been accused of not lending assistance to Infantry while he was a games development exec at SOE and clearly in a position to help:</p> <hr /> <p>You know what? You are probably right. At the time I was focused entirely on the big EQ issues which the entire company's survival hinged on. In retrospect Infantry could have been turned into a bigger product than it was by extra resources (although I will say it got more than other titles of similar sub bases). Somewhat ironically, now I am completely fatigued by graphical MUDs; games like Infantry are interesting to me again. So yeah, I could have done some more at the time. Hopefully a lesson learned. Anyways I hope that serves by way of an honest explanation. I can imagine how frustrating it must have been as a player. All the best,</p> <p>Rod</p> <hr /> Glenn Henry interview glenn-henry-interview/ Wed, 09 Jun 2004 00:00:00 +0000 glenn-henry-interview/ <p>This is an archive of an interview from the now defunct linuxdevices.com. It originally lived at linuxdevices.com/articles/AT2656883479.html. It's archived here because I refer to it on this site and the original source is gone.</p> <h4 id="q1-can-you-give-us-a-short-history-of-centaur">Q1: Can you give us a short history of Centaur?</h4> <p>A1: The idea came to me and my co-founders in 1993. We were working at Dell -- I was Senior Vice President in charge of products. At that time, we were paying Intel I think $160 per processor. That was the lowest Intel price, and that was a special deal. So, it occurred to me that you could make a compatible part and sell it a lot lower. And that part, if not equally fast, would be fast enough for the masses of people.</p> <p>No one seemed interested in doing that. 
AMD was just starting in the x86 business at the time, and they were trying to compete head-on with Intel. So, in early 1994, I quit Dell, and three other people came with me. We spent a year working out of our homes trying to get funding to start a company to build low-cost, low-power, x86 chips that were differentiated from Intel but fully compatible with all the x86 software.</p> <p>Our theory at that time was sort of a &quot;build it, and they will come&quot; theory. We thought that if we could lower the price of the processor, it would stimulate not only low-cost PCs, but new applications we didn't know about in 1994.</p> <p>We found funding from an American semiconductor company called IDT, and started Centaur. Centaur has never been an independent company in one sense -- we were previously wholly owned by IDT, and now we're wholly owned by VIA. On the other hand, we're an independent company in the way we operate. We have our own culture, our own payroll, etc.</p> <p>We started officially on Apr. 1, 1995, the day the check came in the mail, an auspicious date. We shipped our first processor two years later, and then another a year and a half after that, in early-1999.</p> <p>IDT decided to sell us because they had no presence in x86 or the PC world -- there was no synergism there. So they publicly put us on sale, and VIA bought us in September of 1999. The marriage was perfect, because VIA produces all the other silicon that goes into a PC. They design boards, their sister and cousin companies produce boards, their other cousin company makes all the other little low-cost parts for a PC -- all that was missing, from a hardware point of view, was the processor.</p> <p>In fact, since you're LinuxDevices, I'll make a comment. When I was going around selling this argument, I would point out that the price of everything in a PC but two things was going down drastically, and therefore there's this huge opportunity to move &quot;PC processing&quot; into new dimensions. But the two things that weren't going down were reducing the opportunity. And those two things were the Intel processor, and the Microsoft software.</p> <p>When we started, we had no answer for what to do about the Microsoft software. We just attacked the Intel processor part of it. But in the meantime, along came Linux. Our percentage of Linux -- I suspect, although I don't have the numbers to give you -- is much higher than other peoples' percentage of Linux, just because of the characteristics of our part.</p> <p>VIA also had that vision of low-cost, low-power system platforms, so it was a good marriage, because we had the secret ingredient that was missing. As long as you have to buy a processor from Intel, you're obviously restricted in how low a price or small a form factor you can have.</p> <h4 id="q2-so-currently-the-relationship-to-via-is-wholly-owned-subsidiary">Q2: So, currently the relationship to VIA is &quot;wholly owned subsidiary?&quot;</h4> <p>A2: Yes, in one sense. We're very independent, on a day-to-day basis. They don't send people here. My titular boss is WenChi Chen, the head of VIA, but I talk to him once a month by phone, and it's usually on strategic things. Day-to-day, month-to-month, we operate independently. We have our own payroll, own culture, etc. On the other hand, in terms of product strategy for the future, and practical issues like manufacturing and product support, we work very closely with them. 
In one sense we're an integrated division, and in one sense we're a contract processor design firm.</p> <h4 id="q3-how-many-employees-do-you-have-now">Q3: How many employees do you have now?</h4> <p>A3: We have roughly 82. That was one of our selling themes when I started this. But, it was a catch-22. To get people even vaguely interested, we had to have a low-cost theme. To have a low-cost theme, you have to have a very lean design cost, too. But on the other hand, when I told people, &quot;Well, there's four of us in our kitchens, and with another 20 or 30 people we could build an x86 processor,&quot; no one would believe that.</p> <h4 id="q4-yeah-how-is-that-possible">Q4: Yeah, how is that possible?</h4> <p>A4: It's made possible by two things. One is sort of a product focus. The other is the culture, how we operate. Let me talk about the product focus first.</p> <p>Intel designs processors -- and so does AMD -- number one, to be the world's fastest, and number two, to cover the whole world. The same basic processor is packaged in the morning as a server processor, and in the afternoon as a mobile processor, etc., etc. Not quite true, but... They sort of cover the world of workstations, servers, high-cost desktops, and mobile units. And AMD tried to do that. But they're also trying to be the world's fastest.</p> <p>The idea I had, which actually was hard for people to accept could be successful, was, &quot;Let's not try to make the world's fastest. Let's look at all the characteristics. Speed is one, cost is another, power consumption is another, footprint is another. And let's make something that wins on footprint size, cost, and power, and is fast enough to get the job done for 90 percent of the people but is not the fastest thing in the world.&quot;</p> <p>The last 10 percent of performance is a huge cost. And not just the hardware side, but also in design complexity. So, in fairness, our parts are slower than Intel's. On the other hand, our parts are fast enough to get the job done for most people. Other than the marketing disadvantage of having slower parts, our parts perform quite well. And they have much lower power, and a much smaller footprint, and they cost much less. Those characteristics appeal to a number of applications.</p> <p>So, that's our theme. The marketing guys don't like me to say &quot;fast enough&quot; performance, but you know, we're not as fast, head-to-head, as Intel is. But, we are fast enough to do 90 percent of the applications that are done, using a processor. Maybe even more than that. And we have very low power -- much lower than Intel or AMD -- and a really small footprint size -- much smaller than Intel or AMD.</p> <p>The small footprint only appeals to some people, but for that class of people, i.e., in the embedded space, that characteristic is important. And, of course, the cost is very, very good. I can't give you actual numbers, but our parts sell in the neighborhood of $30 to $40 for a 1GHz part.</p> <p>There's a second secret ingredient. I had 28 years of management before I started this. And, I had the luxury of starting a company with a clean sheet of paper, with three other guys who had worked for me for a long time. Our original owner, IDT, sent money and left us alone. So, we have created a culture that I think is the best in the world for doing this.</p> <p>Our engineers are extremely experienced, and very, very productive. One engineer here does the work of many, many others in a big company. 
I won't go through all the details, because they're not technical, but basically, we started with the theory, &quot;We're going to do this with 20 or 30 designers.&quot; Remember the 82 people I mentioned? Of those, only 35 are actual designers. The rest are testers, and things like that.</p> <p>So, we said, we were going to do it, with that many people. As I said before, we had this idea to constrain the complexity of the hardware. We hired just the right people, and gave them just the right environment to work in. We bought the best tools, and developed our own tools when the best tools weren't available, etc., etc.</p> <p>So I have a long story there, but the punchline is that we were able to hire extremely experienced and good people, and keep those people. We just passed our nine year anniversary, and the key people who started the company in the first year are almost all still here.</p> <p>So, this is the secret, actually: all the things I do are underlying things to allow us to hire the right people and keep them motivated and happy, and not leaving.</p> <p>Our general claim is, &quot;This is a company designed by me to be the kind of place I wanted to work in as an engineer.&quot; It doesn't fit everyone, but it fits certain people, and for those people, it's probably the best environment in the world.</p> <h4 id="q5-bravo-that-sounds-like-a-great-way-to-create-a-company">Q5: Bravo! That sounds like a great way to create a company.</h4> <p>A5: We were very lucky. I found two people who personally believed in that. This is a very hard story to sell. You got to go back to 1994, when I was travelling from company to company, and my story was, &quot;Lookit, there's four of us in our kitchens in Austin, I want you to give us $15 million dollars, I want you to leave me alone for two years, and then I'll deliver an x86 processor.&quot; Right?</p> <p>That's a very hard story to sell. We were lucky to find at the time, the CEO of IDT, a person by the name of Len Perham, who, personally believed the story. Since he was the CEO, Len was able to get others to believe in it. With VIA, the person we found that believed that was WenChi Chen, who's the CEO of VIA. Those people were both visionaries in their own right, and understood the importance of having an x86 processor.</p> <p>My basic argument for why a person would want to do this was simple. It's clear that the processor is the black hole, and that all silicon is going to fall into it at some point. Integration is inevitable, in our business. Those who can do a processor can control their system destiny, and those who don't will end up totally at the mercy of other people, who can shut them out of business right away.</p> <p>And as an example, when IDT bought us, they were making a lot of money on SRAMs that went into caches on PC motherboards. Two years later, there were no SRAMs on PC motherboards, because Intel put them on the die. That's going to happen to some of the chips today. All the other chips are gonna disappear at some point, and all that's left is the big chip, with the processor in the middle. You have to own that processor technology, or it won't be your chip.</p> <p>So that's my basic sell job. It reached two people, but they were the right people.</p> <h4 id="q6-can-you-give-a-real-quick-summary-of-the-history-of-the-processors">Q6: Can you give a real quick summary of the history of the processors?</h4> <p>A6: Our first processor we shipped was called the WinChip C6. &quot;WinChip&quot; was a branding name that IDT had. 
It was 200MHz, but it had MMX. It was a Pentium with MMX. We shipped two or three more straightforward derivatives of that, added 3DNow, put an on-chip cache on it, then further enhanced the speed, and that's where we were when VIA bought us.</p> <p>With VIA, we've shipped several major designs.</p> <p>The first one we call internally C5A. There are three different names... it's very confusing. When I talk to people, I usually end up using our internal names. I used those in my talk at the Microprocessor Forum. VIA also uses another codename for a class of products that covers several of our products, names like Samuel, Nehemiah -- they're Bible names. And then there's the way the product is sold.</p> <p>The first part we sold for VIA had a Pentium III bus on it, and was around 600MHz. Since VIA bought us, four and a half years ago, we have shipped four different variations. Each one is faster; each one has more functions, is more capable; each one is relatively lower power. The top Megahertz goes up, but the watts per Megahertz always goes down. They're all the same cost.</p> <p>The product we're shipping now, the C5P, has a top speed of 1.4 to 1.5GHz today, but the sweet spot is 1GHz. We have a fanless version at 1GHz. We also sell all the way down to 533 or even 400MHz, for low-power applications.</p> <p>To give you an idea about the 1GHz version we're selling today, the worst case power -- not &quot;typical&quot; or &quot;average&quot; power, which other people talk about -- our worst case power is 7 watts, which is low enough to do fanless at 1GHz [story], and no one else can do that.</p> <p>Second, we also sell that same part at 533. Its worst case power is 2.5 watts. So, remember, I'm talking worst-case power. Typical power, which a lot of people quote, is about half of the worst-case power. So, if we want to play games, we could say it's a 1 to 1.5 watt part at 533MHz, and it's a 3-watt part at 1GHz.</p> <p>Along the way, we've used four different technologies. All our technologies since we were bought by VIA have been with TSMC [Taiwan Semiconductor Manufacturing Company]. We've used four major technologies, with, obviously, sub-technologies. So we've shipped four major designs, with two or three minor variations, but they weren't radically different. That's in four years.</p> <p>We design products very quickly. That was also part of my theme: be lower cost than everybody else, and be able to move faster than everybody else. The things that make you able to do things with a small group also allow you to do things quickly. Actually, the more people you have, the slower things go: more communication, more decision-making, etc.</p> <p>By the way, I stole this idea from Michael Dell. Quick anecdote: I was an IBM Fellow, and I managed very large groups -- hundreds of people -- at IBM. And I went to Dell originally to be their first Vice President of R&amp;D. This was in 1988. So, I get to Dell, and find that the R&amp;D department is six or seven guys that work directly for Michael! And Michael says, &quot;Your job is to compete with Compaq.&quot; And I say, &quot;Well, how can you do that with six or seven guys?&quot; And he says, &quot;That's the secret. We'll always be lower-cost, and we'll move quicker than they are.&quot; And of course, that's worked out very well at Dell.</p> <p>We put out a lot of products, in a short period of time, which is actually a major competitive advantage. 
To give you one minor example, early this year, Intel started shipping a new processor called the Prescott. It's a variation of the Pentium 4. And it had new instructions in it, basically, that are called SSE3. We got our first part in late January. Those instructions are already in the next processor, that we've taped out to IBM.</p> <h4 id="q7-that-s-the-c5j">Q7: That's the C5J?</h4> <p>A7: Yes. That's what I talked about [at the recent Embedded Processor Forum] out in California. I said there's four major processor types that we've shipped in four and a half years, but I take it back. There's five. The C5A, C5B, C5C, C5XL, and C5P. Five major designs that we've shipped in four and a half years. And the sixth is the C5J. It's headed for IBM.</p> <h4 id="q8-is-tsmc-still-your-normal-fab">Q8: Is TSMC still your normal fab?</h4> <p>A8:The old designs will still sell for quite a while. The new design is going to IBM. So we'll have both. We'll be shipping partial IBM and partial TSMC for quite a while. And we may go back again to TSMC in the future. This is normal business. We haven't burned any bridges. At least, we don't think we have. VIA does a tremendous amount of business with TSMC, and has a very close relation.</p> <p>As technology advances, no one remains the best in the world across all technology versions. Typically, in 0.13 [micron process], this person's better than this person. Then you go to 0.09, and it may be different.</p> <h4 id="q9-in-terms-of-competition-between-embedded-x86-processors-with-amd-and-intel-where-does-via-stand-for-example-in-comparison-with-the-amd-nx-line">Q9: In terms of competition between embedded x86 processors, with AMD and Intel, where does VIA stand? For example, in comparison with the AMD NX line?</h4> <p>A9: All I know is what's in the specs. [AMD's NX] looks suspiciously like a 32-bit Athlon...</p> <h4 id="q10-they-did-tell-us-it-was-a-tweaked-athlon-story">Q10: They did tell us it was a tweaked Athlon. [story]</h4> <p>A10: Right. And their power is reduced over their normal Athlon numbers, but it is still higher than ours. Let's be fair about it. If you wanted to build the fastest thing in the world, you'd choose the AMD part, or the Intel Pentium 4. However, we beat them on power -- both of them -- we beat them on cost, and we beat them on footprint [comparison chart at right, click to enlarge]. So, if what you want to build is a set-top box, or what you want to build is sort of a classic PC, or what you want to build is a thin client terminal, or what you want to build is a Webpad, our processor's performance is adequate and we win easily on cost, power, and footprint.</p> <h4 id="q11-how-do-you-position-yourself-relative-to-the-gx-the-old-geode-stuff">Q11: How do you position yourself relative to the GX, the old Geode stuff?</h4> <p>A11: Well, we blow it away in performance. I mean, it's a 400MHz part. It is a two-chip solution, and we are a three-chip solution today. I don't really know its specs, but power is probably close when you get down to the same Megahertz. It stops at 400MHz, while we start at 400MHz and go up to 1.5GHz. From 400 on up, we're faster, and our power is good. If 100MHz would do you, or 200MHz -- whatever their low point is, which I don't know exactly -- if that's good enough for you, then they will have lower power, because we don't go below 400MHz.</p> <p>We think they're squeezed into a really narrow niche in the world, because their performance is so low. 
Anything from 400MHz on up, in the power that goes with that, we win.</p> <p>So here's how we look at the world:</p> <p>From 400MHz to 1GHz, there's nothing but us, that's competitive.</p> <p>From 1GHz to 1.5GHz, we'll compete with the low-end of Intel and AMD.</p> <p>From 1.5GHz on up today, if that's the speed you want, then you choose AMD or Intel.</p> <p>My opinion is down at 400MHz and below, the GX has a very narrow slice. They're competing with things that have even better power than they do and good prices, the classical non-x86 embedded processors based on ARM and MIPS. And right on top of them, there's us. We have better performance and equal power at our low end, which is their high end. And, we stretch on up until we run into the bottom end of AMD and Intel.</p> <p>We think that our area, the 500MHz to 1.5GHz range, represents a potentially massive opportunity. What we're seeing is interesting. About 50 percent of our sales are going into non-PCs. We're not taking sales away from other people, as much as we are enabling new applications -- things that have historically not been done at all, because it either was too expensive, or the power was too high, or the software cost was too great.</p> <p>You know the VIA mini-ITX boards? [story] That is one of the smartest moves.</p> <p>All of our engineers like to play. You know, there are robots roaming around here; people have built things like that? One guy built a rocket controller last week -- you know, normal engineering work. What our engineers do is what I tell people to do: buy a mini-ITX board, and add your value-add with frames, cover, and software.</p> <p>It really is a major improvement over the classic thing where you had to do your own hardware. That board is so cheap: $70 to $150 down at Fry's, depending upon the variation. And it has all the I/O in the world. You can get 18 different versions of it, etc., you know the story. You can customize it using software, and you have a wide variety of operating systems to choose from, ranging from the world's most powerful, most expensive, to things that are free and very good.</p> <h4 id="q12-what-about-nano-itx-story-and-even-smaller-opportunities-like-pc-104-and-system-on-modules-there-are-standards-from-sbc-companies-like-kontron-and-20-other-companies-you-ought-to-have-a-devcon-like-arm-and-intel-do-to-enable-all-these-guys-do-you-do-anything-special-to-enable-them">Q12: What about nano-ITX [story] and even smaller opportunities, like PC/104 and System-on-Modules. There are standards from SBC companies like Kontron and 20 other companies. You ought to have a DevCon, like ARM and Intel do, to enable all these guys. Do you do anything special to enable them?</h4> <p>A12: I'm not sure of all that is done. VIA does make sample designs, board schematics, etc., available, and works with many customers on their unique solutions. And, of course, the mini-ITX does stimulate lots of designs.</p> <p>We have other customers, that are doing their own unique designs, and some of them we work with to make things small. Right now, we are a three-chip solution. We're working on reducing that to a two-chip solution. By three-chip solution, I mean processor, northbridge, and southbridge to make a complete system. That'll be reduced to a two-chip solution, at some point, which reduces the footprint even more.</p> <p>We've made a major improvement with the new package. You've seen that little teeny package we're shipping today, that we call the nano-BGA? 
[story] It's the size of a penny-type-thing. That's a real breakthrough on processor size. We need to get the other two chips boiled down to go much further.</p> <p>People are doing other designs. For example, one of the things VIA's touting -- I don't know much about the details of it -- there's a group that's doing this handheld x86 gaming platform? [story] It sits in your hand like a Gameboy, but it's got a full x86 platform under it. And the theory is, you can run PC games on it. It has a custom motherboard in it, using our processor.</p> <h4 id="q13-when-you-mentioned-reducing-from-three-chips-to-two-chips-that-reminded-me-of-transmeta-what-would-you-say-about-transmeta-as-an-x86-competitor">Q13: When you mentioned reducing from three chips to two chips, that reminded me of Transmeta. What would you say about Transmeta as an x86 competitor?</h4> <p>A13: Well, Transmeta has good power. They're really not any better than us, but we're better than the other guys. So I'd say, yeah, equal in power. They have okay footprint. They have two chips, but their two chips are big. I had a chart in the fall Microprocessor Forum showing that our three chips were the same size as their two chips. So there's not a big difference in footprint size. The difference is threefold.</p> <p>One is that they're very costly. In their last quarterly earnings conference call, they orally quoted an ASP [average selling price] of, I think, $70, which is ridiculously high. You notice they're losing mass amounts of money -- $20 million a quarter -- so they need to try and keep prices high.</p> <p>Two is the fact that our performance on real-world applications is much better than Transmeta's. They do well at simple repetitive loops, because their architecture is designed to do well there. But, for real applications with big working sets, rambling threads of control, etc., we beat them badly. For example, in office applications such as Word, Excel, etc. -- the bread and butter of what the real world does.</p> <p>The third is sort of subtle, and people miss it. Their platform is very restricted. Their two-chip solution only talks to two graphics chips that exist, in the world, right? And you have to choose. And if you don't like the memory controller they have, let's say you want X, etc. -- they only support certain memories, etc., etc.</p> <p>We have a processor that talks to 15 different northbridges, each of which talks to 17 different southbridges. You want four network things? Fine, you got four network things. You want an integrated graphics panel, AGP... fine. You can configure all of those things using the normal parts. And, what we've found in the embedded world is that one size definitely does not fit all. That's why VIA itself has done eight variations of the mini-ITX board, and keeps doing more, and other vendors have made lots of other variations. Some have four serial ports, some have one; some have four networks, some have none, and so forth and so on. The flexibility offered by these standard parts in the PC world is, I think, a significant advantage.</p> <h4 id="q14-you-are-chipset-compatible-with-intel-parts">Q14: You are chipset compatible with Intel parts?</h4> <p>A14: Oh, yes. We do most of our testing with Intel parts, just to make sure. We can drop into Pentium III motherboards, and do it all the time. In the embedded world, we sell primarily with VIA parts. 
They usually sell our processor either with a bag of parts, or built into the VIA mini-ITX motherboards, and now the nano-ITX [pictured at right, click to enlarge]. The other half of our business is in the world of low-cost PCs. There, it's whatever the board designer or OEM chose to use.</p> <p>That's always been part of our secret. Basically, my selling strategy for that part of the world [using low-end PCs] is, hey, &quot;Take your design, take your screwdriver, pry up that Pentium III, plug us in, and you'll save XX dollars.&quot; That isn't a very appealing thing in the United States. But if you go to the rest of the world, where people don't have as many dollars as we do, and where there isn't as much PC penetration, that's a reasonably appealing story. We've actually sold millions of parts, with that strategy.</p> <h4 id="q15-can-you-explain-your-packaging-variations">Q15: Can you explain your packaging variations?</h4> <p>A15: The same die goes into multiple packages, and goes into different versions of the product. The C3, that's a desktop part that has the highest power of our parts. Antaur, that's the mobile version. It has fancy, very sophisticated dynamic power management enabled. Eden is the embedded part. All Edens are fanless. They run low enough power to be fanless, at whatever speed they are. So, one of our dies ends up in three branding bins: the C3, the Antaur, or the Eden. When you buy an Eden, 1GHz, there are sub-variations that may not be obvious that actually distinguish whether it was a C5P or a C5XL. But both are branded as Eden 1GHz. So that's the branding strategy.</p> <p>Cutting across that, there are three packages. One is the ceramic pin grid array, which is compatible with Pentium III. The PC market usually has a socketed processor, so we have a package that's compatible with that. We also have a cheaper ball grid array version of that, which is a little smaller, and it's cheaper because it's not socketed. And then we have the nanoBGA, which is that little teeny 15 by 15 mm thing; and that's an expensive package, so we charge a little bit more for that. But for people where space is a concern, that's appropriate. That small size is only offered in Eden, because that small size is only appealing to embedded guys. The standard PC market in general doesn't care about small size. When they design a motherboard, they generally design it big enough to handle anybody's processors -- AMD's, Intel's, or ours. We fit easily into that world.</p> <h4 id="q16-our-reader-surveys-story-show-arm-overtaking-x86-in-new-embedded-project-starts-though-we-can-t-identify-the-mhz-range-of-those-projects-can-you-comment-on-the-competition-from-the-higher-ends-of-xscale-and-arm-and-mips">Q16. Our reader surveys [story] show ARM overtaking x86 in new embedded project starts -- though we can't identify the MHz range of those projects. Can you comment on the competition from the higher ends of XScale and ARM and MIPS?</h4> <p>A16: I think there's a class of applications where an x86 is clearly going to win, a class where it's clearly going to lose, and that always leaves a middle.</p> <p>One where it's going to lose is where your power budget, or your size budget, are really, really small. For example, the PDA, which has got an ARM in it. An x86 processor, even ours or Transmeta's, just consumes too much power. 
There isn't a size problem, probably, with our design -- there would be with Intel or AMD, where the chips just won't fit in that package [see size comparison photo at right]. That type of application, you're going to choose an ARM, or a MIPS.</p> <p>Where I think x86 wins easily, in my experience, is where, number one, the battery size versus runtime expectation is hours -- right? -- not days, not weeks. Or, there's a power supply. Set-top boxes, for example. They all want fanless parts, so they want really low power, but they have a power supply. They're not running on batteries, obviously.</p> <p>The other characteristic that I think really shows the advantage of x86, is where an application has a user interface of any sophistication, or where the application is doing something particularly sophisticated. Obviously, you can run Windows CE, and even Linux, on something the size of a PDA. But those things are very limited. I have a 400MHz Toshiba PDA, the 800, the latest one out. The 400MHz ARM is the fastest you can buy. I installed an aviation application on it last week, and it runs really slow. I wish I had 800MHz, or 1GHz.</p> <p>If you want very low power, there's really no choice. If you're not worried too much about the power, then there are two factors. One is how fast you need to run, and the other is the complexity of the software you need to run. Speed and software complexity favor x86.</p> <p>We're not going to take over the printer world, or the carburetor control world. What we are doing is pushing the size and power envelope down every year, so x86 is able to reach smaller sizes and lower power every year but still maintain the high ground.</p> <p>I made a joke when I attended the Embedded Processor Forum... I always talk at the fall Forum, you know, the Microprocessor Forum? And there, I'm always the slowest thing. I'm put in the group with 3GHz this, etc. But at this Embedded Forum, at 1GHz, we were far and away the fastest thing there, bar none.</p> <p>There is definitely a major performance gap. We may be slower than a Pentium 4, for example, but we're still a lot faster than the MIPS and ARM chips.</p> <h4 id="q17-except-a-geode-might-have-the-distinction-of-being-speed-competitive-with-the-arm-chips">Q17: Except a Geode might have the distinction of being speed-competitive with the ARM chips?</h4> <p>A17: I'll bet you a 400MHz ARM is a lot faster than a 400MHz Geode. There's an age-old penalty of x86. Plus, Geode's a three-year-old design. AMD hasn't changed it a bit. It doesn't have SSE, it's still using 3DNow, etc., etc. It doesn't have a number of other modern instructions that you'd like it to have. The ARM 400s are going to have lower power and faster performance. I don't actually know the prices in those markets, so I can't comment on that.</p> <h4 id="q18-we-d-like-to-get-you-to-talk-a-little-bit-more-about-linux-and-the-importance-of-linux-in-terms-of-centaur-s-success-and-how-the-fit-is">Q18: We'd like to get you to talk a little bit more about Linux and the importance of Linux in terms of Centaur's success, and how the fit is.</h4> <p>A18: I think there are three things. First of all, our theme all along is to be able to produce the lowest-cost PC platform that there is. When I was going around selling this, I was talking about the sub-$1,000 PC, which is now a joke. There are PCs being sold with us in them that are sub-$200. So, let's look at a $200 PC, and right next to it is a $100 off-the-shelf version of Windows XP.
The presence of a low-cost operating system is substantially more important at the low end of the low-price hardware market, which is our focus. I think it's a very big deal that we have a low-cost operating system to go with a sub-$200 PC.</p> <p>The second thing is the embedded space. In embedded -- I'm using that as a very broad term -- one of the characteristics is customization. People are building applications that do a particular thing well, be it a set-top box, or a rocket controller, or whatever it may be. Customization for hardware is probably easier to do on Linux than it is on Windows. If it's an application that needs to have close control of hardware, needs to run very fast, needs to be lean, etc., then a lot of people are going to want to do it on Linux.</p> <p>The third one, is one you haven't asked me about, this is actually my pet hobby, here -- we've added these fully sophisticated and very powerful security instructions into the...</p> <h4 id="q19-that-was-my-last-question">Q19: That was my last question!</h4> <p>A19: So the classic question is, hey, you built some hardware, who's going to use it? Well, the answer is, six months after we first started shipping our product with encryption in it [story], we have three or four operating systems, including Linux, OpenBSD, and FreeBSD, directly supporting our security features in the kernel.</p> <p>Getting support that quickly can't happen in the Microsoft world. Maybe they'll support it someday, maybe they won't. Quite honestly, if you want to build it, and hope that someone will come, you've got to count on something like the free software world. Free software makes it very easy for people to add functionality. You've got extremely talented, motivated people in the free software world who, if they think it's right to do it, will do it. That was my strategy with security.</p> <p>We didn't have to justify it, because it's my hobby, so we did it. But, it would have been hard to justify these new hardware things without a software plan. My theory was simple: if we do it, and we do it right, it will appeal to the really knowledgeable security guys, most of whom live in the free software world. And those guys, if they like it, and see it's right, then they will support it. And they have the wherewithal to support it, because of the way open software works.</p> <p>So those are my three themes, ignoring the fourth one, that's obvious: that without competition, Windows would cost even more. To summarize, for our business, [Linux is] important because it allows us to build lower-cost PC platforms, it allows people to build new, more sophisticated embedded applications more easily, and it allows us, without any software costs, to add new features that we think are important to the world.</p> <p>Our next processor -- I haven't ever told anyone, so I won't say what it is -- but our next processor has even more things in it that I think will be just as quickly adopted by the open source software world, and provide even more value.</p> <p>It's always bothered me that hardware can do so many things relatively easily and fast that aren't done today because there's no software to support it. We just decided to try to break the mold. We were going to do hardware that, literally, had no software support at the start. And now the software is there, in several variations, and people are starting to use it.
I actually think that's only going to happen in the open source world.</p> <h4 id="q20-we-d-like-a-few-words-from-you-about-your-security-strategy-how-you-ve-been-putting-security-in-the-chips-and-so-on">Q20: We'd like a few words from you about your security strategy, how you've been putting security in the chips, and so on.</h4> <p>A20: Securing one's information and data is sort of fundamental to human needs -- it's certainly fundamental to business needs. With the current world, in which everyone's attached to the Internet -- with most people's machines having back-door holes in them, whether they know it or not -- and with all the wireless stuff going on, people's data, whether they know it or not, is relatively insecure.</p> <p>The people who know that are using secure operating systems, and they're encrypting their data. Encrypting data has been around for a long time. We believe, though, that this should be a pervasive thing that should appear on all platforms, and should be built into all things.</p> <p>It turns out, though, that security features are all computationally intensive. That's what they do. They take the bits and grind them up using computations, in a way that makes it hard to un-grind them.</p> <p>So, we said, they're a perfect candidate for hardware. They're well-defined, they're not very big, they run much faster in hardware than in software -- 10 to 30 times, in the examples we use. And, they are so fundamental that we should add the basic primitives to our processor.</p> <p>How did we know what to add? We added government standards. The U.S. government has done extensive work on standardizing encryption protocols, secure digital signature protocols, and secure hash protocols. We used the most modern of government standards, built the basic functions into our chip, and did it in such a way that made it very easy for software to use.</p> <p>Every time you send an email, every time you send a file to someone, that data should be encrypted. It's going out on the Internet, where anyone with half a brain can steal it.</p> <p>Second, if you really care about not letting people have access to certain data that's on your hard drive, it ought to be encrypted, because half the PCs these days have some, I don't know what the right word is, some &quot;spy&quot; built into them, through a virus or worm, that can steal data and pass it back. You'll never get that prevented through operating system upgrades.</p> <p>I do have some background, sort of, in security: it's always been my hobby. The fundamental assumption you should make is, assume that someone else can look at what you're looking at. In other words, don't try to protect your data by assuming that no one's going to come steal your hard drive, or no one can snoop through a backdoor in Windows. You protect your data by saying, &quot;Even if they can see the data, what good is it going to do them?&quot;</p> <p>We think this is going to be a pervasive need. The common, if-you-will, person's awareness of worms and viruses has gone up a million percent in the last few years, based on all the problems. The awareness of the need to protect data is going to go up substantially, too.</p> <p>We're doing more than encryption, though. There's another need, which is coming, related to message authentication and digital signatures.</p> <p>We're encrypting all the time. Every time you buy something over the Web, your order is encrypted. So there is encryption going on already.
But the next major thing -- and this is already done in the high-security circles of banks -- is message authentication through digital signatures. How do you know someone didn't intercept that order, and they're sending in their own orders using your credit card number? How do you know, when you get a message from somebody, that they didn't substitute the word &quot;yes&quot; for &quot;no,&quot; things like that? These are very important in the world of security. They're well understood in the government world, or the high-security world, and there are government standards on how you do these things. They are called secure hashes, and things like that. So we've added features for those.</p> <p>To summarize, the things we've added fall into three categories. One is a good hardware random number generator. That was actually the first thing, and it's one of the hardest things to do. It sounds trivial, but it's actually very hard to generate randomness with any kind of process. It needs to be done in hardware. Software cannot generate random numbers that pass the tests that the government and others define.</p> <p>The second thing we did is a significant speedup in the two basic forms of encryption. One's called symmetric key encryption, and the government standard is AES, which is a follow-on to a thing called DES. So we do AES encryption very fast. The other form of encryption that's widely used is public key encryption, and the most common form there is a thing called RSA. That's what's being used, you know, for secure Web transactions. We think we're the only people who've done this: we added instructions, in our new processor that's coming, to speed up RSA.</p> <p>The third thing we've done is added what's called a secure hash algorithm. Again, it's a government standard. It's used for message authentication and digital signatures. It deals with the issue: if you send me an email, how do I know that the email I got was the one you sent? That it wasn't intercepted and changed? And more fundamentally, how do I know that it actually came from you? Anyone can put their name, in our world, on that email. Things like that. So there's got to be some code in that email that I can look at, and know that only you could have sent it. I can explain this more if you want to know.</p> <h4 id="q21-that-s-probably-sufficient-we-re-looking-more-for-the-strategy">Q21: That's probably sufficient. We're looking more for the strategy.</h4> <p>A21: Okay, let me back up. Our strategy was, assuming that we believe that security is fundamental and ought to be there, to define the primitive operations that need to be done as the building blocks of security. Those we put into hardware. We're not trying to impose a particular, I don't know, protocol or use. We're just making available the tools. We're doing it for free. The tools are in the processors, at no extra price. They don't require any OS support, no kernel support, no device drivers. It's getting into the kernels of BSD and Linux, but applications can directly use the features [even without kernel support], and the hardware takes care of the multitasking aspects.</p> <p>The two guys who worked on it with me are both heavy Linux users. They wrote to friends in the security and Linux communities. Very little marketing money was spent.</p> <p>When the security press release went out, at the Embedded Processor Forum, it had three key quotes, real quotes. Not quotes written by PR managers. My quote was written by a PR manager, but the others weren't. All three were big names in the security world, and all were saying good stuff.</p>
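<p>To make the three categories above concrete, here's a minimal sketch of what the corresponding operations look like from ordinary application code. It uses Python's standard library plus the third-party <code>cryptography</code> package (a recent version is assumed); the key sizes, messages, and variable names are made up for illustration and have nothing to do with VIA's actual hardware interface. The point is only that these are small, well-defined primitives that a library or kernel can route to hardware without the application changing.</p> <pre><code># Illustrative sketch only: generic library calls that mirror the three
# primitive categories described above. Key sizes, nonces, and messages
# are arbitrary examples, not anything specific to the VIA hardware.
import hashlib
import hmac
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# 1. Random numbers: ask the OS, which may in turn draw on a hardware
#    generator if one is present.
aes_key = AESGCM.generate_key(bit_length=128)
nonce = os.urandom(12)

# 2a. Symmetric (AES) encryption of a message.
message = b'order 1234: yes, ship it'
ciphertext = AESGCM(aes_key).encrypt(nonce, message, None)
assert AESGCM(aes_key).decrypt(nonce, ciphertext, None) == message

# 2b. Public-key (RSA) signing and verification, the kind of operation
#     an RSA speedup instruction would accelerate.
rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
signature = rsa_key.sign(message, pss, hashes.SHA256())
rsa_key.public_key().verify(signature, message, pss, hashes.SHA256())

# 3. Secure hash plus a keyed tag for message authentication: the
#    receiver can detect a substituted 'yes' or 'no'.
mac_key = os.urandom(32)
tag = hmac.new(mac_key, message, hashlib.sha256).digest()
received_tag = hmac.new(mac_key, message, hashlib.sha256).digest()
assert hmac.compare_digest(tag, received_tag)</code></pre> <p>On a machine where the operating system or crypto library is wired up to hardware primitives, calls like these can be accelerated transparently, which is roughly the adoption path described in the answer above.</p>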
<h4 id="q22-beyond-security-are-other-cool-features-planned">Q22: Beyond security, are other cool features planned?</h4> <p>A22: The next chip has some tools to do computationally intensive things where hardware provides a big advantage. But I don't want to say yet what they are.</p> <h4 id="q23-would-they-be-useful-for-multimedia">Q23: Would they be useful for multimedia?</h4> <p>A23: Yes, for multimedia, and for other things.</p> <h4 id="q24-like-a-dsp">Q24: Like a DSP?</h4> <p>A24: Kind of like that.</p> <h4 id="q25-okay-we-won-t-push-we-appreciate-you-taking-the-time-to-speak-with-us-we-can-t-imagine-getting-the-president-of-amd-or-intel-to-do-this">Q25: Okay, we won't push. We appreciate you taking the time to speak with us. We can't imagine getting the president of AMD or Intel to do this.</h4> <p>A25: Our whole strategy is so close to, if you will, the fate of Linux. We identify so much with it. We're low-cost, aimed at the common person, we're aimed at new applications, and we don't have any massive PR or marketing or sales budget, so. Actually, I have a special softness in my heart for Linux. I think without Linux our business would be much less than what it is today. It's just very important to us, so, I wanted to give you guys the time.</p> <h4 id="other-interviews">Other Interviews</h4> <p>If you liked this interview, you <a href="http://conservancy.umn.edu/handle/11299/120173">might also like this interview with Glenn on IBM history</a>.</p>